Compare commits

..

22 Commits

Author SHA1 Message Date
Piotr Wilkin (ilintar) 1430c35948 common/parser: gracefully handle undetected tool parser, print error message. (#20286) 2026-03-13 20:56:10 +01:00
ZeroV0LT f17b3be63f llama : fix pooling assertion crash in chunked GDN detection path (#20468)
* llama : fix pooling assertion crash in chunked GDN detection path

The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).

Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.

Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.

* server : add mean pooling tests to embedding test suite

Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.

These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.

---------

Co-authored-by: Domenico Crupi <domenico@zerovolt.it>
2026-03-13 20:53:42 +02:00
SoftwareRenderer d7ba99c485 server: reset counter related to kill-switch on client error (#20513)
* server: reset kill-switch on client error

This avoids triggering a server kill switch.

If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated.

However since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates.

* moved counter reset as per recommendation

* cont : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-13 19:58:09 +02:00
rehan-10xengineer fbaa95bc29 ggml-cpu: add RVV vec dot kernels for quantization types (#18859)
* ggml-cpu: add rvv quantize_row_q8_K kernel

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add rvv vec_dot for iq4_nl, mxfp4, iq2_xxs

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add rvv vec_dot for iq4_xs, refactor

* ggml-cpu: remove ifunc for rvv vec dot

* ggml-cpu: add vec_dot for iq2_xs, iq3_xxs

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: refactor quants.c

---------

Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai>
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
Co-authored-by: Rehan Qasim <rehanbhatti0317@gmail.com>
2026-03-13 17:36:04 +02:00
Adrien Gallouët b5e1212063 ggml : fix typo gmml (#20512)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-13 14:36:13 +01:00
Daniel Bevenius 8f974d2392 mtmd : rename mtmd_get_audio_bitrate to mtmd_get_audio_sample_rate (#20105)
This commit renames the the function `mtmd_get_audio_bitrate` to
`mtmd_get_audio_sample_rate` to better reflect its purpose.

The motivation for this is that the function currently returns the audio
sample rate, not the bitrate (sample_rate × bit_depth × channels), and
that is how it is used in the code as well.

This is a breaking change, but I believe mtmd is still in
experimental/development phase so it might be alright to simply rename.
2026-03-13 12:30:02 +01:00
Piotr Wilkin (ilintar) 2948e6049a general: CONTRIBUTING.md - guidelines for quantization schemes (#19762)
* Guidelines for quantization schemes

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Change required precision from Q8 to FP16/BF16

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update CONTRIBUTING.md [no ci]

* Update CONTRIBUTING.md [no ci]

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-13 12:21:33 +01:00
Georgi Gerganov 73c9eb8ced metal : fix l2 norm scale (#20493) 2026-03-13 11:43:20 +02:00
Daniel Bevenius 983df142a9 convert : fix/suppress pyright errors (#20442)
* convert : fix/suppress pyright errors

This commit fixes the pyright errors that are generated by pyright for
convert_hf_to_gguf.py.

The motivation for this is that running this locally generates errors
that CI does not, and it can be difficult to spot new errors. One use
case is when working on new models which cannot be run in CI due to
privacy. Having the ability to run pyright locally is would be helpful
in this cases.

In the linked issue there is the mention of switching to `ty` which I
don't know anything about but in the meantime I would appreciate if we
could suppress these errors for now, and later perhaps revert this
commit.

With this change there are no errors but there are 4 informations
messages if the `mistral_common` package is installed. The
`--level error` flag can be used to suppress them.

Resolves: https://github.com/ggml-org/llama.cpp/issues/20417
2026-03-13 06:00:52 +01:00
Georgi Gerganov 57819b8d4b llama : disable graph reuse with pipeline parallelism (#20463) 2026-03-12 21:04:13 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO) 557fe2d913 vendor : update cpp-httplib to 0.37.1 (#20390) 2026-03-12 13:57:06 +01:00
Piotr Wilkin (ilintar) 0e810413bb tests : use reasoning instead of reasoning_budget in server tests (#20432) 2026-03-12 13:41:01 +01:00
Ruben Ortlam 128142fe7d test-backend-ops: allow loading tests from file and parsing model operators into file (#19896)
* tests: allow loading test-backend-ops tests from json

* add error threshold based on op

* add error when file cannot be read

* add graph operator json extraction tool

* add nb parameter for non-contiguous input tensors

* fix view check

* only use view if non-contiguous/permuted, use C++ random instead of rand()

* replace internal API calls with public llama_graph_reserve call

* reduce test description length

* fix nb[0] not getting set for view

* add name to tests

* fix inplace error

* use text file instead of json

* move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/

* fix missing declaration

* use pragma once

* fix indent

* fix Windows build
2026-03-12 13:26:00 +01:00
Daniel Bevenius 6de1bc631d common : update completion executables list [no ci] (#19934)
This commit updates the bash completion executables list, adding missing
executables and removing some that non longer exist.
2026-03-12 12:12:01 +01:00
Asbjørn Olling 0a10c34dc1 grammar: Fix grammar root symbol check (#19761)
* grammar: fix bad check for root symbol, correct error logging

* add tests to demonstrate root symbol check failure
2026-03-12 12:04:56 +01:00
ProgenyAlpha deee23863b vulkan: add GATED_DELTA_NET op support (#20334)
* vulkan: add GATED_DELTA_NET op support

Implements the fused gated delta net recurrence as a Vulkan compute
shader with full support for scalar gate, KDA vector gate, GQA
broadcast, multi-token sequences, and permuted (non-contiguous) q/k
inputs. Specialization constants select head size (32/64/128) and
KDA mode at pipeline creation time.

Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: optimize GATED_DELTA_NET shader (Phase 1)

- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for KDA path, eliminating ~32K
  redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops

KDA TG: +5.4% throughput. Non-KDA: no regressions.
13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: address review feedback for GATED_DELTA_NET

Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros,
scale in push constants, supports_op fix, dispatch restructuring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: add explicit FLOAT_TYPE casts for buffer loads

Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts
to ensure correct behavior across all Vulkan configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: fix Q/K broadcast for interleaved head layout

Adapt to the interleaved broadcast convention from #20340:
head_id / rq1 → head_id % neq1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 11:32:04 +01:00
Sigbjørn Skjæret c3e3f9e533 convert : better mtp check and fix return [no ci] (#20419) 2026-03-12 10:04:20 +01:00
ProgenyAlpha 40c550d4f6 vulkan: fix SSM_CONV PP scaling with large ubatch sizes (#20379)
* vulkan: optimize SSM_CONV workgroup dispatch for large ubatch

Tile tokens into 2D workgroups (32x16) to reduce workgroup launch
overhead at large ubatch sizes. Add vec4 fast path for nc=4 (common
d_conv size). Fixes PP performance degradation with ubatch > 512.

Ref: ggml-org/llama.cpp#18725

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: remove unused shared memory declaration in SSM_CONV

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 10:03:18 +01:00
Pascal de190154c8 New conversations now auto-select the first loaded model (#20403)
* webui: auto-select first loaded model for new conversations in router mode

* chore: update webui build output
2026-03-12 09:07:05 +01:00
Masashi Yoshimura 05039967da ggml-virtgpu: Fix some build commands (#20341) 2026-03-12 15:47:45 +08:00
Georgi Gerganov e4cff0956b metal : avoid divisions in bin kernel (#20426)
* metal : avoid modulus in bin kernel when not broadcasting

* metal : fix capture_started flag
2026-03-12 09:42:40 +02:00
Masato Nakasaka 4cc6eb158c ci: Setup self-hosted CI for Intel Linux Vulkan backend (#20154) 2026-03-12 06:43:22 +01:00
41 changed files with 2454 additions and 683 deletions
+16
View File
@@ -1727,6 +1727,22 @@ jobs:
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-x64-linux-intel-vulkan:
runs-on: [self-hosted, Linux, X64, Intel]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
with:
persist-credentials: false
- name: Test
id: ggml-ci
run: |
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-arm64-cpu-kleidiai:
runs-on: ubuntu-22.04-arm
+12 -7
View File
@@ -30,14 +30,19 @@ Before submitting your PR:
- Search for existing PRs to prevent duplicating efforts
- llama.cpp uses the ggml tensor library for model evaluation. If you are unfamiliar with ggml, consider taking a look at the [examples in the ggml repository](https://github.com/ggml-org/ggml/tree/master/examples/). [simple](https://github.com/ggml-org/ggml/tree/master/examples/simple) shows the bare minimum for using ggml. [gpt-2](https://github.com/ggml-org/ggml/tree/master/examples/gpt-2) has minimal implementations for language model inference using GPT-2. [mnist](https://github.com/ggml-org/ggml/tree/master/examples/mnist) demonstrates how to train and evaluate a simple image classifier
- Test your changes:
- Execute [the full CI locally on your machine](ci/README.md) before publishing
- Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
- If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
- If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
- Execute [the full CI locally on your machine](ci/README.md) before publishing
- Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
- If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
- If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
- Create separate PRs for each feature or fix:
- Avoid combining unrelated changes in a single PR
- For intricate features, consider opening a feature request first to discuss and align expectations
- When adding support for a new model or feature, focus on **CPU support only** in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs
- Avoid combining unrelated changes in a single PR
- For intricate features, consider opening a feature request first to discuss and align expectations
- When adding support for a new model or feature, focus on **CPU support only** in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs
- In particular, adding new data types (extension of the `ggml_type` enum) carries with it a disproportionate maintenance burden. As such, to add a new quantization type you will need to meet the following *additional* criteria *at minimum*:
- convert a small model to GGUF using the new type and upload it to HuggingFace
- provide [perplexity](https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity) comparisons to FP16/BF16 (whichever is the native precision) as well as to types of similar size
- provide KL divergence data calculated vs. the FP16/BF16 (whichever is the native precision) version for both the new type as well as types of similar size
- provide [performance data](https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench) for the new type in comparison to types of similar size on pure CPU
- Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
- If you are a new contributor, limit your open PRs to 1.
+11 -5
View File
@@ -732,23 +732,28 @@ static void common_params_print_completion(common_params_context & ctx_arg) {
"llama-completion",
"llama-convert-llama2c-to-ggml",
"llama-cvector-generator",
"llama-debug",
"llama-diffusion-cli",
"llama-embedding",
"llama-eval-callback",
"llama-export-lora",
"llama-finetune",
"llama-fit-params",
"llama-gemma3-cli",
"llama-gen-docs",
"llama-gguf",
"llama-gguf-hash",
"llama-gguf-split",
"llama-gritlm",
"llama-idle",
"llama-imatrix",
"llama-infill",
"llama-mtmd-cli",
"llama-llava-clip-quantize-cli",
"llama-llava-cli",
"llama-lookahead",
"llama-lookup",
"llama-lookup-create",
"llama-lookup-merge",
"llama-lookup-stats",
"llama-minicpmv-cli",
"llama-mtmd-cli",
"llama-parallel",
"llama-passkey",
"llama-perplexity",
@@ -2666,7 +2671,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, const std::string & value) {
params.out_file = value;
}
).set_examples({LLAMA_EXAMPLE_IMATRIX, LLAMA_EXAMPLE_CVECTOR_GENERATOR, LLAMA_EXAMPLE_EXPORT_LORA, LLAMA_EXAMPLE_TTS, LLAMA_EXAMPLE_FINETUNE, LLAMA_EXAMPLE_RESULTS}));
).set_examples({LLAMA_EXAMPLE_IMATRIX, LLAMA_EXAMPLE_CVECTOR_GENERATOR, LLAMA_EXAMPLE_EXPORT_LORA, LLAMA_EXAMPLE_TTS, LLAMA_EXAMPLE_FINETUNE,
LLAMA_EXAMPLE_RESULTS, LLAMA_EXAMPLE_EXPORT_GRAPH_OPS}));
add_opt(common_arg(
{"-ofreq", "--output-frequency"}, "N",
string_format("output the imatrix every N iterations (default: %d)", params.n_out_freq),
+5 -1
View File
@@ -3,6 +3,7 @@
#include "chat.h"
#include "common.h"
#include "json-schema-to-grammar.h"
#include "log.h"
#include "nlohmann/json.hpp"
#include <stdexcept>
@@ -182,7 +183,10 @@ common_peg_parser analyze_tools::build_parser(parser_build_context & ctx) const
case tool_format::TAG_WITH_TAGGED:
return build_tool_parser_tag_tagged(ctx);
default:
GGML_ABORT("Unable to create tool parser");
LOG_ERR("[ERROR] Template seems to support tool calls, but failed to determine tool format. Tool calling will not work properly. "
"Check for a fixed template for your model in the models/templates directory of your llama.cpp installation or "
"report an issue at https://github.com/ggml-org/llama.cpp/issues\n");
return ctx.p.eps();
}
}
+1
View File
@@ -105,6 +105,7 @@ enum llama_example {
LLAMA_EXAMPLE_FINETUNE,
LLAMA_EXAMPLE_FIT_PARAMS,
LLAMA_EXAMPLE_RESULTS,
LLAMA_EXAMPLE_EXPORT_GRAPH_OPS,
LLAMA_EXAMPLE_COUNT,
};
+12 -2
View File
@@ -2194,6 +2194,8 @@ class GPTNeoXModel(TextModel):
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
n_head = self.hparams.get("n_head", self.hparams.get("num_attention_heads"))
n_embed = self.hparams.get("hidden_size", self.hparams.get("n_embed"))
assert n_head is not None
assert n_embed is not None
if re.match(r"gpt_neox\.layers\.\d+\.attention\.query_key_value\.weight", name):
# Map bloom-style qkv_linear to gpt-style qkv_linear
@@ -2231,6 +2233,8 @@ class BloomModel(TextModel):
def set_gguf_parameters(self):
n_embed = self.hparams.get("hidden_size", self.hparams.get("n_embed"))
n_head = self.hparams.get("n_head", self.hparams.get("num_attention_heads"))
assert n_head is not None
assert n_embed is not None
self.gguf_writer.add_context_length(self.hparams.get("seq_length", n_embed))
self.gguf_writer.add_embedding_length(n_embed)
self.gguf_writer.add_feed_forward_length(4 * n_embed)
@@ -2243,6 +2247,8 @@ class BloomModel(TextModel):
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
n_head = self.hparams.get("n_head", self.hparams.get("num_attention_heads"))
n_embed = self.hparams.get("hidden_size", self.hparams.get("n_embed"))
assert n_head is not None
assert n_embed is not None
name = re.sub(r'transformer\.', '', name)
@@ -3853,6 +3859,7 @@ class LLaDAModel(TextModel):
if (rope_dim := hparams.get("head_dim")) is None:
n_heads = hparams.get("num_attention_heads", hparams.get("n_heads"))
assert n_heads is not None
rope_dim = hparams.get("hidden_size", hparams.get("d_model")) // n_heads
self.gguf_writer.add_rope_dimension_count(rope_dim)
@@ -3884,6 +3891,7 @@ class LLaDAModel(TextModel):
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
n_head = self.hparams.get("num_attention_heads", self.hparams.get("n_heads"))
assert n_head is not None
n_kv_head = self.hparams.get("num_key_value_heads", self.hparams.get("n_kv_heads"))
if self.undo_permute:
@@ -9485,7 +9493,9 @@ class ChatGLMModel(TextModel):
def set_gguf_parameters(self):
n_embed = self.hparams.get("hidden_size", self.hparams.get("n_embed"))
assert n_embed is not None
n_head = self.hparams.get("n_head", self.hparams.get("num_attention_heads"))
assert n_head is not None
n_head_kv = self.hparams.get("multi_query_group_num", self.hparams.get("num_key_value_heads", n_head))
self.gguf_writer.add_context_length(self.hparams.get("seq_length", n_embed))
self.gguf_writer.add_embedding_length(n_embed)
@@ -10092,9 +10102,9 @@ class NemotronHModel(GraniteHybridModel):
# Skip Multi-Token Prediction (MTP) tensors. These are used for
# for speculative decoding but we don't include them in this model
# conversion. See https://github.com/ggml-org/llama.cpp/pull/18886
if "mtp" in name:
if name.startswith("mtp."):
logger.info(f"gguf: Skipping MTP (Speculative) layer: {name}")
return []
return
if name.endswith("mixer.gate.e_score_correction_bias"):
new_name = name.replace("e_score_correction_bias", "e_score_correction.bias")
+4 -2
View File
@@ -55,7 +55,8 @@ LLAMA_MAC_BUILD=$PWD/build/ggml-virtgpu-backend
cmake -S . -B $LLAMA_MAC_BUILD \
-DGGML_NATIVE=OFF \
-DLLAMA_CURL=ON \
-DGGML_REMOTINGBACKEND=ONLY \
-DGGML_VIRTGPU=ON \
-DGGML_VIRTGPU_BACKEND=ONLY \
-DGGML_METAL=ON
TARGETS="ggml-metal"
@@ -71,6 +72,7 @@ cmake --build $LLAMA_MAC_BUILD --parallel 8 --target $EXTRA_TARGETS
```bash
# Build virglrenderer with APIR support
mkdir virglrenderer
cd virglrenderer
git clone https://gitlab.freedesktop.org/kpouget/virglrenderer -b main-macos src
cd src
@@ -95,7 +97,7 @@ mkdir llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git src
cd src
LLAMA_LINUX_BUILD=$PWD//build-virtgpu
LLAMA_LINUX_BUILD=$PWD/build-virtgpu
cmake -S . -B $LLAMA_LINUX_BUILD \
-DGGML_VIRTGPU=ON
+1 -1
View File
@@ -253,7 +253,7 @@ option(GGML_OPENCL_PROFILING "ggml: use OpenCL profiling (increas
option(GGML_OPENCL_EMBED_KERNELS "ggml: embed kernels" ON)
option(GGML_OPENCL_USE_ADRENO_KERNELS "ggml: use optimized kernels for Adreno" ON)
set (GGML_OPENCL_TARGET_VERSION "300" CACHE STRING
"gmml: OpenCL API version to target")
"ggml: OpenCL API version to target")
option(GGML_HEXAGON "ggml: enable Hexagon backend" OFF)
set(GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE 128 CACHE STRING "ggml: quantize group size (32, 64, or 128)")
+5 -9
View File
@@ -1455,10 +1455,6 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
int split_backend_id = split->backend_id;
ggml_backend_t split_backend = sched->backends[split_backend_id];
if (sched->events[split_backend_id][sched->cur_copy] == NULL) {
ggml_backend_synchronize(split_backend);
}
// copy the input tensors to the split backend
for (int input_id = 0; input_id < split->n_inputs; input_id++) {
ggml_backend_t input_backend = ggml_backend_sched_get_tensor_backend(sched, split->inputs[input_id]);
@@ -1469,12 +1465,16 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
// inputs from the user must be copied immediately to prevent the user overwriting the data before the copy is done
if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
ggml_backend_event_synchronize(sched->events[split_backend_id][sched->cur_copy]);
} else {
ggml_backend_synchronize(split_backend);
}
ggml_backend_tensor_copy_async(input_backend, split_backend, input, input_cpy);
ggml_backend_tensor_copy(input, input_cpy);
} else {
// wait for the split backend to finish using the input before overwriting it
if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
ggml_backend_event_wait(split_backend, sched->events[split_backend_id][sched->cur_copy]);
} else {
ggml_backend_synchronize(split_backend);
}
// when offloading MoE weights, we can reduce the amount of data copied by copying only the experts that are used
@@ -1578,10 +1578,6 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
}
}
if (sched->events[split_backend_id][sched->cur_copy] == NULL) {
ggml_backend_synchronize(split_backend);
}
if (!sched->callback_eval) {
enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
if (ec != GGML_STATUS_SUCCESS) {
-7
View File
@@ -199,13 +199,6 @@
#define ggml_gemm_q8_0_4x8_q8_0_generic ggml_gemm_q8_0_4x8_q8_0
#elif defined(__riscv)
// quants.c
#define quantize_row_q8_K_generic quantize_row_q8_K
#define ggml_vec_dot_iq2_xxs_q8_K_generic ggml_vec_dot_iq2_xxs_q8_K
#define ggml_vec_dot_iq2_xs_q8_K_generic ggml_vec_dot_iq2_xs_q8_K
#define ggml_vec_dot_iq3_xxs_q8_K_generic ggml_vec_dot_iq3_xxs_q8_K
#define ggml_vec_dot_iq4_nl_q8_0_generic ggml_vec_dot_iq4_nl_q8_0
#define ggml_vec_dot_iq4_xs_q8_K_generic ggml_vec_dot_iq4_xs_q8_K
#define ggml_vec_dot_mxfp4_q8_0_generic ggml_vec_dot_mxfp4_q8_0
#define ggml_vec_dot_nvfp4_q8_0_generic ggml_vec_dot_nvfp4_q8_0
// repack.cpp
#define ggml_quantize_mat_q8_0_4x1_generic ggml_quantize_mat_q8_0_4x1
File diff suppressed because it is too large Load Diff
+1 -1
View File
@@ -9624,7 +9624,7 @@ void ggml_compute_forward_win_unpart(
}
}
//gmml_compute_forward_unary
//ggml_compute_forward_unary
void ggml_compute_forward_unary(
const ggml_compute_params * params,
+4 -10
View File
@@ -2823,14 +2823,11 @@ static bool ggml_backend_cuda_cpy_tensor_async(ggml_backend_t backend_src, ggml_
ggml_backend_buffer_t buf_src = src->view_src ? src->view_src->buffer : src->buffer;
ggml_backend_buffer_t buf_dst = dst->view_src ? dst->view_src->buffer : dst->buffer;
//enables async copies from CPU to CUDA, instead of only CUDA-to-CUDA
bool copy_from_host = ggml_backend_buffer_is_host(buf_src) && ggml_backend_dev_type(backend_src->device) == GGML_BACKEND_DEVICE_TYPE_CPU;
if (!(copy_from_host || ggml_backend_is_cuda(backend_src)) || !ggml_backend_is_cuda(backend_dst)) {
if (!ggml_backend_is_cuda(backend_src) || !ggml_backend_is_cuda(backend_dst)) {
return false;
}
if (!(copy_from_host || ggml_backend_buffer_is_cuda(buf_src)) || !ggml_backend_buffer_is_cuda(dst->buffer)) {
if (!ggml_backend_buffer_is_cuda(src->buffer) || !ggml_backend_buffer_is_cuda(dst->buffer)) {
return false;
}
@@ -2841,17 +2838,14 @@ static bool ggml_backend_cuda_cpy_tensor_async(ggml_backend_t backend_src, ggml_
ggml_backend_cuda_buffer_context * buf_ctx_src = (ggml_backend_cuda_buffer_context *)buf_src->context;
ggml_backend_cuda_buffer_context * buf_ctx_dst = (ggml_backend_cuda_buffer_context *)buf_dst->context;
if ((copy_from_host && cuda_ctx_dst->device != buf_ctx_dst->device) ||
!copy_from_host && (cuda_ctx_src->device != buf_ctx_src->device || cuda_ctx_dst->device != buf_ctx_dst->device)) {
if (cuda_ctx_src->device != buf_ctx_src->device || cuda_ctx_dst->device != buf_ctx_dst->device) {
#ifndef NDEBUG
GGML_LOG_DEBUG("%s: backend and buffer devices do not match\n", __func__);
#endif
return false;
}
if (copy_from_host) {
CUDA_CHECK(cudaMemcpyAsync(dst->data, src->data, ggml_nbytes(dst), cudaMemcpyHostToDevice, cuda_ctx_dst->stream()));
} else if (backend_src != backend_dst) {
if (backend_src != backend_dst) {
// copy on src stream
if (cuda_ctx_src->device == cuda_ctx_dst->device) {
CUDA_CHECK(cudaMemcpyAsync(dst->data, src->data, ggml_nbytes(dst), cudaMemcpyDeviceToDevice, cuda_ctx_src->stream()));
+3 -1
View File
@@ -554,7 +554,7 @@ enum ggml_status ggml_metal_graph_compute(ggml_metal_t ctx, struct ggml_cgraph *
// enter here only when capturing in order to wait for all computation to finish
// otherwise, we leave the graph to compute asynchronously
if (!use_capture && ctx->capture_started) {
if (use_capture && ctx->capture_started) {
// wait for completion and check status of each command buffer
// needed to detect if the device ran out-of-memory for example (#1881)
{
@@ -606,6 +606,8 @@ enum ggml_status ggml_metal_graph_compute(ggml_metal_t ctx, struct ggml_cgraph *
[ctx->capture_scope endScope];
[[MTLCaptureManager sharedCaptureManager] stopCapture];
ctx->capture_started = false;
}
}
+3 -1
View File
@@ -1470,10 +1470,11 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_bin(ggml_metal_l
const bool is_c4 = (op->src[0]->ne[0] % 4 == 0) && (op->src[1]->ne[0] % 4 == 0);
const bool is_cb = op->src[0]->ne[0] != op->src[1]->ne[0];
const bool is_rb = ggml_is_contiguous(op->src[0]) && ggml_is_contiguous(op->src[1]) && (ggml_nrows(op->src[1]) == 1) && ggml_nelements(op) < 65536;
snprintf(base, 256, "kernel_bin_fuse_%s_%s_%s%s", t0_str, t1_str, t_str, is_c4 ? "_4" : "");
snprintf(name, 256, "%s_op=%d_nf=%d_rb=%d", base, op_num, n_fuse, is_rb);
snprintf(name, 256, "%s_op=%d_nf=%d_rb=%d_cb=%d", base, op_num, n_fuse, is_rb, is_cb);
ggml_metal_pipeline_with_params res = ggml_metal_library_get_pipeline(lib, name);
if (!res.pipeline) {
@@ -1482,6 +1483,7 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_bin(ggml_metal_l
ggml_metal_cv_set_int16(cv, op_num, FC_BIN + 0);
ggml_metal_cv_set_int16(cv, n_fuse, FC_BIN + 1);
ggml_metal_cv_set_bool (cv, is_rb, FC_BIN + 2);
ggml_metal_cv_set_bool (cv, is_cb, FC_BIN + 3);
res = ggml_metal_library_compile_pipeline(lib, base, name, cv);
+1 -1
View File
@@ -1156,7 +1156,7 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
case GGML_OP_RWKV_WKV7:
return true;
case GGML_OP_GATED_DELTA_NET:
return op->src[2]->ne[0] % 32 == 0;
return has_simdgroup_reduction && op->src[2]->ne[0] % 32 == 0;
case GGML_OP_SOLVE_TRI:
case GGML_OP_MUL_MAT:
case GGML_OP_MUL_MAT_ID:
+1 -3
View File
@@ -3180,9 +3180,7 @@ int ggml_metal_op_bin(ggml_metal_op_t ctx, int idx) {
ggml_metal_encoder_set_buffer (enc, bid_dst, 3);
if (pipeline.cnt) {
const int n = pipeline.c4 ? ggml_nelements(op)/4 : ggml_nelements(op);
ggml_metal_encoder_dispatch_threadgroups(enc, n, 1, 1, 1, 1, 1);
ggml_metal_encoder_dispatch_threadgroups(enc, args.ne0, ggml_nrows(op), 1, 1, 1, 1);
} else {
const int nth_max = MIN(256, ggml_metal_pipeline_max_theads_per_threadgroup(pipeline));
+8 -5
View File
@@ -1111,6 +1111,7 @@ template [[host_name("kernel_unary_f16_f16_4")]] kernel kernel_unary_t kernel_un
constant short FC_bin_op [[function_constant(FC_BIN + 0)]];
constant short FC_bin_f [[function_constant(FC_BIN + 1)]];
constant bool FC_bin_rb [[function_constant(FC_BIN + 2)]];
constant bool FC_bin_cb [[function_constant(FC_BIN + 3)]];
template <typename T0, typename T1, typename T>
kernel void kernel_bin_fuse_impl(
@@ -1124,11 +1125,12 @@ kernel void kernel_bin_fuse_impl(
#define FC_OP FC_bin_op
#define FC_F FC_bin_f
#define FC_RB FC_bin_rb
#define FC_CB FC_bin_cb
if (FC_RB) {
// row broadcast
const uint i0 = tgpig.x;
const uint i1 = i0%args.ne10;
const uint i0 = tgpig.y*args.ne00 + tgpig.x;
const uint i1 = FC_CB ? tgpig.x%args.ne10 : tgpig.x;
device const T0 * src0_row = (device const T0 *) (src0);
device T * dst_row = (device T *) (dst);
@@ -1200,7 +1202,7 @@ kernel void kernel_bin_fuse_impl(
device const T1 * src1_ptr = (device const T1 *) (src1 + args.o1[0] + i13*args.nb13 + i12*args.nb12 + i11*args.nb11);
for (int i0 = tpitg.x; i0 < args.ne0; i0 += ntg.x) {
const int i10 = i0%args.ne10;
const int i10 = FC_CB ? i0%args.ne10 : i0;
if (FC_OP == 0) {
dst_ptr[i0] = src0_ptr[i0] + src1_ptr[i10];
@@ -1225,7 +1227,7 @@ kernel void kernel_bin_fuse_impl(
}
for (int i0 = tpitg.x; i0 < args.ne0; i0 += ntg.x) {
const int i10 = i0%args.ne10;
const int i10 = FC_CB ? i0%args.ne10 : i0;
T res = src0_ptr[i0];
@@ -1261,6 +1263,7 @@ kernel void kernel_bin_fuse_impl(
#undef FC_OP
#undef FC_F
#undef FC_RB
#undef FC_CB
}
typedef decltype(kernel_bin_fuse_impl<float, float, float>) kernel_bin_fuse_t;
@@ -3003,7 +3006,7 @@ kernel void kernel_l2_norm_impl(
sumf = shmem_f32[tiisg];
sumf = simd_sum(sumf);
const float scale = 1.0f/sqrt(max(sumf, args.eps));
const float scale = 1.0f/max(sqrt(sumf), args.eps);
for (int i00 = tpitg.x; i00 < args.ne00; i00 += ntg.x) {
y[i00] = x[i00] * scale;
+120 -1
View File
@@ -825,6 +825,8 @@ struct vk_device_struct {
vk_pipeline pipeline_pool2d_f32;
vk_pipeline pipeline_rwkv_wkv6_f32;
vk_pipeline pipeline_rwkv_wkv7_f32;
// [size_idx][kda] where size_idx: 0=d32, 1=d64, 2=d128
vk_pipeline pipeline_gated_delta_net[3][2];
vk_pipeline pipeline_ssm_scan_f32_d128;
vk_pipeline pipeline_ssm_scan_f32_d256;
vk_pipeline pipeline_ssm_conv_f32;
@@ -1454,6 +1456,18 @@ struct vk_op_rwkv_wkv7_push_constants {
uint32_t C;
uint32_t H;
};
struct vk_op_gated_delta_net_push_constants {
uint32_t H;
uint32_t n_tokens;
uint32_t n_seqs;
uint32_t s_off;
uint32_t sq1, sq2, sq3;
uint32_t sv1, sv2, sv3;
uint32_t sb1, sb2, sb3;
uint32_t neq1, rq3;
float scale;
};
struct vk_op_ssm_scan_push_constants {
uint32_t nb02, nb03, nb12, nb13;
uint32_t nb21, nb22, nb31;
@@ -4568,6 +4582,23 @@ static void ggml_vk_load_shaders(vk_device& device) {
ggml_vk_create_pipeline(device, device->pipeline_rwkv_wkv7_f32, "rwkv_wkv7_f32", rwkv_wkv7_f32_len, rwkv_wkv7_f32_data, "main", 8, sizeof(vk_op_rwkv_wkv7_push_constants), {1, 1, 1}, {device->subgroup_size}, 1);
{
const uint32_t gdn_sizes[] = {32, 64, 128};
const char * gdn_names[][2] = {
{"gated_delta_net_f32_d32", "gated_delta_net_f32_d32_kda"},
{"gated_delta_net_f32_d64", "gated_delta_net_f32_d64_kda"},
{"gated_delta_net_f32_d128", "gated_delta_net_f32_d128_kda"},
};
for (uint32_t si = 0; si < 3; si++) {
for (uint32_t kda = 0; kda < 2; kda++) {
ggml_vk_create_pipeline(device, device->pipeline_gated_delta_net[si][kda],
gdn_names[si][kda], gated_delta_net_f32_len, gated_delta_net_f32_data,
"main", 7, sizeof(vk_op_gated_delta_net_push_constants),
{1, 1, 1}, {gdn_sizes[si], kda}, 1);
}
}
}
if (device->subgroup_arithmetic && device->subgroup_require_full_support) {
ggml_vk_create_pipeline(device, device->pipeline_ssm_scan_f32_d128, "ssm_scan_128_f32", ssm_scan_subgroup_f32_len, ssm_scan_subgroup_f32_data, "main", 8, sizeof(vk_op_ssm_scan_push_constants), {1, 1, 1}, {128, device->subgroup_size}, 1, true, true);
ggml_vk_create_pipeline(device, device->pipeline_ssm_scan_f32_d256, "ssm_scan_256_f32", ssm_scan_subgroup_f32_len, ssm_scan_subgroup_f32_data, "main", 8, sizeof(vk_op_ssm_scan_push_constants), {1, 1, 1}, {256, device->subgroup_size}, 1, true, true);
@@ -4576,7 +4607,7 @@ static void ggml_vk_load_shaders(vk_device& device) {
ggml_vk_create_pipeline(device, device->pipeline_ssm_scan_f32_d256, "ssm_scan_256_f32", ssm_scan_f32_len, ssm_scan_f32_data, "main", 8, sizeof(vk_op_ssm_scan_push_constants), {1, 1, 1}, {256, device->subgroup_size, 16}, 1, true, true);
}
ggml_vk_create_pipeline(device, device->pipeline_ssm_conv_f32, "ssm_conv_f32", ssm_conv_f32_len, ssm_conv_f32_data, "main", 3, sizeof(vk_op_ssm_conv_push_constants), {32, 1, 1}, {32}, 1);
ggml_vk_create_pipeline(device, device->pipeline_ssm_conv_f32, "ssm_conv_f32", ssm_conv_f32_len, ssm_conv_f32_data, "main", 3, sizeof(vk_op_ssm_conv_push_constants), {32, 16, 1}, {32, 16}, 1);
ggml_vk_create_pipeline(device, device->pipeline_opt_step_adamw_f32, "opt_step_adamw_f32", opt_step_adamw_f32_len, opt_step_adamw_f32_data, "main", 5, sizeof(vk_op_push_constants), {512, 1, 1}, {}, 1);
@@ -9498,6 +9529,20 @@ static vk_pipeline ggml_vk_op_get_pipeline(ggml_backend_vk_context * ctx, const
return ctx->device->pipeline_rwkv_wkv7_f32;
}
return nullptr;
case GGML_OP_GATED_DELTA_NET:
if (src0->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32) {
const uint32_t S_v = dst->src[2]->ne[0];
const uint32_t kda = (dst->src[3]->ne[0] == (int64_t)S_v) ? 1 : 0;
uint32_t si;
switch (S_v) {
case 32: si = 0; break;
case 64: si = 1; break;
case 128: si = 2; break;
default: return nullptr;
}
return ctx->device->pipeline_gated_delta_net[si][kda];
}
return nullptr;
case GGML_OP_SSM_SCAN:
if (src0->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32) {
const uint32_t d_state = src0->ne[0];
@@ -10328,6 +10373,59 @@ static void ggml_vk_rwkv_wkv7(ggml_backend_vk_context * ctx, vk_context& subctx,
);
}
static void ggml_vk_gated_delta_net(ggml_backend_vk_context * ctx, vk_context& subctx, ggml_tensor * dst) {
const ggml_tensor * src_q = dst->src[0];
const ggml_tensor * src_v = dst->src[2];
const ggml_tensor * src_beta = dst->src[4];
GGML_ASSERT(dst->buffer != nullptr);
const uint32_t S_v = (uint32_t)src_v->ne[0];
const uint32_t H = (uint32_t)src_v->ne[1];
const uint32_t n_tokens = (uint32_t)src_v->ne[2];
const uint32_t n_seqs = (uint32_t)src_v->ne[3];
const uint32_t s_off = S_v * H * n_tokens * n_seqs;
vk_pipeline pipeline = ggml_vk_op_get_pipeline(ctx, dst->src[0], dst->src[1], dst->src[2], dst, dst->op);
GGML_ASSERT(pipeline != nullptr);
ggml_pipeline_request_descriptor_sets(ctx, pipeline, 1);
vk_subbuffer dst_buf = ggml_vk_tensor_subbuffer(ctx, dst);
vk_subbuffer src_buf[6] = {};
for (int i = 0; i < 6; i++) {
src_buf[i] = ggml_vk_tensor_subbuffer(ctx, dst->src[i]);
}
const uint32_t sq1 = (uint32_t)(src_q->nb[1] / sizeof(float));
const uint32_t sq2 = (uint32_t)(src_q->nb[2] / sizeof(float));
const uint32_t sq3 = (uint32_t)(src_q->nb[3] / sizeof(float));
const uint32_t sv1 = (uint32_t)(src_v->nb[1] / sizeof(float));
const uint32_t sv2 = (uint32_t)(src_v->nb[2] / sizeof(float));
const uint32_t sv3 = (uint32_t)(src_v->nb[3] / sizeof(float));
const uint32_t sb1 = (uint32_t)(src_beta->nb[1] / sizeof(float));
const uint32_t sb2 = (uint32_t)(src_beta->nb[2] / sizeof(float));
const uint32_t sb3 = (uint32_t)(src_beta->nb[3] / sizeof(float));
const uint32_t neq1 = (uint32_t)src_q->ne[1];
const uint32_t rq3 = (uint32_t)(src_v->ne[3] / src_q->ne[3]);
const float scale = 1.0f / sqrtf((float)S_v);
const vk_op_gated_delta_net_push_constants pc = {
H, n_tokens, n_seqs, s_off,
sq1, sq2, sq3,
sv1, sv2, sv3,
sb1, sb2, sb3,
neq1, rq3,
scale
};
ggml_vk_dispatch_pipeline(ctx, subctx, pipeline,
{src_buf[0], src_buf[1], src_buf[2], src_buf[3], src_buf[4], src_buf[5], dst_buf},
pc, { H, n_seqs, 1u });
}
static void ggml_vk_ssm_scan(ggml_backend_vk_context * ctx, vk_context& subctx, ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0];
const ggml_tensor * src1 = dst->src[1];
@@ -13044,6 +13142,11 @@ static bool ggml_vk_build_graph(ggml_backend_vk_context * ctx, ggml_cgraph * cgr
break;
case GGML_OP_GATED_DELTA_NET:
ggml_vk_gated_delta_net(ctx, compute_ctx, node);
break;
case GGML_OP_SSM_SCAN:
ggml_vk_ssm_scan(ctx, compute_ctx, node);
@@ -15455,6 +15558,19 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
case GGML_OP_RWKV_WKV6:
case GGML_OP_RWKV_WKV7:
return true; // all inputs are contiguous, see ggml.c
case GGML_OP_GATED_DELTA_NET:
{
const uint32_t S_v = op->src[2]->ne[0];
if (S_v != 32 && S_v != 64 && S_v != 128) {
return false;
}
for (int i = 0; i < 6; i++) {
if (op->src[i] == nullptr || op->src[i]->type != GGML_TYPE_F32) {
return false;
}
}
return op->type == GGML_TYPE_F32;
}
case GGML_OP_SSM_SCAN:
{
for (int i = 0; i < 6; i++) {
@@ -16332,6 +16448,9 @@ static void ggml_vk_check_results_0(ggml_backend_vk_context * ctx, ggml_cgraph *
} else if (tensor->op == GGML_OP_RWKV_WKV7) {
tensor_clone = ggml_rwkv_wkv7(ggml_ctx, src_clone[0], src_clone[1], src_clone[2], src_clone[3],
src_clone[4], src_clone[5], src_clone[6]);
} else if (tensor->op == GGML_OP_GATED_DELTA_NET) {
tensor_clone = ggml_gated_delta_net(ggml_ctx, src_clone[0], src_clone[1],
src_clone[2], src_clone[3], src_clone[4], src_clone[5]);
} else if (tensor->op == GGML_OP_OPT_STEP_ADAMW) {
src_clone[0]->flags = tensor->src[0]->flags;
tensor_clone = ggml_opt_step_adamw(ggml_ctx, src_clone[0], src_clone[1],
@@ -0,0 +1,128 @@
#version 450
#extension GL_EXT_control_flow_attributes : require
layout(constant_id = 0) const uint S_V = 128;
layout(constant_id = 1) const uint KDA = 0;
layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
layout(push_constant) uniform Parameters {
uint H;
uint n_tokens;
uint n_seqs;
uint s_off;
uint sq1, sq2, sq3;
uint sv1, sv2, sv3;
uint sb1, sb2, sb3;
uint neq1, rq3;
float scale;
};
layout(binding = 0) readonly buffer QBuf { FLOAT_TYPE data_q[]; };
layout(binding = 1) readonly buffer KBuf { FLOAT_TYPE data_k[]; };
layout(binding = 2) readonly buffer VBuf { FLOAT_TYPE data_v[]; };
layout(binding = 3) readonly buffer GBuf { FLOAT_TYPE data_g[]; };
layout(binding = 4) readonly buffer BetaBuf { FLOAT_TYPE data_beta[]; };
layout(binding = 5) readonly buffer StateBuf { FLOAT_TYPE data_state[]; };
layout(binding = 6) buffer DstBuf { FLOAT_TYPE data_dst[]; };
shared FLOAT_TYPE s_k[S_V];
shared FLOAT_TYPE s_q[S_V];
shared FLOAT_TYPE s_g[S_V]; // KDA only: cached exp(g[i])
void main() {
const uint head_id = gl_WorkGroupID.x;
const uint seq_id = gl_WorkGroupID.y;
const uint col = gl_LocalInvocationID.x;
const uint iq1 = head_id % neq1;
const uint iq3 = seq_id / rq3;
const uint state_size = S_V * S_V;
const uint state_base = (seq_id * H + head_id) * state_size;
FLOAT_TYPE state[S_V];
[[unroll]] for (uint i = 0; i < S_V; i++) {
state[i] = FLOAT_TYPE(data_state[state_base + i * S_V + col]);
}
uint attn_off = (seq_id * n_tokens * H + head_id) * S_V;
for (uint t = 0; t < n_tokens; t++) {
const uint q_off = iq3 * sq3 + t * sq2 + iq1 * sq1;
const uint k_off = q_off;
const uint v_off = seq_id * sv3 + t * sv2 + head_id * sv1;
s_q[col] = FLOAT_TYPE(data_q[q_off + col]);
s_k[col] = FLOAT_TYPE(data_k[k_off + col]);
const uint gb_off = seq_id * sb3 + t * sb2 + head_id * sb1;
if (KDA != 0) {
const uint g_base = gb_off * S_V;
s_g[col] = exp(FLOAT_TYPE(data_g[g_base + col]));
}
barrier();
const FLOAT_TYPE v_val = FLOAT_TYPE(data_v[v_off + col]);
const FLOAT_TYPE beta_val = FLOAT_TYPE(data_beta[gb_off]);
if (KDA == 0) {
const FLOAT_TYPE g_val = exp(FLOAT_TYPE(data_g[gb_off]));
FLOAT_TYPE kv_col = 0.0;
[[unroll]] for (uint i = 0; i < S_V; i += 4) {
kv_col += dot(
vec4(state[i], state[i+1], state[i+2], state[i+3]),
vec4(s_k[i], s_k[i+1], s_k[i+2], s_k[i+3])
);
}
FLOAT_TYPE delta_col = (v_val - g_val * kv_col) * beta_val;
FLOAT_TYPE attn_col = 0.0;
[[unroll]] for (uint i = 0; i < S_V; i += 4) {
vec4 sv = vec4(state[i], state[i+1], state[i+2], state[i+3]);
vec4 kv = vec4(s_k[i], s_k[i+1], s_k[i+2], s_k[i+3]);
sv = g_val * sv + kv * delta_col;
state[i] = sv.x; state[i+1] = sv.y; state[i+2] = sv.z; state[i+3] = sv.w;
attn_col += dot(sv, vec4(s_q[i], s_q[i+1], s_q[i+2], s_q[i+3]));
}
data_dst[attn_off + col] = attn_col * scale;
} else {
FLOAT_TYPE kv_col = 0.0;
[[unroll]] for (uint i = 0; i < S_V; i += 4) {
vec4 gv = vec4(s_g[i], s_g[i+1], s_g[i+2], s_g[i+3]);
vec4 sv = vec4(state[i], state[i+1], state[i+2], state[i+3]);
vec4 kv = vec4(s_k[i], s_k[i+1], s_k[i+2], s_k[i+3]);
kv_col += dot(gv * sv, kv);
}
FLOAT_TYPE delta_col = (v_val - kv_col) * beta_val;
FLOAT_TYPE attn_col = 0.0;
[[unroll]] for (uint i = 0; i < S_V; i += 4) {
vec4 gv = vec4(s_g[i], s_g[i+1], s_g[i+2], s_g[i+3]);
vec4 sv = vec4(state[i], state[i+1], state[i+2], state[i+3]);
vec4 kv = vec4(s_k[i], s_k[i+1], s_k[i+2], s_k[i+3]);
sv = gv * sv + kv * delta_col;
state[i] = sv.x; state[i+1] = sv.y; state[i+2] = sv.z; state[i+3] = sv.w;
attn_col += dot(sv, vec4(s_q[i], s_q[i+1], s_q[i+2], s_q[i+3]));
}
data_dst[attn_off + col] = attn_col * scale;
}
attn_off += S_V * H;
barrier();
}
[[unroll]] for (uint i = 0; i < S_V; i++) {
data_dst[s_off + state_base + i * S_V + col] = state[i];
}
}
@@ -5,8 +5,9 @@
#include "types.glsl"
layout(constant_id = 0) const uint BLOCK_SIZE = 32;
layout(constant_id = 1) const uint TOKENS_PER_WG = 16;
layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
layout(local_size_x_id = 0, local_size_y_id = 1, local_size_z = 1) in;
layout(binding = 0) readonly buffer Src0 { float src0[]; };
layout(binding = 1) readonly buffer Src1 { float src1[]; };
@@ -20,25 +21,30 @@ layout(push_constant) uniform PushConstants {
};
void main() {
const uint global_thread_id = gl_GlobalInvocationID.x;
const uint i2 = gl_WorkGroupID.y;
const uint i1 = gl_GlobalInvocationID.x;
const uint i2 = gl_WorkGroupID.y * TOKENS_PER_WG + gl_LocalInvocationID.y;
const uint i3 = gl_WorkGroupID.z;
if (global_thread_id >= nr || i2 >= n_t || i3 >= n_s) {
if (i1 >= nr || i2 >= n_t || i3 >= n_s) {
return;
}
const uint i1 = global_thread_id;
const uint src0_base = i3 * (nb02 / 4) + i2 + i1 * (nb01 / 4);
const uint src1_base = i1 * (nb11 / 4);
const uint dst_idx = i3 * (dst_nb2 / 4) + i2 * (dst_nb1 / 4) + i1;
float sum = 0.0;
[[unroll]] for (uint i0 = 0; i0 < nc; i0++) {
const uint src0_idx = src0_base + i0;
const uint src1_idx = src1_base + i0;
sum += src0[src0_idx] * src1[src1_idx];
if (nc == 4) {
sum = dot(
vec4(src0[src0_base], src0[src0_base + 1], src0[src0_base + 2], src0[src0_base + 3]),
vec4(src1[src1_base], src1[src1_base + 1], src1[src1_base + 2], src1[src1_base + 3])
);
} else {
[[unroll]] for (uint i0 = 0; i0 < nc; i0++) {
sum += src0[src0_base + i0] * src1[src1_base + i0];
}
}
const uint dst_idx = i3 * (dst_nb2 / 4) + i2 * (dst_nb1 / 4) + i1;
dst[dst_idx] = sum;
}
@@ -987,6 +987,8 @@ void process_shaders() {
string_to_spv("rwkv_wkv7_f32", "wkv7.comp", merge_maps(base_dict, {{"A_TYPE", "float"}}));
string_to_spv("gated_delta_net_f32", "gated_delta_net.comp", merge_maps(base_dict, {{"FLOAT_TYPE", "float"}}));
string_to_spv("opt_step_adamw_f32", "opt_step_adamw.comp", merge_maps(base_dict, {{"A_TYPE", "float"}}));
string_to_spv("opt_step_sgd_f32", "opt_step_sgd.comp", merge_maps(base_dict, {{"A_TYPE", "float"}}));
+1 -1
View File
@@ -5,7 +5,7 @@ import os
import sys
import subprocess
HTTPLIB_VERSION = "refs/tags/v0.37.0"
HTTPLIB_VERSION = "refs/tags/v0.37.1"
vendor = {
"https://github.com/nlohmann/json/releases/latest/download/json.hpp": "vendor/nlohmann/json.hpp",
+28 -1
View File
@@ -7,6 +7,7 @@
#include "llama-memory.h"
#include "llama-mmap.h"
#include "llama-model.h"
#include "llama-ext.h"
#include <cinttypes>
#include <cmath>
@@ -341,6 +342,14 @@ llama_context::llama_context(
if (cparams.pipeline_parallel) {
LLAMA_LOG_INFO("%s: pipeline parallelism enabled\n", __func__);
if (!graph_reuse_disable) {
// TODO: figure out a way to make graph reuse work with pipeline parallelism
// ref: https://github.com/ggml-org/llama.cpp/pull/20463
LLAMA_LOG_WARN("%s: graph reuse is currently not compatible with pipeline parallelism - disabling\n", __func__);
graph_reuse_disable = true;
}
}
sched_reserve();
@@ -503,7 +512,12 @@ void llama_context::sched_reserve() {
if (cparams.fused_gdn_ch) {
// more than one token in the batch per sequence in order to take the chunked path
auto * gf = graph_reserve(16*n_seqs, n_seqs, n_outputs, mctx.get(), true);
// note: n_outputs must match n_tokens for embedding models with mean/rank pooling,
// because build_pooling creates inp_mean with shape [n_tokens, n_seqs] and multiplies
// it with t_embd which is reduced to [n_outputs, ...] via out_ids. if n_outputs != n_tokens,
// the ggml_mul_mat assertion fails. this matches the pp reservation below (line ~553).
const uint32_t n_tokens_ch = 16*n_seqs;
auto * gf = graph_reserve(n_tokens_ch, n_seqs, n_tokens_ch, mctx.get(), true);
if (!gf) {
throw std::runtime_error("failed to reserve graph for fused Gated Delta Net check (chunked)");
}
@@ -3129,6 +3143,19 @@ uint32_t llama_get_sampled_probs_count_ith(llama_context * ctx, int32_t i) {
return static_cast<uint32_t>(ctx->get_sampled_probs_count(i));
}
struct ggml_cgraph * llama_graph_reserve(
struct llama_context * ctx,
uint32_t n_tokens,
uint32_t n_seqs,
uint32_t n_outputs) {
auto * memory = ctx->get_memory();
llama_memory_context_ptr mctx;
if (memory) {
mctx = memory->init_full();
}
return ctx->graph_reserve(n_tokens, n_seqs, n_outputs, mctx.get());
}
// llama adapter API
int32_t llama_set_adapters_lora(
+12
View File
@@ -0,0 +1,12 @@
#pragma once
#include "llama-context.h"
#include "ggml.h"
#include "stdint.h"
// Reserve a new compute graph. It is valid until the next call to llama_graph_reserve.
LLAMA_API struct ggml_cgraph * llama_graph_reserve(
struct llama_context * ctx,
uint32_t n_tokens,
uint32_t n_seqs,
uint32_t n_outputs);
+5 -5
View File
@@ -1160,13 +1160,13 @@ struct llama_grammar * llama_grammar_init_impl(
// if there is a grammar, parse it
// rules will be empty (default) if there are parse errors
if (!parser.parse(grammar_str) || parser.rules.empty()) {
fprintf(stderr, "%s: failed to parse grammar\n", __func__);
LLAMA_LOG_ERROR("failed to parse grammar\n");
return nullptr;
}
// Ensure that there is a "root" node.
if (parser.symbol_ids.find("root") == parser.symbol_ids.end()) {
fprintf(stderr, "%s: grammar does not contain a 'root' symbol\n", __func__);
// Ensure that the grammar contains the start symbol
if (parser.symbol_ids.find(grammar_root) == parser.symbol_ids.end()) {
LLAMA_LOG_ERROR("grammar does not contain a '%s' symbol\n", grammar_root);
return nullptr;
}
@@ -1195,7 +1195,7 @@ struct llama_grammar * llama_grammar_init_impl(
continue;
}
if (llama_grammar_detect_left_recursion(vec_rules, i, &rules_visited, &rules_in_progress, &rules_may_be_empty)) {
LLAMA_LOG_ERROR("unsupported grammar, left recursion detected for nonterminal at index %zu", i);
LLAMA_LOG_ERROR("unsupported grammar, left recursion detected for nonterminal at index %zu\n", i);
return nullptr;
}
}
+3 -1
View File
@@ -260,6 +260,7 @@ endif()
set(LLAMA_TEST_NAME test-mtmd-c-api)
llama_build_and_test(test-mtmd-c-api.c)
target_link_libraries(${LLAMA_TEST_NAME} PRIVATE mtmd)
unset(LLAMA_TEST_NAME)
# GGUF model data fetcher library for tests that need real model metadata
# Only compile when cpp-httplib has SSL support (CPPHTTPLIB_OPENSSL_SUPPORT)
@@ -284,4 +285,5 @@ target_link_libraries(${TEST_TARGET} PRIVATE llama)
llama_build_and_test(test-alloc.cpp)
target_include_directories(test-alloc PRIVATE ${PROJECT_SOURCE_DIR}/ggml/src)
llama_build(export-graph-ops.cpp)
target_include_directories(export-graph-ops PRIVATE ${PROJECT_SOURCE_DIR}/ggml/src)
+169
View File
@@ -0,0 +1,169 @@
#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"
#include "../src/llama-ext.h"
#include "ggml.h"
#include <array>
#include <vector>
#include <set>
#include <fstream>
#include <iostream>
struct input_tensor {
ggml_type type;
std::array<int64_t, 4> ne;
std::array<size_t, 4> nb;
input_tensor(ggml_type type, int64_t * ne, size_t * nb): type(type) {
memcpy(this->ne.data(), ne, 4 * sizeof(int64_t));
memcpy(this->nb.data(), nb, 4 * sizeof(size_t));
}
bool operator<(const input_tensor &b) const {
return std::tie(type, ne, nb) <
std::tie(b.type, b.ne, b.nb);
}
void serialize(std::ostream& out) const {
out << type << ' ';
for (size_t i = 0; i < 4; i++) {
out << ne[i] << ' ';
}
for (size_t i = 0; i < 4; i++) {
out << nb[i] << ' ';
}
}
};
struct test_object {
ggml_op op;
ggml_type type;
std::array<int64_t, 4> ne;
std::vector<int32_t> op_params;
std::vector<input_tensor> sources;
std::string name;
void serialize(std::ostream& out) const {
out << op << ' ' << type << ' ';
for (size_t i = 0; i < 4; i++) {
out << ne[i] << ' ';
}
out << op_params.size() << ' ';
for (size_t i = 0; i < op_params.size(); i++) {
out << op_params[i] << ' ';
}
out << sources.size() << ' ';
for (size_t s = 0; s < sources.size(); s++) {
sources[s].serialize(out);
}
if (!name.empty()) {
out << name;
} else {
out << '-';
}
out << '\n';
}
bool operator<(const test_object &b) const {
return std::tie(op, type, ne, op_params, sources) <
std::tie(b.op, b.type, b.ne, b.op_params, b.sources);
}
};
static void extract_graph_ops(ggml_cgraph * cgraph, const char * label, std::set<test_object> & tests) {
int n_nodes = ggml_graph_n_nodes(cgraph);
int n_skipped = 0;
int n_before = (int) tests.size();
for (int i = 0; i < n_nodes; i++) {
ggml_tensor * node = ggml_graph_node(cgraph, i);
if (node->op == GGML_OP_NONE || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE) {
n_skipped++;
continue;
}
test_object test;
test.op = node->op;
test.type = node->type;
memcpy(&test.ne, node->ne, 4 * sizeof(int64_t));
test.op_params.resize(GGML_MAX_OP_PARAMS / sizeof(int32_t));
memcpy(test.op_params.data(), node->op_params, GGML_MAX_OP_PARAMS);
for (size_t s = 0; s < GGML_MAX_SRC; s++) {
if (node->src[s] == nullptr) {
break;
}
test.sources.emplace_back(node->src[s]->type, node->src[s]->ne, node->src[s]->nb);
}
test.name = node->name;
tests.insert(test);
}
int n_new = (int) tests.size() - n_before;
LOG_INF("%s: %d unique ops, %d total nodes, %d skipped (view ops)\n",
label, n_new, n_nodes, n_skipped);
}
int main(int argc, char ** argv) {
common_params params;
params.out_file = "tests.txt";
if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_EXPORT_GRAPH_OPS)) {
return 1;
}
common_init();
// Load CPU-only
ggml_backend_dev_t cpu_device = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
params.devices = { cpu_device, nullptr };
params.fit_params = false;
params.n_gpu_layers = 0;
params.warmup = false;
auto init_result = common_init_from_params(params);
llama_context * ctx = init_result->context();
const uint32_t n_seqs = llama_n_seq_max(ctx);
const uint32_t n_tokens = std::min(llama_n_ctx(ctx), llama_n_ubatch(ctx));
std::set<test_object> tests;
auto * gf_pp = llama_graph_reserve(ctx, n_tokens, n_seqs, n_tokens);
if (!gf_pp) {
throw std::runtime_error("failed to reserve prompt processing graph");
}
extract_graph_ops(gf_pp, "pp", tests);
auto * gf_tg = llama_graph_reserve(ctx, n_seqs, n_seqs, n_seqs);
if (!gf_tg) {
throw std::runtime_error("failed to reserve token generation graph");
}
extract_graph_ops(gf_tg, "tg", tests);
LOG_INF("%d unique ops total\n", (int) tests.size());
std::ofstream f(params.out_file);
if (!f.is_open()) {
throw std::runtime_error("Unable to open output file");
}
for (const auto& test : tests) {
test.serialize(f);
}
return 0;
}
+345 -12
View File
@@ -31,10 +31,12 @@
#include <cstring>
#include <ctime>
#include <future>
#include <fstream>
#include <memory>
#include <random>
#include <regex>
#include <set>
#include <sstream>
#include <string>
#include <string_view>
#include <thread>
@@ -6648,6 +6650,236 @@ struct test_diag : public test_case {
}
};
// Deserializable generic test case
struct input_tensor {
ggml_type type;
std::array<int64_t, 4> ne;
std::array<size_t, 4> nb; // strides (0 = use default contiguous strides)
};
static bool is_non_contiguous(const input_tensor & src) {
if (src.nb[0] == 0) {
return false;
}
const size_t default_nb0 = ggml_type_size(src.type);
const size_t default_nb1 = default_nb0 * (src.ne[0] / ggml_blck_size(src.type));
const size_t default_nb2 = default_nb1 * src.ne[1];
const size_t default_nb3 = default_nb2 * src.ne[2];
return src.nb[0] != default_nb0 ||
src.nb[1] != default_nb1 ||
src.nb[2] != default_nb2 ||
src.nb[3] != default_nb3;
}
static std::string var_to_str(const std::vector<input_tensor>& sources) {
std::ostringstream oss;
bool first = true;
for (const auto& src : sources) {
if (!first) oss << ",";
oss << ggml_type_name(src.type) << "[" << src.ne[0] << "," << src.ne[1] << "," << src.ne[2] << "," << src.ne[3] << "]";
if (is_non_contiguous(src)) {
oss << "nb[" << src.nb[0] << "," << src.nb[1] << "," << src.nb[2] << "," << src.nb[3] << "]";
}
first = false;
}
return oss.str();
}
static std::string var_to_str(const std::array<int32_t, GGML_MAX_OP_PARAMS / sizeof(int32_t)>& params) {
std::ostringstream oss;
oss << "[";
bool first = true;
for (size_t i = 0; i < params.size(); ++i) {
if (params[i] != 0) {
if (!first) oss << ",";
oss << i << ":" << params[i];
first = false;
}
}
oss << "]";
return oss.str();
}
struct test_generic_op : public test_case {
const ggml_op op;
const ggml_type type;
const std::array<int64_t, 4> ne;
const std::array<int32_t, GGML_MAX_OP_PARAMS / sizeof(int32_t)> op_params;
const std::vector<input_tensor> sources;
const std::string name;
std::string vars() override {
if (name.empty()) {
return VARS_TO_STR4(type, ne, op_params, sources);
}
return VARS_TO_STR5(name, type, ne, op_params, sources);
}
test_generic_op(ggml_op op, ggml_type type, std::array<int64_t, 4> ne,
std::array<int32_t, GGML_MAX_OP_PARAMS / sizeof(int32_t)> op_params,
std::vector<input_tensor> sources, std::string name = "")
: op(op), type(type), ne(ne), op_params(op_params), sources(sources), name(std::move(name)) {}
ggml_tensor * build_graph(ggml_context * ctx) override {
const size_t source_count = std::min(sources.size(), (size_t)GGML_MAX_SRC);
std::array<ggml_tensor *, GGML_MAX_SRC> source_tensors;
for (size_t i = 0; i < source_count; ++i) {
const input_tensor& src = sources[i];
if (is_non_contiguous(src)) {
size_t total_size;
const size_t blck_size = ggml_blck_size(src.type);
if (blck_size == 1) {
total_size = ggml_type_size(src.type);
for (int d = 0; d < 4; d++) {
total_size += (src.ne[d] - 1) * src.nb[d];
}
} else {
total_size = src.ne[0] * src.nb[0] / blck_size;
for (int d = 1; d < 4; d++) {
total_size += (src.ne[d] - 1) * src.nb[d];
}
}
// Convert bytes to elements, padded to block size for quantized types
const size_t type_size = ggml_type_size(src.type);
size_t backing_elements = (total_size * blck_size + type_size - 1) / type_size;
backing_elements = ((backing_elements + blck_size - 1) / blck_size) * blck_size;
ggml_tensor * backing = ggml_new_tensor_1d(ctx, src.type, backing_elements);
source_tensors[i] = ggml_view_4d(ctx, backing,
src.ne[0], src.ne[1], src.ne[2], src.ne[3],
src.nb[1], src.nb[2], src.nb[3], 0);
// nb[0] does not get set by view_4d, so set it manually
source_tensors[i]->nb[0] = src.nb[0];
} else {
source_tensors[i] = ggml_new_tensor_4d(ctx, src.type, src.ne[0], src.ne[1], src.ne[2], src.ne[3]);
}
}
// Ops with an inplace flag create a view of src[0] as their output.
bool inplace = false;
if (op == GGML_OP_SET || op == GGML_OP_ACC) {
inplace = op_params[4] != 0;
} else if (op == GGML_OP_ADD_REL_POS) {
inplace = op_params[0] != 0;
}
ggml_tensor * out;
if (inplace && source_count > 0) {
out = ggml_view_tensor(ctx, source_tensors[0]);
} else {
out = ggml_new_tensor_4d(ctx, type, ne[0], ne[1], ne[2], ne[3]);
}
out->op = op;
for (size_t i = 0; i < source_count; ++i) {
out->src[i] = source_tensors[i];
}
memcpy(out->op_params, op_params.data(), GGML_MAX_OP_PARAMS);
ggml_set_name(out, "out");
return out;
}
double max_nmse_err() override {
switch (op) {
case GGML_OP_MUL_MAT:
case GGML_OP_MUL_MAT_ID:
case GGML_OP_OUT_PROD:
case GGML_OP_CONV_TRANSPOSE_2D:
case GGML_OP_IM2COL:
case GGML_OP_CONV_2D:
case GGML_OP_CONV_3D:
case GGML_OP_SET_ROWS:
case GGML_OP_CPY:
return 5e-4;
case GGML_OP_SOFT_MAX:
return 1e-6;
case GGML_OP_RWKV_WKV7:
return 5e-3;
case GGML_OP_FLASH_ATTN_EXT:
{
// Scale error with kv length to account for accumulating floating point error
const int64_t kv = sources[1].ne[1];
return 5e-4 * std::max(1.0, kv / 20000.0);
}
default:
return 1e-7;
}
}
void initialize_tensors(ggml_context * ctx) override {
ggml_tensor * out = ggml_get_tensor(ctx, "out");
std::random_device rd;
std::default_random_engine rng(rd());
for (size_t i = 0; i < sources.size() && i < GGML_MAX_SRC; i++) {
ggml_tensor * t = out->src[i];
if (!t) {
break;
}
// FLASH_ATTN_EXT: src[3] is the KQ mask
if (op == GGML_OP_FLASH_ATTN_EXT && i == 3) {
init_tensor_kq_mask(t);
continue;
}
if (t->type == GGML_TYPE_I32 || t->type == GGML_TYPE_I64) {
if (op == GGML_OP_GET_ROWS || op == GGML_OP_GET_ROWS_BACK) {
const int64_t num_rows = sources[0].ne[1];
const int64_t nels = ggml_nelements(t);
std::vector<int32_t> data(nels);
std::uniform_int_distribution<int32_t> dist(0, num_rows - 1);
for (int64_t i = 0; i < nels; i++) {
data[i] = dist(rng);
}
ggml_backend_tensor_set(t, data.data(), 0, nels * sizeof(int32_t));
} else if (op == GGML_OP_SET_ROWS) {
init_set_rows_row_ids(t, ne[1]);
} else if (op == GGML_OP_ROPE) {
const int mode = op_params[2];
const int64_t nels = (mode & GGML_ROPE_TYPE_MROPE) ? ne[2] * 4 : ne[2];
std::vector<int32_t> data(nels);
std::uniform_int_distribution<int32_t> dist(0, ne[2] - 1);
for (int64_t i = 0; i < nels; i++) {
data[i] = dist(rng);
}
ggml_backend_tensor_set(t, data.data(), 0, nels * sizeof(int32_t));
} else if (op == GGML_OP_MUL_MAT_ID || op == GGML_OP_ADD_ID) {
const int64_t n_expert = (op == GGML_OP_MUL_MAT_ID) ? sources[0].ne[2] : sources[1].ne[1];
for (int64_t r = 0; r < ggml_nrows(t); r++) {
std::vector<int32_t> data(t->ne[0]);
for (int32_t i = 0; i < t->ne[0]; i++) {
data[i] = i % n_expert;
}
std::shuffle(data.begin(), data.end(), rng);
ggml_backend_tensor_set(t, data.data(), r * t->nb[1], t->ne[0] * sizeof(int32_t));
}
} else if (op == GGML_OP_SSM_SCAN) {
for (int64_t r = 0; r < ggml_nrows(t); r++) {
std::vector<int32_t> data(t->ne[0]);
for (int32_t i = 0; i < t->ne[0]; i++) {
data[i] = i;
}
std::shuffle(data.begin(), data.end(), rng);
ggml_backend_tensor_set(t, data.data(), r * t->nb[1], t->ne[0] * sizeof(int32_t));
}
} else {
init_tensor_uniform(t);
}
} else {
init_tensor_uniform(t);
}
}
}
};
enum llm_norm_type {
LLM_NORM,
@@ -8731,11 +8963,92 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
test_cases.emplace_back(new test_acc(GGML_TYPE_F32, {256, 17, 2, 3}, {128, 16, 2, 3}, 2));
test_cases.emplace_back(new test_acc(GGML_TYPE_F32, {256, 17, 2, 3}, {64, 16, 2, 3}, 3));
// GATED_DELTA_NET: realistic model configurations
// TG: n_seq_tokens=1 (autoregressive)
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 1, 1)); // Qwen3.5-like: 32 heads, d=128
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 16, 64, 1, 1)); // smaller model
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 1, 1, 1, false, true)); // KDA
// PP: n_seq_tokens=64,256 (prompt processing)
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 64, 1)); // PP-64
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 256, 1)); // PP-256
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 512, 1)); // PP-512
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 1024, 1)); // PP-1024
// Small model configs (fewer heads = less GPU occupancy for autoregressive)
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 128, 64, 1)); // 4h PP-64
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 128, 256, 1)); // 4h PP-256
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 128, 512, 1)); // 4h PP-512
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 128, 1024, 1)); // 4h PP-1024
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 64, 1, 1, false, true)); // KDA PP-64
return test_cases;
}
static std::vector<std::unique_ptr<test_case>> make_test_cases_from_file(const char * path) {
std::ifstream f(path);
if (!f.is_open()) {
throw std::runtime_error("Unable to read test file");
}
std::vector<std::unique_ptr<test_case>> test_cases;
std::string line;
while (std::getline(f, line)) {
std::istringstream iss(line);
ggml_op op;
ggml_type type;
std::array<int64_t, 4> ne;
std::array<int32_t, GGML_MAX_OP_PARAMS / sizeof(int32_t)> op_params = {};
std::string name;
uint64_t tmp;
iss >> tmp;
op = (ggml_op)tmp;
iss >> tmp;
type = (ggml_type)tmp;
for (size_t i = 0; i < 4; i++) {
iss >> ne[i];
}
iss >> tmp;
for (size_t i = 0; i < tmp && i < op_params.size(); i++) {
iss >> op_params[i];
}
iss >> tmp;
size_t num_src = std::min((uint64_t)GGML_MAX_SRC, tmp);
std::vector<input_tensor> sources(num_src);
for (size_t i = 0; i < num_src; i++) {
input_tensor& src = sources[i];
iss >> tmp;
src.type = (ggml_type)tmp;
for (size_t i = 0; i < 4; i++) {
iss >> src.ne[i];
}
for (size_t i = 0; i < 4; i++) {
iss >> src.nb[i];
}
}
iss >> name;
if (name.length() == 1 && name[0] == '-') {
name = "";
}
test_cases.emplace_back(new test_generic_op(op, type, ne, op_params, sources, std::move(name)));
}
return test_cases;
}
static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op_names_filter, const char * params_filter,
printer * output_printer) {
printer * output_printer, const char * test_file_path) {
auto filter_test_cases = [](std::vector<std::unique_ptr<test_case>> & test_cases, const char * params_filter) {
if (params_filter == nullptr) {
return;
@@ -8753,9 +9066,26 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
}
};
std::vector<std::unique_ptr<test_case>> test_cases;
if (test_file_path == nullptr) {
switch (mode) {
case MODE_TEST:
case MODE_GRAD:
case MODE_SUPPORT:
test_cases = make_test_cases_eval();
break;
case MODE_PERF:
test_cases = make_test_cases_perf();
break;
}
} else {
test_cases = make_test_cases_from_file(test_file_path);
}
filter_test_cases(test_cases, params_filter);
if (mode == MODE_TEST) {
auto test_cases = make_test_cases_eval();
filter_test_cases(test_cases, params_filter);
ggml_backend_t backend_cpu = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, NULL);
if (backend_cpu == NULL) {
test_operation_info info("", "", "CPU");
@@ -8795,8 +9125,6 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
}
if (mode == MODE_GRAD) {
auto test_cases = make_test_cases_eval();
filter_test_cases(test_cases, params_filter);
size_t n_ok = 0;
for (auto & test : test_cases) {
if (test->eval_grad(backend, op_names_filter, output_printer)) {
@@ -8809,8 +9137,6 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
}
if (mode == MODE_PERF) {
auto test_cases = make_test_cases_perf();
filter_test_cases(test_cases, params_filter);
for (auto & test : test_cases) {
test->eval_perf(backend, op_names_filter, output_printer);
}
@@ -8818,9 +9144,6 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
}
if (mode == MODE_SUPPORT) {
auto test_cases = make_test_cases_eval();
filter_test_cases(test_cases, params_filter);
// Filter out fusion cases
test_cases.erase(
std::remove_if(test_cases.begin(), test_cases.end(), [](const std::unique_ptr<test_case> & tc) {
@@ -8939,7 +9262,8 @@ static void show_test_coverage() {
}
static void usage(char ** argv) {
printf("Usage: %s [mode] [-o <op,..>] [-b <backend>] [-p <params regex>] [--output <console|sql|csv>] [--list-ops] [--show-coverage]\n", argv[0]);
printf("Usage: %s [mode] [-o <op,..>] [-b <backend>] [-p <params regex>] [--output <console|sql|csv>] [--list-ops]", argv[0]);
printf(" [--show-coverage] [--test-file <path>]\n");
printf(" valid modes:\n");
printf(" - test (default, compare with CPU backend for correctness)\n");
printf(" - grad (compare gradients from backpropagation with method of finite differences)\n");
@@ -8950,6 +9274,7 @@ static void usage(char ** argv) {
printf(" --output specifies output format (default: console, options: console, sql, csv)\n");
printf(" --list-ops lists all available GGML operations\n");
printf(" --show-coverage shows test coverage\n");
printf(" --test-file reads test operators from a test file generated by llama-export-graph-ops\n");
}
int main(int argc, char ** argv) {
@@ -8958,6 +9283,7 @@ int main(int argc, char ** argv) {
const char * op_names_filter = nullptr;
const char * backend_filter = nullptr;
const char * params_filter = nullptr;
const char * test_file_path = nullptr;
for (int i = 1; i < argc; i++) {
if (strcmp(argv[i], "test") == 0) {
@@ -9005,6 +9331,13 @@ int main(int argc, char ** argv) {
} else if (strcmp(argv[i], "--show-coverage") == 0) {
show_test_coverage();
return 0;
} else if (strcmp(argv[i], "--test-file") == 0) {
if (i + 1 < argc) {
test_file_path = argv[++i];
} else {
usage(argv);
return 1;
}
} else {
usage(argv);
return 1;
@@ -9057,7 +9390,7 @@ int main(int argc, char ** argv) {
false, "", ggml_backend_dev_description(dev),
total / 1024 / 1024, free / 1024 / 1024, true));
bool ok = test_backend(backend, mode, op_names_filter, params_filter, output_printer.get());
bool ok = test_backend(backend, mode, op_names_filter, params_filter, output_printer.get(), test_file_path);
if (ok) {
n_ok++;
+37 -1
View File
@@ -15,8 +15,12 @@
using json = nlohmann::ordered_json;
static llama_grammar * build_grammar_with_root(const std::string & grammar_str, const char * grammar_root) {
return llama_grammar_init_impl(nullptr, grammar_str.c_str(), grammar_root, false, nullptr, 0, nullptr, 0);
}
static llama_grammar * build_grammar(const std::string & grammar_str) {
return llama_grammar_init_impl(nullptr, grammar_str.c_str(), "root", false, nullptr, 0, nullptr, 0);
return build_grammar_with_root(grammar_str, "root");
}
static bool test_build_grammar_fails(const std::string & grammar_str) {
@@ -860,6 +864,36 @@ static void test_failure_left_recursion() {
fprintf(stderr, " ✅︎ Passed\n");
}
static void test_failure_missing_root_symbol() {
fprintf(stderr, "⚫ Testing missing root symbol:\n");
const std::string grammar_str = R"""(
root ::= "foobar"
)""";
llama_grammar * failure_result = build_grammar_with_root(grammar_str, "nonexistent");
assert(failure_result == nullptr);
fprintf(stderr, " ✅︎ Passed\n");
}
static void test_custom_root_symbol_check() {
fprintf(stderr, "⚫ Testing custom root symbol check:\n");
const std::string custom_root_grammar_str = R"""(
foobar ::= "foobar"
)""";
llama_grammar * failure_result = build_grammar_with_root(custom_root_grammar_str, "root");
assert(failure_result == nullptr);
llama_grammar * success_result = build_grammar_with_root(custom_root_grammar_str, "foobar");
assert(success_result != nullptr);
llama_grammar_free_impl(success_result);
fprintf(stderr, " ✅︎ Passed\n");
}
static void test_json_schema() {
// Note that this is similar to the regular grammar tests,
// but we convert each json schema to a grammar before parsing.
@@ -1433,6 +1467,8 @@ int main() {
test_failure_missing_root();
test_failure_missing_reference();
test_failure_left_recursion();
test_failure_missing_root_symbol();
test_custom_root_symbol_check();
test_json_schema();
fprintf(stdout, "All tests passed.\n");
return 0;
+3 -3
View File
@@ -470,12 +470,12 @@ static bool decode_audio_from_buf(const unsigned char * buf_in, size_t len, int
mtmd_bitmap * mtmd_helper_bitmap_init_from_buf(mtmd_context * ctx, const unsigned char * buf, size_t len) {
if (audio_helpers::is_audio_file((const char *)buf, len)) {
std::vector<float> pcmf32;
int bitrate = mtmd_get_audio_bitrate(ctx);
if (bitrate < 0) {
const int sample_rate = mtmd_get_audio_sample_rate(ctx);
if (sample_rate < 0) {
LOG_ERR("This model does not support audio input\n");
return nullptr;
}
if (!audio_helpers::decode_audio_from_buf(buf, len, bitrate, pcmf32)) {
if (!audio_helpers::decode_audio_from_buf(buf, len, sample_rate, pcmf32)) {
LOG_ERR("Unable to read WAV audio file from buffer\n");
return nullptr;
}
+1 -1
View File
@@ -912,7 +912,7 @@ bool mtmd_support_audio(mtmd_context * ctx) {
return ctx->ctx_a != nullptr;
}
int mtmd_get_audio_bitrate(mtmd_context * ctx) {
int mtmd_get_audio_sample_rate(mtmd_context * ctx) {
if (!ctx->ctx_a) {
return -1;
}
+2 -2
View File
@@ -125,9 +125,9 @@ MTMD_API bool mtmd_support_vision(mtmd_context * ctx);
// whether the current model supports audio input
MTMD_API bool mtmd_support_audio(mtmd_context * ctx);
// get audio bitrate in Hz, for example 16000 for Whisper
// get audio sample rate in Hz, for example 16000 for Whisper
// return -1 if audio is not supported
MTMD_API int mtmd_get_audio_bitrate(mtmd_context * ctx);
MTMD_API int mtmd_get_audio_sample_rate(mtmd_context * ctx);
// mtmd_bitmap
//
Binary file not shown.
+3
View File
@@ -1189,6 +1189,9 @@ private:
? SLOT_STATE_WAIT_OTHER // wait for the parent to process prompt
: SLOT_STATE_STARTED;
// reset server kill-switch counter
n_empty_consecutive = 0;
SLT_INF(slot, "processing task, is_child = %d\n", slot.task->is_child());
return true;
}
+34
View File
@@ -101,6 +101,40 @@ def test_embedding_mixed_input(input, is_multi_prompt: bool):
assert len(data[0]['embedding']) > 1
def test_embedding_pooling_mean():
global server
server.pooling = 'mean'
server.start()
res = server.make_request("POST", "/v1/embeddings", data={
"input": "I believe the meaning of life is",
})
assert res.status_code == 200
assert len(res.body['data']) == 1
assert 'embedding' in res.body['data'][0]
assert len(res.body['data'][0]['embedding']) > 1
# make sure embedding vector is normalized
assert abs(sum([x ** 2 for x in res.body['data'][0]['embedding']]) - 1) < EPSILON
def test_embedding_pooling_mean_multiple():
global server
server.pooling = 'mean'
server.start()
res = server.make_request("POST", "/v1/embeddings", data={
"input": [
"I believe the meaning of life is",
"Write a joke about AI",
"This is a test",
],
})
assert res.status_code == 200
assert len(res.body['data']) == 3
for d in res.body['data']:
assert 'embedding' in d
assert len(d['embedding']) > 1
def test_embedding_pooling_none():
global server
server.pooling = 'none'
+13 -12
View File
@@ -11,6 +11,7 @@ sys.path.insert(0, str(path))
import datetime
from utils import *
from typing import Literal
server: ServerProcess
@@ -23,24 +24,24 @@ def create_server():
@pytest.mark.parametrize("tools", [None, [], [TEST_TOOL]])
@pytest.mark.parametrize("template_name,reasoning_budget,expected_end", [
("deepseek-ai-DeepSeek-R1-Distill-Qwen-32B", None, "<think>\n"),
("deepseek-ai-DeepSeek-R1-Distill-Qwen-32B", -1, "<think>\n"),
("deepseek-ai-DeepSeek-R1-Distill-Qwen-32B", 0, "<think>\n</think>"),
@pytest.mark.parametrize("template_name,reasoning,expected_end", [
("deepseek-ai-DeepSeek-R1-Distill-Qwen-32B", "on", "<think>\n"),
("deepseek-ai-DeepSeek-R1-Distill-Qwen-32B","auto", "<think>\n"),
("deepseek-ai-DeepSeek-R1-Distill-Qwen-32B", "off", "<think>\n</think>"),
("Qwen-Qwen3-0.6B", -1, "<|im_start|>assistant\n"),
("Qwen-Qwen3-0.6B", 0, "<|im_start|>assistant\n<think>\n\n</think>\n\n"),
("Qwen-Qwen3-0.6B","auto", "<|im_start|>assistant\n"),
("Qwen-Qwen3-0.6B", "off", "<|im_start|>assistant\n<think>\n\n</think>\n\n"),
("Qwen-QwQ-32B", -1, "<|im_start|>assistant\n<think>\n"),
("Qwen-QwQ-32B", 0, "<|im_start|>assistant\n<think>\n</think>"),
("Qwen-QwQ-32B","auto", "<|im_start|>assistant\n<think>\n"),
("Qwen-QwQ-32B", "off", "<|im_start|>assistant\n<think>\n</think>"),
("CohereForAI-c4ai-command-r7b-12-2024-tool_use", -1, "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"),
("CohereForAI-c4ai-command-r7b-12-2024-tool_use", 0, "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_THINKING|><|END_THINKING|>"),
("CohereForAI-c4ai-command-r7b-12-2024-tool_use","auto", "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"),
("CohereForAI-c4ai-command-r7b-12-2024-tool_use", "off", "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_THINKING|><|END_THINKING|>"),
])
def test_reasoning_budget(template_name: str, reasoning_budget: int | None, expected_end: str, tools: list[dict]):
def test_reasoning(template_name: str, reasoning: Literal['on', 'off', 'auto'] | None, expected_end: str, tools: list[dict]):
global server
server.jinja = True
server.reasoning_budget = reasoning_budget
server.reasoning = reasoning
server.chat_template_file = f'../../../models/templates/{template_name}.jinja'
server.start()
+3 -3
View File
@@ -95,7 +95,7 @@ class ServerProcess:
no_webui: bool | None = None
jinja: bool | None = None
reasoning_format: Literal['deepseek', 'none', 'nothink'] | None = None
reasoning_budget: int | None = None
reasoning: Literal['on', 'off', 'auto'] | None = None
chat_template: str | None = None
chat_template_file: str | None = None
server_path: str | None = None
@@ -225,8 +225,8 @@ class ServerProcess:
server_args.append("--no-jinja")
if self.reasoning_format is not None:
server_args.extend(("--reasoning-format", self.reasoning_format))
if self.reasoning_budget is not None:
server_args.extend(("--reasoning-budget", self.reasoning_budget))
if self.reasoning is not None:
server_args.extend(("--reasoning", self.reasoning))
if self.chat_template:
server_args.extend(["--chat-template", self.chat_template])
if self.chat_template_file:
@@ -62,15 +62,12 @@
chatStore.getConversationModel(activeMessages() as DatabaseMessage[])
);
let previousConversationModel: string | null = null;
$effect(() => {
if (conversationModel && conversationModel !== previousConversationModel) {
previousConversationModel = conversationModel;
if (!isRouter || modelsStore.isModelLoaded(conversationModel)) {
modelsStore.selectModelByName(conversationModel);
}
if (conversationModel) {
modelsStore.selectModelByName(conversationModel);
} else if (isRouter && modelsStore.loadedModelIds.length > 0) {
const first = modelOptions().find((m) => modelsStore.loadedModelIds.includes(m.model));
if (first) modelsStore.selectModelById(first.id);
}
});
+12 -5
View File
@@ -4424,7 +4424,8 @@ get_range_offset_and_length(Range r, size_t content_length) {
assert(r.first <= r.second &&
r.second < static_cast<ssize_t>(content_length));
(void)(content_length);
return std::make_pair(r.first, static_cast<size_t>(r.second - r.first) + 1);
return std::make_pair(static_cast<size_t>(r.first),
static_cast<size_t>(r.second - r.first) + 1);
}
std::string make_content_range_header_field(
@@ -8616,11 +8617,17 @@ ClientImpl::open_stream(const std::string &method, const std::string &path,
handle.body_reader_.stream = handle.stream_;
handle.body_reader_.payload_max_length = payload_max_length_;
auto content_length_str = handle.response->get_header_value("Content-Length");
if (!content_length_str.empty()) {
if (handle.response->has_header("Content-Length")) {
bool is_invalid = false;
auto content_length = detail::get_header_value_u64(
handle.response->headers, "Content-Length", 0, 0, is_invalid);
if (is_invalid) {
handle.error = Error::Read;
handle.response.reset();
return handle;
}
handle.body_reader_.has_content_length = true;
handle.body_reader_.content_length =
static_cast<size_t>(std::stoull(content_length_str));
handle.body_reader_.content_length = content_length;
}
auto transfer_encoding =
+3 -23
View File
@@ -8,28 +8,8 @@
#ifndef CPPHTTPLIB_HTTPLIB_H
#define CPPHTTPLIB_HTTPLIB_H
#define CPPHTTPLIB_VERSION "0.37.0"
#define CPPHTTPLIB_VERSION_NUM "0x002500"
/*
* Platform compatibility check
*/
#if defined(_WIN32) && !defined(_WIN64)
#if defined(_MSC_VER)
#pragma message( \
"cpp-httplib doesn't support 32-bit Windows. Please use a 64-bit compiler.")
#else
#warning \
"cpp-httplib doesn't support 32-bit Windows. Please use a 64-bit compiler."
#endif
#elif defined(__SIZEOF_POINTER__) && __SIZEOF_POINTER__ < 8
#warning \
"cpp-httplib doesn't support 32-bit platforms. Please use a 64-bit compiler."
#elif defined(__SIZEOF_SIZE_T__) && __SIZEOF_SIZE_T__ < 8
#warning \
"cpp-httplib doesn't support platforms where size_t is less than 64 bits."
#endif
#define CPPHTTPLIB_VERSION "0.37.1"
#define CPPHTTPLIB_VERSION_NUM "0x002501"
#ifdef _WIN32
#if defined(_WIN32_WINNT) && _WIN32_WINNT < 0x0A00
@@ -2797,7 +2777,7 @@ inline size_t get_header_value_u64(const Headers &headers,
std::advance(it, static_cast<ssize_t>(id));
if (it != rng.second) {
if (is_numeric(it->second)) {
return std::strtoull(it->second.data(), nullptr, 10);
return static_cast<size_t>(std::strtoull(it->second.data(), nullptr, 10));
} else {
is_invalid_value = true;
}