Compare commits

35 Commits

Author SHA1 Message Date
o7si 32eddaf2ea cmake : fix ui build with read-only source (#24752) 2026-06-18 18:59:18 +02:00
Xuan-Son Nguyen 060ce1bf72 mtmd: refactor llava-uhd overview image handling (always use ov_img_first) (#24769)
* add dedicated "overview" for mtmd_image_preproc_out

* corrections

* correct (again)

* nits

* nits (2)
2026-06-18 18:53:49 +02:00
Max Krasnyansky d2c67959b3 hexagon: support for op-trace (fine-grain tracing of HVX/HMX/DMA events) (#24592)
* hex-optrace: add support for optrace and instrument matmul and flash-atten code

* hex-trace: improve trace event and perfetto generator

* hex-trace: add new script dedicated to handling traces, specifically perfetto traces

* hex-trace: add --head/--tail options to profile and trace tools

* hex-trace: fix whitespaces

* hex-trace: fix flake8 warnings

* hex-trace: fix flake8 warnings

* hmx-fa: restore q_tiles clearing

* hex-profile: remove circular dep in includes

* hex-trace: simplify trace sizing check

* hex-profile: sort events in the summary by name
2026-06-18 08:35:02 -07:00
Kangjia Gao 7b6c5a2aed docs: fix export-lora --lora-scaled syntax [no release] (#24703)
Assisted-by: Codex
2026-06-18 16:46:17 +02:00
Xuan-Son Nguyen fe7c8b2414 server: (router) fix stopping_thread potentially hang (#24728)
* server: (router) fix stopping_thread potentially hang

* fix windows build
2026-06-18 15:41:09 +02:00
Xuan-Son Nguyen e1efd0991d server: add "schema" and validation (#24150)
* wip

* working

* correct some limits

* add field name to error message
2026-06-18 15:40:58 +02:00
Aarni Koskela 08023072ef server : add last-5-seconds generation speed display (#24291)
* server : add last-5-seconds generation speed display

* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-18 14:02:20 +02:00
Amos Wong 20832179e2 ui: provide touch accessible model selection UI (#24604)
* ui : add model selector storybook stories

Covers list, favorites, single-model, all status states
(loading/loaded/sleeping/failed/idle), and selection states.

* ui : improve model selector mobile UX with hover media queries

Use @media (hover:none) to show action buttons directly on touch
devices and color-code them by model status (amber=sleeping,
green=loaded, muted=idle). Status dots hidden on touch. Desktop
hover behavior unchanged.
2026-06-18 13:14:20 +02:00
Anuj Attri 10786217e9 server : return HTTP 400 on invalid grammar (#24144) (#24154)
Throw on grammar parse failure so the server returns HTTP 400
instead of silently dropping the constraint.
Add a regression test for the invalid-grammar response.

Fixes #24144
2026-06-18 12:49:14 +02:00
Xuan-Son Nguyen 552258c535 server: (router) rework -hf preset repo (#24739)
* server: temporary remove HF remote preset

* rework remove preset.ini support

* rm unused get_remote_preset_whitelist()

* print warning

* add docs

* rm stray file
2026-06-18 12:45:23 +02:00
Xuan-Son Nguyen 968c43891a server: fix router args not being forwarded to child instances (#24760) 2026-06-18 12:15:46 +02:00
Xuan-Son Nguyen 24bba7b98e mtmd: refactor preprocessor, add mtmd_image_preproc_out (#24736)
* add mtmd_image_preproc_out

* add dev docs

* remove unused clip API

* rm unused clip_image_f32_batch::grid

* change preprocess() call signature
2026-06-18 12:04:39 +02:00
Neo Zhang 9724f664e8 [SYCL] rename GGML_SYCL_SUPPORT_LEVEL_ZERO (#24719)
* rename GGML_SYCL_SUPPORT_LEVEL_ZERO to GGML_SYCL_SUPPORT_LEVEL_ZERO_API, and GGML_SYCL_ENABLE_LEVEL_ZERO to  GGML_SYCL_USE_LEVEL_ZERO_API

* fix code format

* fix error when rebase
2026-06-18 11:18:26 +03:00
Neo Zhang dd69db2924 sycl : support MUL_MAT and OUT_PROD with Q1_0 (#24721) 2026-06-18 11:17:37 +03:00
Adrien Gallouët 6ec59ddaea app : enable self-update only when built with llama-install.sh (#24754)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-06-18 09:57:59 +02:00
Sigbjørn Skjæret 32e806b9c1 ci : fix check-release message parsing (#24751) 2026-06-18 09:32:56 +02:00
Neo Zhang 6f1034b32a [SYCL] support OPs: conv_2d, conv_2d_dw, conv2d_transpose (#24600)
* fix conflict

* fix format issue, rename

* rm debug code

* correct the file name
2026-06-18 09:40:03 +03:00
Aleksander Grygier 0b73fc79fe ui: Update code formatting command in pre-commit hook (#24685) 2026-06-18 08:33:50 +02:00
Ravi Panchumarthy 4a79037b8b ci : fix Windows x64 (OpenVINO) release link (#24731) 2026-06-18 08:30:08 +02:00
Georgi Gerganov cae0a3b0b0 metal : check for BF16 support in concat kernel (#24747) 2026-06-18 09:16:06 +03:00
Xuan-Son Nguyen f3e1828164 mtmd: llava_uhd should no longer use batch dim (#24732) 2026-06-17 22:40:50 +02:00
shalinib-ibm 2e88c49c90 ggml-cpu: Conditionally enable power11 backend based on compiler support (#24687)
* ggml: Conditionally enable power11 backend based on compiler support

Guard POWER11 backend creation behind a compiler flag check for -mcpu=power11. This avoids build failures on current GCC/Clang toolchains while preserving forward compatibility once POWER11 support becomes available.

* Update CMakeLists.txt

ggml-cpu: Use -mcpu=power10 for P10 and P11
2026-06-18 02:45:19 +08:00
Georgi Gerganov 0843245cb1 metal : implement rope_back operator (#24725)
Reuse existing rope kernels with a function constant to toggle forward/backward
rotation, avoiding duplicate kernel code.

Assisted-by: pi:llama.cpp/Qwen3.6-27B
2026-06-17 20:36:05 +03:00
Georgi Gerganov 8d2e580632 metal : add f16 and bf16 support for concat operator (#24724)
* metal : add f16 and bf16 support for concat operator

Extend the Metal backend concat operator to support f16 and bf16 tensor
types in addition to the existing f32 and i32 support.

- Template kernel_concat on type T with specializations for float, half,
  bfloat, and int
- Add type-specific pipeline getter ggml_metal_library_get_pipeline_concat()
- Update device support check to allow f16 unconditionally and bf16 when
  device supports bfloat16
- Update dispatch to select the correct kernel specialization by type

Assisted-by: pi:llama.cpp/Qwen3.6-27B

* metal : extend concat operator to support f16, bf16, i8, i16 and i64

Assisted-by: pi:llama.cpp/Qwen3.6-27B
2026-06-17 19:38:55 +03:00
Xuan-Son Nguyen 4b4d13ae72 server: (router) add model management API (#23976)
* wip

* server: (router) add SSE realtime updates API

* nits

* wip

* add download API

* add download api

* update docs

* add delete endpoint

* fix std::terminate

* fix crash

* fix 2

* add tests

* nits
2026-06-17 18:04:58 +02:00
Dev-iL b4024af6c2 llama : skip main_gpu validation when no devices are available (#23405) 2026-06-17 17:30:26 +03:00
Ruixiang Wang 1a2dea29b9 spec: fix segfault error on long prompts for eagle3 (#24707) 2026-06-17 17:29:49 +03:00
Neo Zhang 74a80dd9c0 [SYCL] add dev2dev memcpy by SYCL API (#24476)
* add dev2dev memcpy by SYCL API

* mv GGML_SYCL_DEV2DEV_MEMCPY to runtime table

* update the detect method for p2p comm

* fix the error created while fixing the conflict

---------

Co-authored-by: Neo Zhang <NA>
2026-06-17 17:21:34 +03:00
Neo Zhang d1759e4156 [SYCL] Add conv_3d (#24691)
* add conv_3d

* optimize

* update ops.md

* restore test script

* rm unused code

* rm copyright notes
2026-06-17 17:20:01 +03:00
Julien Chaumond 8086439a4c webui: export conversations as jsonl (#24688)
* webui: export conversations as jsonl

each session is one jsonl file, a session header line followed by one line per message
exporting multiple conversations bundles them into a zip, one jsonl file each

* webui: import jsonl and zip conversation exports

parse the new jsonl session format and zip archives on import
keep supporting the legacy json format
2026-06-17 13:25:47 +02:00
Winston Ma 558e221b70 vulkan: record actual memory properties during buffer creation (#24326) 2026-06-17 11:14:48 +02:00
Ruben Ortlam ea21e03955 Revert "cuda: reset cuda context after reading memory size (#23935)" (#24715)
This reverts commit 0f7fada56b.
2026-06-17 10:59:35 +02:00
kononnable d5376cf5d7 ci: fix vulkan docker images (#24595)
* Update vulkan-shaders-gen.cpp

* Update vulkan-shaders-gen.cpp

add comment describing code change intention

* Update vulkan-shaders-gen.cpp

fix potential UB
2026-06-17 09:43:45 +02:00
Harapan Rachman bae36efa30 UI : fix SSE transport detection and routing through CORS proxy. Assi… (#24500)
* UI : fix SSE transport detection and routing through CORS proxy. Assisted-by: Antigravity

* ui : replace magic strings with constants in MCP transport handling
2026-06-17 08:26:30 +02:00
lhez 51571722aa opencl: optimize mul_mat_f16_f32_l4 for decode (#24504) 2026-06-16 23:21:26 -07:00
113 changed files with 7350 additions and 3440 deletions
+4 -1
@@ -46,11 +46,13 @@ jobs:
steps:
- id: check
env:
COMMIT_MESSAGE: ${{ github.event.head_commit.message }}
run: |
if [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
echo "should_release=true" >> $GITHUB_OUTPUT
elif [[ "${{ github.event_name }}" == "push" && "${{ github.ref }}" == "refs/heads/master" ]]; then
if echo "${{ github.event.head_commit.message }}" | grep -q '\[no release\]'; then
if echo "$COMMIT_MESSAGE" | grep -q '\[no release\]'; then
echo "should_release=false" >> $GITHUB_OUTPUT
else
echo "should_release=true" >> $GITHUB_OUTPUT
@@ -542,6 +544,7 @@ jobs:
steps:
- name: Set OpenVINO version output
id: openvino_version
shell: bash
run: echo "value=${{ env.OPENVINO_VERSION_MAJOR }}" >> $GITHUB_OUTPUT
- name: Clone
+26 -13
@@ -20,16 +20,21 @@ int llama_fit_params(int argc, char ** argv);
int llama_quantize(int argc, char ** argv);
int llama_perplexity(int argc, char ** argv);
// hands the update over to the install script, which downloads and swaps the binary
// Self-update is only supported for binaries built with llama-install.sh
static int llama_update(int argc, char ** argv) {
(void) argc;
(void) argv;
#ifdef LLAMA_INSTALL_BUILD
#if defined(_WIN32)
return system("powershell -NoProfile -ExecutionPolicy Bypass -Command \"irm https://llama.app/install.ps1 | iex\"");
#else
return system("curl -fsSL https://llama.app/install.sh | sh");
#endif
#else
printf("Updates are available only when installed from https://llama.app\n");
return 1;
#endif
}
static const char * progname;
@@ -46,21 +51,29 @@ struct command {
int (*func)(int, char **);
};
#ifdef LLAMA_INSTALL_BUILD
#define UPDATE_HIDDEN false
#else
#define UPDATE_HIDDEN true
#endif
static const command cmds[] = {
{"serve", "HTTP API server", {"server"}, false, llama_server },
{"cli", "Command-line interactive interface", {"client"}, false, llama_cli },
{"update", "Update llama to the latest release", {}, false, llama_update },
{"completion", "Text completion", {"complete"}, true, llama_completion },
{"bench", "Benchmark prompt processing and text generation", {}, true, llama_bench },
{"batched-bench", "Benchmark batched decoding performance", {}, true, llama_batched_bench},
{"fit-params", "Compute parameters to fit a model in device memory", {}, true, llama_fit_params },
{"quantize", "Quantize a model", {}, true, llama_quantize },
{"perplexity", "Compute model perplexity and KL divergence", {}, true, llama_perplexity },
{"version", "Show version", {}, false, version },
{"licenses", "Show third-party licenses", {"credits"}, false, licenses },
{"help", "Show available commands", {}, false, help },
{"serve", "HTTP API server", {"server"}, false, llama_server },
{"cli", "Command-line interactive interface", {"client"}, false, llama_cli },
{"update", "Update llama to the latest release", {}, UPDATE_HIDDEN, llama_update },
{"completion", "Text completion", {"complete"}, true, llama_completion },
{"bench", "Benchmark prompt processing and text generation", {}, true, llama_bench },
{"batched-bench", "Benchmark batched decoding performance", {}, true, llama_batched_bench},
{"fit-params", "Compute parameters to fit a model in device memory", {}, true, llama_fit_params },
{"quantize", "Quantize a model", {}, true, llama_quantize },
{"perplexity", "Compute model perplexity and KL divergence", {}, true, llama_perplexity },
{"version", "Show version", {}, false, version },
{"licenses", "Show third-party licenses", {"credits"}, false, licenses },
{"help", "Show available commands", {}, false, help },
};
#undef UPDATE_HIDDEN
static int version(int argc, char ** argv) {
printf("%s\n", llama_build_info());
return 0;
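The `UPDATE_HIDDEN` toggle in this hunk can be sketched in isolation as below. This is a minimal illustration under stated assumptions, not the project's code: `demo_command`, `demo_cmds`, `visible_commands`, and the `DEMO_INSTALL_BUILD` macro (standing in for `LLAMA_INSTALL_BUILD`) are hypothetical names.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for LLAMA_INSTALL_BUILD: define it at compile time
// (e.g. -DDEMO_INSTALL_BUILD) to expose the "update" command.
#ifdef DEMO_INSTALL_BUILD
#define DEMO_UPDATE_HIDDEN false
#else
#define DEMO_UPDATE_HIDDEN true
#endif

// Trimmed-down command entry: only the fields needed for the sketch.
struct demo_command {
    const char * name;
    bool         hidden;
};

static const demo_command demo_cmds[] = {
    {"serve",  false},
    {"update", DEMO_UPDATE_HIDDEN},
    {"help",   false},
};

// The macro is only needed while the table is being initialized.
#undef DEMO_UPDATE_HIDDEN

// Collect the names of all commands that a help listing would show.
static std::vector<std::string> visible_commands() {
    std::vector<std::string> out;
    for (const auto & c : demo_cmds) {
        if (!c.hidden) {
            out.push_back(c.name);
        }
    }
    return out;
}
```

Because the flag is resolved by the preprocessor, the command stays in the table and remains callable; it is only left out of listings in builds that lack the define.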
+34 -78
@@ -285,58 +285,15 @@ static std::string clean_file_name(const std::string & fname) {
return clean_fname;
}
static bool common_params_handle_remote_preset(common_params & params, llama_example ex) {
GGML_ASSERT(!params.model.hf_repo.empty());
// the returned hf_repo is without tag
auto [hf_repo, hf_tag] = common_download_split_repo_tag(params.model.hf_repo);
// "latest" tag (default if not specified) is translated to "default" preset
if (hf_tag == "latest") {
hf_tag = "default";
}
std::string model_endpoint = common_get_model_endpoint();
auto preset_url = model_endpoint + hf_repo + "/resolve/main/preset.ini";
// prepare local path for caching
auto preset_fname = clean_file_name(hf_repo + "_preset.ini");
auto preset_path = fs_get_cache_file(preset_fname);
common_download_opts opts;
opts.bearer_token = params.hf_token;
opts.offline = params.offline;
LOG_TRC("%s: looking for remote preset at %s\n", __func__, preset_url.c_str());
const int status = common_download_file_single(preset_url, preset_path, opts);
const bool has_preset = status >= 200 && status < 400;
// remote preset is optional, so we don't error out if not found
if (has_preset) {
LOG_TRC("%s: applying remote preset from %s\n", __func__, preset_url.c_str());
common_preset_context ctx(ex, /* only_remote_allowed */ true);
common_preset global;
auto remote_presets = ctx.load_from_ini(preset_path, global);
remote_presets = ctx.cascade(global, remote_presets);
if (remote_presets.find(hf_tag) != remote_presets.end()) {
common_preset preset = remote_presets.at(hf_tag);
LOG_INF("\n%s", preset.to_ini().c_str()); // to_ini already added trailing newline
preset.apply_to_params(params);
} else {
throw std::runtime_error("Remote preset.ini does not contain [" + std::string(hf_tag) + "] section");
}
} else {
LOG_TRC("%s: no remote preset found, skipping\n", __func__);
}
return has_preset;
}
struct handle_model_result {
bool found_mmproj = false;
common_params_model mmproj;
bool found_mtp = false;
common_params_model mtp;
bool found_preset = false;
std::string preset_path;
};
static handle_model_result common_params_handle_model(struct common_params_model & model,
@@ -355,6 +312,12 @@ static handle_model_result common_params_handle_model(struct common_params_model
common_download_opts hf_opts = opts;
auto download_result = common_download_model(model, hf_opts);
if (!download_result.preset_path.empty()) {
result.found_preset = true;
result.preset_path = download_result.preset_path;
return result; // skip everything else if preset.ini is used
}
if (download_result.model_path.empty()) {
throw std::runtime_error("failed to download model from Hugging Face");
}
@@ -454,6 +417,17 @@ bool common_params_handle_models(common_params & params, llama_example curr_ex)
try {
auto res = common_params_handle_model(params.model, opts);
if (res.found_preset) {
if (!params.models_preset.empty()) {
throw std::invalid_argument("cannot use both --models-preset and -hf with a preset.ini file");
}
// if HF repo is a preset repo, we simply run server in router mode with the preset.ini file
params.models_preset_hf = params.model.hf_repo; // only for showing a warning
params.models_preset = res.preset_path;
params.model = common_params_model{}; // make sure to clear model, so server starts in router mode
return true;
}
if (params.no_mmproj) {
params.mmproj = {};
} else if (res.found_mmproj && params.mmproj.path.empty() && params.mmproj.url.empty()) {
@@ -601,30 +575,6 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
// parse the first time to get -hf option (used for remote preset)
parse_cli_args();
// export_graph_ops loads only metadata
const bool skip_model_download = ctx_arg.ex == LLAMA_EXAMPLE_EXPORT_GRAPH_OPS;
// maybe handle remote preset
if (!params.model.hf_repo.empty() && !skip_model_download) {
std::string cli_hf_repo = params.model.hf_repo;
bool has_preset = common_params_handle_remote_preset(params, ctx_arg.ex);
// special case: if hf_repo explicitly set by preset, we need to preserve it (ignore CLI value)
// this is useful when we have one HF repo pointing to other HF repos (one model - multiple GGUFs)
std::string preset_hf_repo = params.model.hf_repo;
bool preset_has_hf_repo = preset_hf_repo != cli_hf_repo;
if (has_preset) {
// re-parse CLI args to override preset values
parse_cli_args();
}
// preserve hf_repo from preset if needed
if (preset_has_hf_repo) {
params.model.hf_repo = preset_hf_repo;
}
}
postprocess_cpu_params(params.cpuparams, nullptr);
postprocess_cpu_params(params.cpuparams_batch, &params.cpuparams);
@@ -635,15 +585,21 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
throw std::invalid_argument("error: --prompt-cache-all not supported in interactive mode yet\n");
}
// handle model and download
if (!skip_model_download) {
common_params_handle_models(params, ctx_arg.ex);
}
// export_graph_ops loads only metadata
const bool skip_model_download = ctx_arg.ex == LLAMA_EXAMPLE_EXPORT_GRAPH_OPS;
// model is required (except for server)
// TODO @ngxson : maybe show a list of available models in CLI in this case
if (params.model.path.empty() && ctx_arg.ex != LLAMA_EXAMPLE_SERVER && !skip_model_download && !params.usage && !params.completion) {
throw std::invalid_argument("error: --model is required\n");
if (!skip_model_download) {
// handle model and download
common_params_handle_models(params, ctx_arg.ex);
// model is required (except for server)
// TODO @ngxson : maybe show a list of available models in CLI in this case
if (params.model.path.empty()
&& ctx_arg.ex != LLAMA_EXAMPLE_SERVER
&& !params.usage
&& !params.completion) {
throw std::invalid_argument("error: --model is required\n");
}
}
if (params.escape) {
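The hand-off from `-hf` to router mode shown in these hunks can be sketched as follows. It is a simplified model, not the actual implementation: `demo_params` and `apply_preset_repo` are hypothetical, and the real code works on `common_params` and `handle_model_result`.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical, trimmed-down stand-in for common_params.
struct demo_params {
    std::string model_path;       // local model path (empty in router mode)
    std::string hf_repo;          // value passed via -hf
    std::string models_preset;    // preset file used by the router
    std::string models_preset_hf; // kept only so a warning can be shown later
};

// Returns true when the repo turned out to be a preset repo and the
// parameters were switched over to router mode.
static bool apply_preset_repo(demo_params & p, const std::string & preset_path) {
    if (preset_path.empty()) {
        return false; // ordinary model repo, nothing to do
    }
    if (!p.models_preset.empty()) {
        throw std::invalid_argument("cannot use both --models-preset and -hf with a preset.ini file");
    }
    p.models_preset_hf = p.hf_repo;  // remember the origin before clearing it
    p.models_preset    = preset_path;
    p.model_path.clear();            // no model set: the server starts in router mode
    p.hf_repo.clear();
    return true;
}
```

The sketch keeps the same ordering as the diff: the conflict check runs before any field is modified, so a rejected combination leaves the parameters untouched.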
+5 -4
@@ -642,10 +642,11 @@ struct common_params {
std::vector<std::string> server_tools;
// router server configs
std::string models_dir = ""; // directory containing models for the router server
std::string models_preset = ""; // directory containing model presets for the router server
int models_max = 4; // maximum number of models to load simultaneously
bool models_autoload = true; // automatically load models when requested via the router server
std::string models_dir = ""; // directory containing models for the router server
std::string models_preset = ""; // directory containing model presets for the router server
int models_max = 4; // maximum number of models to load simultaneously
bool models_autoload = true; // automatically load models when requested via the router server
std::string models_preset_hf = ""; // show a warning about remote presets on router loaded (if not empty)
bool log_json = false;
+120 -17
@@ -696,6 +696,7 @@ struct hf_plan {
hf_cache::hf_files model_files;
hf_cache::hf_file mmproj;
hf_cache::hf_file mtp;
hf_cache::hf_file preset; // if set, only this file is downloaded
};
static hf_plan get_hf_plan(const common_params_model & model,
@@ -717,6 +718,14 @@ static hf_plan get_hf_plan(const common_params_model & model,
return plan;
}
// if preset.ini exists in the repo root, download only that file
for (const auto & f : all) {
if (f.path == "preset.ini") {
plan.preset = f;
return plan;
}
}
hf_cache::hf_file primary;
if (!model.hf_file.empty()) {
@@ -794,14 +803,19 @@ common_download_model_result common_download_model(const common_params_model &
if (is_hf) {
hf = get_hf_plan(model, opts, download_mmproj, download_mtp);
for (const auto & f : hf.model_files) {
tasks.push_back({f.url, f.local_path});
}
if (!hf.mmproj.path.empty()) {
tasks.push_back({hf.mmproj.url, hf.mmproj.local_path});
}
if (!hf.mtp.path.empty()) {
tasks.push_back({hf.mtp.url, hf.mtp.local_path});
if (!hf.preset.path.empty()) {
// if preset.ini exists, only download that file alone
tasks.push_back({hf.preset.url, hf.preset.local_path});
} else {
for (const auto & f : hf.model_files) {
tasks.push_back({f.url, f.local_path});
}
if (!hf.mmproj.path.empty()) {
tasks.push_back({hf.mmproj.url, hf.mmproj.local_path});
}
if (!hf.mtp.path.empty()) {
tasks.push_back({hf.mtp.url, hf.mtp.local_path});
}
}
} else if (!model.url.empty()) {
tasks = get_url_tasks(model);
@@ -835,17 +849,22 @@ common_download_model_result common_download_model(const common_params_model &
}
if (is_hf) {
for (const auto & f : hf.model_files) {
hf_cache::finalize_file(f);
}
result.model_path = hf.primary.final_path;
if (!hf.preset.path.empty()) {
// if preset.ini is used, do not set other paths
result.preset_path = hf_cache::finalize_file(hf.preset);
} else {
for (const auto & f : hf.model_files) {
hf_cache::finalize_file(f);
}
result.model_path = hf.primary.final_path;
if (!hf.mmproj.path.empty()) {
result.mmproj_path = hf_cache::finalize_file(hf.mmproj);
}
if (!hf.mmproj.path.empty()) {
result.mmproj_path = hf_cache::finalize_file(hf.mmproj);
}
if (!hf.mtp.path.empty()) {
result.mtp_path = hf_cache::finalize_file(hf.mtp);
if (!hf.mtp.path.empty()) {
result.mtp_path = hf_cache::finalize_file(hf.mtp);
}
}
} else {
result.model_path = model.path;
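The "preset.ini wins" rule from the hunks above can be illustrated with a small stand-alone sketch. The types `demo_file` and `demo_plan` and the functions `make_plan` and `download_tasks` are hypothetical simplifications of `hf_cache::hf_file`, `hf_plan`, `get_hf_plan`, and the task-building code; the `.gguf` suffix check is an assumption made only for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for hf_cache::hf_file.
struct demo_file {
    std::string path;
};

// Hypothetical, simplified stand-in for hf_plan.
struct demo_plan {
    std::vector<demo_file> model_files;
    demo_file              preset; // non-empty path means "preset repo"
};

// If preset.ini sits in the repo root, plan to fetch only that file.
static demo_plan make_plan(const std::vector<demo_file> & all) {
    demo_plan plan;
    for (const auto & f : all) {
        if (f.path == "preset.ini") { // exact match: nested copies do not count
            plan.preset = f;
            return plan;
        }
    }
    for (const auto & f : all) {
        // assumption for this sketch: every .gguf file is a model file
        const std::string ext = ".gguf";
        if (f.path.size() >= ext.size() &&
            f.path.compare(f.path.size() - ext.size(), ext.size(), ext) == 0) {
            plan.model_files.push_back(f);
        }
    }
    return plan;
}

// Build the download list: the preset alone, or else the model files.
static std::vector<std::string> download_tasks(const demo_plan & plan) {
    std::vector<std::string> tasks;
    if (!plan.preset.path.empty()) {
        tasks.push_back(plan.preset.path);
    } else {
        for (const auto & f : plan.model_files) {
            tasks.push_back(f.path);
        }
    }
    return tasks;
}
```

Returning early from the planning step is what keeps model weights from being downloaded for a repo that only serves as a preset.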
@@ -997,3 +1016,87 @@ std::vector<common_cached_model_info> common_list_cached_models() {
return result;
}
bool common_download_remove(const std::string & hf_repo_with_tag) {
namespace fs = std::filesystem;
auto [repo_id, tag] = common_download_split_repo_tag(hf_repo_with_tag);
if (tag.empty()) {
return hf_cache::remove_cached_repo(repo_id);
}
std::string tag_upper = tag;
for (char & c : tag_upper) {
c = (char) std::toupper((unsigned char) c);
}
auto files = hf_cache::get_cached_files(repo_id);
if (files.empty()) {
return false;
}
// collect snapshot entries whose tag matches
std::vector<fs::path> to_remove;
for (const auto & f : files) {
auto split = get_gguf_split_info(f.path);
if (split.tag == tag_upper) {
to_remove.emplace_back(f.local_path);
}
}
if (to_remove.empty()) {
return false;
}
// resolve blob paths from symlinks before deleting snapshot entries
std::vector<fs::path> blobs_to_check;
for (const auto & p : to_remove) {
std::error_code ec;
if (fs::is_symlink(p, ec)) {
auto target = fs::read_symlink(p, ec);
if (!ec) {
blobs_to_check.push_back((p.parent_path() / target).lexically_normal());
}
}
}
// remove snapshot entries
for (const auto & p : to_remove) {
std::error_code ec;
fs::remove(p, ec);
if (ec) {
LOG_WRN("%s: failed to remove %s: %s\n", __func__, p.string().c_str(), ec.message().c_str());
}
}
if (blobs_to_check.empty()) {
return true;
}
// collect blobs still referenced by remaining snapshot entries
std::unordered_set<std::string> still_referenced;
for (const auto & f : hf_cache::get_cached_files(repo_id)) {
fs::path p(f.local_path);
std::error_code ec;
if (fs::is_symlink(p, ec)) {
auto target = fs::read_symlink(p, ec);
if (!ec) {
still_referenced.insert((p.parent_path() / target).lexically_normal().string());
}
}
}
// remove orphaned blobs
for (const auto & blob : blobs_to_check) {
if (still_referenced.find(blob.string()) == still_referenced.end()) {
std::error_code ec;
fs::remove(blob, ec);
if (ec) {
LOG_WRN("%s: failed to remove blob %s: %s\n", __func__, blob.string().c_str(), ec.message().c_str());
}
}
}
return true;
}
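The orphaned-blob logic in `common_download_remove` boils down to a reference check, which the following sketch models in memory. It deliberately avoids the filesystem: `demo_snapshot` (a map from snapshot entry name to blob name) and `remove_and_find_orphans` are hypothetical, whereas the real code resolves symlinks with `std::filesystem`.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory model of the cache:
// snapshot entry name -> name of the blob it links to.
using demo_snapshot = std::map<std::string, std::string>;

// Remove the listed snapshot entries and return the blobs that no
// remaining entry references any more (i.e. the blobs safe to delete).
static std::vector<std::string> remove_and_find_orphans(demo_snapshot & snap,
                                                        const std::vector<std::string> & to_remove) {
    // 1. remember which blobs the doomed entries point at, then drop the entries
    std::set<std::string> blobs_to_check;
    for (const auto & name : to_remove) {
        auto it = snap.find(name);
        if (it != snap.end()) {
            blobs_to_check.insert(it->second);
            snap.erase(it);
        }
    }

    // 2. collect blobs still referenced by the surviving entries
    std::set<std::string> still_referenced;
    for (const auto & kv : snap) {
        still_referenced.insert(kv.second);
    }

    // 3. anything checked but no longer referenced is an orphan
    std::vector<std::string> orphans;
    for (const auto & blob : blobs_to_check) {
        if (still_referenced.count(blob) == 0) {
            orphans.push_back(blob);
        }
    }
    return orphans;
}
```

As in the diff, blob targets are recorded before the entries are removed, since the link information is gone once an entry is deleted.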
+8
@@ -63,6 +63,7 @@ struct common_download_model_result {
std::string model_path;
std::string mmproj_path;
std::string mtp_path;
std::string preset_path;
};
// throw if the file is missing or invalid (e.g. ETag check failed)
@@ -115,3 +116,10 @@ int common_download_file_single(const std::string & url,
// resolve and download model from Docker registry
// return local path to downloaded model file
std::string common_docker_resolve_model(const std::string & docker);
// Remove a cached model from disk
// input format: "user/model" or "user/model:tag"
// - if tag is omitted, removes the entire repo cache directory
// - if tag is present, removes only files matching that tag (and orphaned blobs)
// returns true if anything was removed
bool common_download_remove(const std::string & hf_repo_with_tag);
+15
@@ -495,4 +495,19 @@ std::string finalize_file(const hf_file & file) {
return file.final_path;
}
bool remove_cached_repo(const std::string & repo_id) {
if (!is_valid_repo_id(repo_id)) {
LOG_WRN("%s: invalid repository: %s\n", __func__, repo_id.c_str());
return false;
}
fs::path repo_path = get_repo_path(repo_id);
std::error_code ec;
auto removed = fs::remove_all(repo_path, ec);
if (ec) {
LOG_ERR("%s: failed to remove repo cache %s: %s\n", __func__, repo_path.string().c_str(), ec.message().c_str());
return false;
}
return removed > 0;
}
} // namespace hf_cache
+3
@@ -29,4 +29,7 @@ hf_files get_cached_files(const std::string & repo_id = {});
// Create snapshot path (link or move/copy) and return it
std::string finalize_file(const hf_file & file);
// Remove the entire cached directory for a repo, returns true if removed
bool remove_cached_repo(const std::string & repo_id);
} // namespace hf_cache
+1 -49
@@ -16,48 +16,6 @@ static std::string rm_leading_dashes(const std::string & str) {
return str.substr(pos);
}
// only allow a subset of args for remote presets for security reasons
// do not add more args unless absolutely necessary
// args that output to files are strictly prohibited
static std::set<std::string> get_remote_preset_whitelist(const std::map<std::string, common_arg> & key_to_opt) {
static const std::set<std::string> allowed_options = {
"model-url",
"hf-repo",
"hf-repo-draft",
"hf-repo-v", // vocoder
"hf-file-v", // vocoder
"mmproj-url",
"pooling",
"jinja",
"batch-size",
"ubatch-size",
"cache-reuse",
"chat-template-kwargs",
"mmap",
// note: sampling params are automatically allowed by default
// negated args will be added automatically if the positive arg is specified above
};
std::set<std::string> allowed_keys;
for (const auto & it : key_to_opt) {
const std::string & key = it.first;
const common_arg & opt = it.second;
if (allowed_options.find(key) != allowed_options.end() || opt.is_sampling) {
allowed_keys.insert(key);
// also add variant keys (args without leading dashes and env vars)
for (const auto & arg : opt.get_args()) {
allowed_keys.insert(rm_leading_dashes(arg));
}
for (const auto & env : opt.get_env()) {
allowed_keys.insert(env);
}
}
}
return allowed_keys;
}
std::vector<std::string> common_preset::to_args(const std::string & bin_path) const {
std::vector<std::string> args;
@@ -300,16 +258,10 @@ static std::string parse_bool_arg(const common_arg & arg, const std::string & ke
return value;
}
common_preset_context::common_preset_context(llama_example ex, bool only_remote_allowed)
common_preset_context::common_preset_context(llama_example ex)
: ctx_params(common_params_parser_init(default_params, ex)) {
common_params_add_preset_options(ctx_params.options);
key_to_opt = get_map_key_opt(ctx_params);
// setup allowed keys if only_remote_allowed is true
if (only_remote_allowed) {
filter_allowed_keys = true;
allowed_keys = get_remote_preset_whitelist(key_to_opt);
}
}
common_presets common_preset_context::load_from_ini(const std::string & path, common_preset & global) const {
+1 -1
@@ -60,7 +60,7 @@ struct common_preset_context {
std::set<std::string> allowed_keys;
// if only_remote_allowed is true, only accept whitelisted keys
common_preset_context(llama_example ex, bool only_remote_allowed = false);
common_preset_context(llama_example ex);
// load presets from INI file
common_presets load_from_ini(const std::string & path, common_preset & global) const;
+3
@@ -259,6 +259,9 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
}
}
}
if (!grmr && !grammar_str.empty()) {
throw std::runtime_error("failed to parse grammar");
}
// Compute prefill tokens from the generation prompt
std::vector<llama_token> prefill_tokens;
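The fail-loudly behaviour added in this hunk (throw when a non-empty grammar string yields no grammar) can be sketched as below. `demo_grammar`, `demo_parse`, and `init_grammar` are hypothetical, and the `::=` check is a toy stand-in for real grammar parsing; only the final guard mirrors the diff.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical grammar object for the sketch.
struct demo_grammar {
    std::string src;
};

// Hypothetical toy parser: returns nullptr when the text has no rule
// definition. Real grammar parsing is far more involved.
static std::unique_ptr<demo_grammar> demo_parse(const std::string & s) {
    if (s.find("::=") == std::string::npos) {
        return nullptr;
    }
    return std::make_unique<demo_grammar>(demo_grammar{s});
}

// An empty string means "no constraint"; a non-empty string that fails
// to parse is an error instead of a silently dropped constraint.
static std::unique_ptr<demo_grammar> init_grammar(const std::string & grammar_str) {
    std::unique_ptr<demo_grammar> grmr;
    if (!grammar_str.empty()) {
        grmr = demo_parse(grammar_str);
    }
    if (!grmr && !grammar_str.empty()) {
        throw std::runtime_error("failed to parse grammar");
    }
    return grmr;
}
```

A caller such as an HTTP handler can catch the exception and map it to a client error, which matches the commit's stated goal of returning HTTP 400.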
+62 -2
@@ -161,6 +161,64 @@ You could update your test result in it directly.
Please refer to [Docker with SYCL](../docker.md#docker-with-sycl) for details.
## Quick Development WOW
This chapter covers quick development and testing with the SYCL backend on Intel GPUs.
Install the following software before development:
- Intel GPU driver
- oneAPI package
- other development tools.
Refer to [Linux](#linux) or [Windows](#windows-1) for detailed installation and troubleshooting guides.
- Linux
```
## build from source code
./examples/sycl/build.sh
## run CONV_2D_DW unit test cases
./build/bin/test-backend-ops -b SYCL0 -o CONV_2D_DW
## run all unit test cases
./build/bin/test-backend-ops -b SYCL0
## run with LLM on the first GPU
./examples/sycl/test.sh -mg 0 -m xxxx.gguf
## run service with LLM on the first GPU
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
./examples/sycl/start-svr.sh -m xxxx.gguf
## update the docs/ops.md for new/update OPs
./examples/sycl/update-ops-doc.sh
```
- Windows
```
## build from source code
examples\sycl\win-build-sycl.bat
## run CONV_2D_DW unit test cases
build\bin\test-backend-ops.exe -b SYCL0 -o CONV_2D_DW
## run all unit test cases
build\bin\test-backend-ops.exe -b SYCL0
## run LLM on the first GPU
examples\sycl\win-test.bat -mg 0 -m xxxx.gguf
## run service with LLM on the first GPU
set ONEAPI_DEVICE_SELECTOR="level_zero:0"
examples\sycl\win-start-svr.bat -m xxxx.gguf
## update the docs/ops.md for new/update OPs
examples\sycl\win-update-ops-doc.bat
```
## Linux
### I. Setup Environment
@@ -701,7 +759,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_GRAPH | ON *(default)* \|OFF *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
| GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. |
| GGML_SYCL_HOST_MEM_FALLBACK | ON *(default)* \|OFF *(Optional)* | Allow host memory fallback when device memory is full during quantized weight reorder. Enables inference to continue at reduced speed (reading over PCIe) instead of failing. Requires Linux kernel 6.8+. |
| GGML_SYCL_SUPPORT_LEVEL_ZERO | ON *(default)* \|OFF *(Optional)* | Enable Level Zero API for device memory allocation. Requires Level Zero headers/library at build time and Intel GPU driver (Level Zero runtime) at run time. Reduces system RAM usage during multi-GPU inference. |
| GGML_SYCL_SUPPORT_LEVEL_ZERO_API | ON *(default)* \|OFF *(Optional)* | Enable support for using the Level Zero API for device memory allocation. Requires Level Zero headers/library at build time and Intel GPU driver (Level Zero runtime) at run time. Reduces system RAM usage during multi-GPU inference. The SYCL backend always runs on the Level Zero runtime, even when this option is OFF (the SYCL API is then used for memory allocation). |
| CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. |
| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)* | Set `icpx/icx` compiler for SYCL code path. |
@@ -712,10 +770,11 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| Name | Value | Function |
|-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------|
| GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
| GGML_SYCL_DEV2DEV_MEMCPY | 0 (default) or 1 | Choose the SYCL or L0 API for dev2dev memory copy.<br>Value: <br>* 0: SYCL API (default)<br>* 1: L0 API. The L0 API has been found to cause abnormal crashes in some cases; this debug flag is used to investigate the issue.|
| GGML_SYCL_ENABLE_FLASH_ATTN | 1 (default) or 0| Enable Flash-Attention. It can reduce memory usage. The performance impact depends on the LLM.|
| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimized features for Intel GPUs. (Recommended: set to 1 for Intel devices older than Gen 10) |
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through the SYCL Graphs feature. Graphs are disabled by default because SYCL Graph is still under development and does not yet improve performance. |
| GGML_SYCL_ENABLE_LEVEL_ZERO | 1 (default) or 0 | Use Level Zero API for device memory allocation instead of SYCL. Reduces system RAM usage on Intel dGPUs by avoiding DMA-buf/TTM host memory staging. Requires GGML_SYCL_SUPPORT_LEVEL_ZERO=ON at build time. |
| GGML_SYCL_USE_LEVEL_ZERO_API | 1 (default) or 0 | Use Level Zero API for device memory allocation instead of SYCL. Reduces system RAM usage on Intel dGPUs by avoiding DMA-buf/TTM host memory staging. Requires GGML_SYCL_SUPPORT_LEVEL_ZERO_API=ON at build time. The SYCL backend always runs on the Level Zero runtime, even when this is set to 0 (in that case the SYCL API is used for memory allocation).|
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
| GGML_SYCL_ENABLE_VMM | 0 or 1 (default) | Enable the virtual-memory device pool. |
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Enables querying free GPU memory via sycl::aspect::ext_intel_free_memory.<br>Recommended when --split-mode = layer |
@@ -731,6 +790,7 @@ Pass these via `CXXFLAGS` or add a one-off `#define` to enable a flag on the spo
| DEBUG_SYCL_POOL | Enable device memory pool logging on teardown. Useful for profiling allocations. |
| DEBUG_SYCL_MALLOC | Enable verbose per-call logging of device pool alloc/free operations. |
## Design Rule
- Open to all contributors.
+3 -2
@@ -1,10 +1,11 @@
# Multimodal
llama.cpp supports multimodal input via `libmtmd`. Currently, 2 tools support this feature:
- [llama-mtmd-cli](../tools/mtmd/README.md)
- [llama-cli](../tools/cli/README.md)
- [llama-server](../tools/server/README.md) via OpenAI-compatible `/chat/completions` API
- [llama-mtmd-cli](../tools/mtmd/README.md), for testing and development
Currently, we support **image** and **audio** input. Audio is highly experimental and may have reduced quality.
Currently, we support **image**, **audio** and **video** input.
To enable it, you can use one of the 2 methods below:
+4 -4
@@ -27,11 +27,11 @@ Legend:
| COL2IM_1D | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CONCAT | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | ✅ | ✅ | ❌ | ❌ |
| CONT | ❌ | 🟡 | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| CONV_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ❌ | ❌ |
| CONV_2D_DW | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | | ✅ | ❌ | ❌ | ❌ |
| CONV_3D | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | | ❌ | ❌ | ❌ | ❌ |
| CONV_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ❌ | ❌ |
| CONV_2D_DW | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | | ✅ | ❌ | ❌ | ❌ |
| CONV_3D | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | | ❌ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_1D | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | | ✅ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | | ✅ | ❌ | ❌ | ❌ |
| COS | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| COUNT_EQUAL | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| CPY | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
+1840 -1840
File diff suppressed because it is too large
+36 -38
@@ -8,55 +8,53 @@ The INI preset feature, introduced in [PR#17859](https://github.com/ggml-org/lla
When running multiple models on the server (router mode), INI preset files can be used to configure model-specific parameters. Please refer to the [server documentation](../tools/server/README.md) for more details.
### Using a Remote Preset
### Using a Hugging Face Preset
> [!NOTE]
> [!IMPORTANT]
>
> This feature is currently only supported via the `-hf` option.
> Please only use presets that you can trust! Unknown presets may be unsafe
For GGUF models hosted on Hugging Face, you can include a `preset.ini` file in the root directory of the repository to define specific configurations for that model.
You can push your preset to Hugging Face Hub and share with other users by:
1. Creating an empty model repository on Hugging Face
2. Creating a `preset.ini` file in the root directory of the repository
Example:
Example of a `preset.ini`:
```ini
hf-repo-draft = username/my-draft-model-GGUF
temp = 0.5
top-k = 20
top-p = 0.95
[*]
ctx-size = 0
mmap = 1
kv-unified = 1
parallel = 4
spec-default = 1
[Qwen3.5-4B]
hf = unsloth/Qwen3.5-4B-GGUF:Q4_K_M
ctx-size = 262144
batch-size = 2048
ubatch-size = 2048
top-p = 1.0
top-k = 0
min-p = 0.01
temp = 1.0
[gpt-oss-120b-hf]
hf = ggml-org/gpt-oss-120b-GGUF
ctx-size = 262144
batch-size = 2048
ubatch-size = 2048
top-p = 1.0
top-k = 0
min-p = 0.01
temp = 1.0
chat-template-kwargs = {"reasoning_effort": "high"}
```
For security reasons, only certain options are allowed. Please refer to [preset.cpp](../common/preset.cpp) for the complete list of permitted options.
Example usage:
Assuming your repository `username/my-model-with-preset` contains a `preset.ini` with the configuration above:
```sh
llama-cli -hf username/my-model-with-preset
# This is equivalent to:
llama-cli -hf username/my-model-with-preset \
--hf-repo-draft username/my-draft-model-GGUF \
--temp 0.5 \
--top-k 20 \
--top-p 0.95
```
You can also override preset arguments by specifying them on the command line:
The preset will be loaded similarly to the `--models-preset` option. Therefore, you can also override certain params via CLI arguments:
```sh
# Force temp = 0.1, overriding the preset value
llama-cli -hf username/my-model-with-preset --temp 0.1
```
If you want to define multiple preset configurations for one or more GGUF models, you can create a blank HF repo for each preset. Each HF repo should contain a `preset.ini` file that references the actual model(s):
```ini
hf-repo = user/my-model-main
hf-repo-draft = user/my-model-draft
temp = 0.8
ctx-size = 1024
; (and other configurations)
llama-cli -hf username/my-preset --temp 0.1
```
### Named presets
+9
@@ -0,0 +1,9 @@
#!/bin/bash
# MIT license
# Copyright (C) 2026 Intel Corporation
# SPDX-License-Identifier: MIT
./build/bin/test-backend-ops support --output csv > docs/ops/SYCL.csv
./scripts/create_ops_docs.py
+8
@@ -0,0 +1,8 @@
@echo off
rem MIT license
rem Copyright (C) 2026 Intel Corporation
rem SPDX-License-Identifier: MIT
build\bin\test-backend-ops support --output csv > docs\ops\SYCL.csv
python scripts\create_ops_docs.py
+1 -1
@@ -249,7 +249,7 @@ option(GGML_SYCL "ggml: use SYCL"
option(GGML_SYCL_F16 "ggml: use 16 bit floats for sycl calculations" OFF)
option(GGML_SYCL_GRAPH "ggml: enable graphs in the SYCL backend" ON)
option(GGML_SYCL_HOST_MEM_FALLBACK "ggml: allow host memory fallback in SYCL reorder (requires kernel 6.8+)" ON)
option(GGML_SYCL_SUPPORT_LEVEL_ZERO "ggml: use Level Zero API in SYCL backend" ON)
option(GGML_SYCL_SUPPORT_LEVEL_ZERO_API "ggml: use Level Zero API in SYCL backend" ON)
option(GGML_SYCL_DNN "ggml: enable oneDNN in the SYCL backend" ON)
set (GGML_SYCL_TARGET "INTEL" CACHE STRING
"ggml: sycl target device")
+8 -1
@@ -438,7 +438,14 @@ if (GGML_CPU_ALL_VARIANTS)
ggml_add_cpu_backend_variant(power8_2 POWER8 VSX)
ggml_add_cpu_backend_variant(power9 POWER9 VSX)
ggml_add_cpu_backend_variant(power10 POWER10 VSX)
ggml_add_cpu_backend_variant(power11 POWER11 VSX)
# POWER11 backend: only if compiler supports -mcpu=power11
check_cxx_compiler_flag("-mcpu=power11" GGML_CXX_SUPPORTS_POWER11)
if (GGML_CXX_SUPPORTS_POWER11)
message(STATUS "Compiler supports -mcpu=power11, enabling POWER11 backend")
ggml_add_cpu_backend_variant(power11 POWER11 VSX)
else()
message(STATUS "Skipping POWER11 backend: compiler does not support -mcpu=power11")
endif()
else()
message(FATAL_ERROR "Unsupported PowerPC target OS: ${CMAKE_SYSTEM_NAME}")
endif()
+1 -1
@@ -389,7 +389,7 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
string(REGEX MATCHALL "POWER *([0-9]+)" MATCHED_STRING "${POWER10_M_UPPER}")
string(REGEX REPLACE "POWER *([0-9]+)" "\\1" EXTRACTED_NUMBER "${MATCHED_STRING}")
if (EXTRACTED_NUMBER GREATER_EQUAL 10)
if (EXTRACTED_NUMBER EQUAL 10 OR EXTRACTED_NUMBER EQUAL 11)
list(APPEND ARCH_FLAGS -mcpu=power10)
elseif (EXTRACTED_NUMBER EQUAL 9)
list(APPEND ARCH_FLAGS -mcpu=power9)
+9 -66
@@ -622,18 +622,6 @@ ggml_backend_cuda_context::~ggml_backend_cuda_context() {
// cuda buffer
struct ggml_backend_cuda_device_context {
int device;
std::string name;
std::string description;
std::string pci_bus_id;
int op_offload_min_batch_size;
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
std::mutex device_mutex;
int active_count = 0;
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
};
struct ggml_backend_cuda_buffer_context {
int device;
void * dev_ptr = nullptr;
@@ -651,13 +639,6 @@ struct ggml_backend_cuda_buffer_context {
static void ggml_backend_cuda_buffer_free_buffer(ggml_backend_buffer_t buffer) {
ggml_backend_cuda_buffer_context * ctx = (ggml_backend_cuda_buffer_context *)buffer->context;
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
ggml_backend_cuda_device_context * dev_ctx = (ggml_backend_cuda_device_context *) buffer->buft->device->context;
std::lock_guard<std::mutex> lock(dev_ctx->device_mutex);
dev_ctx->active_count--;
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
delete ctx;
}
@@ -810,12 +791,6 @@ static ggml_backend_buffer_t ggml_backend_cuda_buffer_type_alloc_buffer(ggml_bac
ggml_backend_cuda_buffer_context * ctx = new ggml_backend_cuda_buffer_context(buft_ctx->device, dev_ptr);
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
ggml_backend_cuda_device_context * dev_ctx = (ggml_backend_cuda_device_context *) buft->device->context;
std::lock_guard<std::mutex> lock(dev_ctx->device_mutex);
dev_ctx->active_count++;
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
return ggml_backend_buffer_init(buft, ggml_backend_cuda_buffer_interface, ctx, size);
}
@@ -1515,12 +1490,6 @@ static bool ggml_backend_buft_is_cuda_host(ggml_backend_buffer_type_t buft) {
}
static void ggml_backend_cuda_host_buffer_free_buffer(ggml_backend_buffer_t buffer) {
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
ggml_backend_cuda_device_context * dev_ctx = (ggml_backend_cuda_device_context *) buffer->buft->device->context;
std::lock_guard<std::mutex> lock(dev_ctx->device_mutex);
dev_ctx->active_count--;
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
CUDA_CHECK(cudaFreeHost(buffer->context));
}
@@ -1529,8 +1498,6 @@ static void * ggml_cuda_host_malloc(size_t size) {
return nullptr;
}
ggml_cuda_set_device(0); // cudaMallocHost can create the implicit CUDA device context, make sure that this is consistently done on device 0.
void * ptr = nullptr;
cudaError_t err = cudaMallocHost((void **) &ptr, size);
if (err != cudaSuccess) {
@@ -1556,12 +1523,6 @@ static ggml_backend_buffer_t ggml_backend_cuda_host_buffer_type_alloc_buffer(ggm
buffer->buft = buft;
buffer->iface.free_buffer = ggml_backend_cuda_host_buffer_free_buffer;
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
ggml_backend_cuda_device_context * dev_ctx = (ggml_backend_cuda_device_context *) buft->device->context;
std::lock_guard<std::mutex> lock(dev_ctx->device_mutex);
dev_ctx->active_count++;
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
return buffer;
}
@@ -3179,12 +3140,6 @@ static const char * ggml_backend_cuda_get_name(ggml_backend_t backend) {
static void ggml_backend_cuda_free(ggml_backend_t backend) {
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *)backend->context;
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
ggml_backend_cuda_device_context * dev_ctx = (ggml_backend_cuda_device_context *) backend->device->context;
std::lock_guard<std::mutex> lock(dev_ctx->device_mutex);
dev_ctx->active_count--;
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
delete cuda_ctx;
delete backend;
}
@@ -4916,6 +4871,14 @@ void ggml_backend_cuda_unregister_host_buffer(void * buffer) {
// backend device
struct ggml_backend_cuda_device_context {
int device;
std::string name;
std::string description;
std::string pci_bus_id;
int op_offload_min_batch_size;
};
static const char * ggml_backend_cuda_device_get_name(ggml_backend_dev_t dev) {
ggml_backend_cuda_device_context * ctx = (ggml_backend_cuda_device_context *)dev->context;
return ctx->name.c_str();
@@ -5004,11 +4967,6 @@ static bool ggml_backend_cuda_get_available_uma_memory(long * available_memory_k
static void ggml_backend_cuda_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
ggml_backend_cuda_device_context * ctx = (ggml_backend_cuda_device_context *)dev->context;
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
std::lock_guard<std::mutex> lock(ctx->device_mutex);
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
ggml_cuda_set_device(ctx->device);
CUDA_CHECK(cudaMemGetInfo(free, total));
@@ -5035,13 +4993,6 @@ static void ggml_backend_cuda_device_get_memory(ggml_backend_dev_t dev, size_t *
}
#endif // defined(__linux__)
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
// If no backends or buffers are active, the cudaMemGetInfo call above lazily created a CUDA
// context that permanently consumes VRAM. Reset the device to free it.
if (ctx->active_count == 0) {
CUDA_CHECK(cudaDeviceReset());
}
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
}
static enum ggml_backend_dev_type ggml_backend_cuda_device_get_type(ggml_backend_dev_t dev) {
@@ -5745,21 +5696,13 @@ ggml_backend_t ggml_backend_cuda_init(int device) {
return nullptr;
}
ggml_backend_dev_t dev = ggml_backend_reg_dev_get(ggml_backend_cuda_reg(), device);
ggml_backend_t cuda_backend = new ggml_backend {
/* .guid = */ ggml_backend_cuda_guid(),
/* .iface = */ ggml_backend_cuda_interface,
/* .device = */ dev,
/* .device = */ ggml_backend_reg_dev_get(ggml_backend_cuda_reg(), device),
/* .context = */ ctx,
};
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
ggml_backend_cuda_device_context * dev_ctx = (ggml_backend_cuda_device_context *) dev->context;
std::lock_guard<std::mutex> lock(dev_ctx->device_mutex);
dev_ctx->active_count++;
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
return cuda_backend;
}
+104 -12
@@ -69,6 +69,7 @@ static int opt_opstage = HTP_OPSTAGE_QUEUE | HTP_OPSTAGE_COMPUTE;
static int opt_opbatch = 1024; // max number of ops in a batch
static int opt_opqueue = 16; // max number of pending batches
static int opt_oppoll = 0; // polling for batch completions
static int opt_optrace = 0; // trace buffer size per thread (0 means default)
static std::regex* opt_opfilter = NULL; // regex of ops to not claim
@@ -118,20 +119,39 @@ static void ggml_hexagon_dump_op_supp(const std::string &sess_name, const struct
ggml_op_desc(op), fmt.names, fmt.dims, fmt.types, fmt.strides, fmt.buffs, supp ? "yes" : "no");
}
static const char * htp_event_name(uint16_t id) {
switch (id) {
case HTP_TRACE_EVT_DMA: return "DMA";
case HTP_TRACE_EVT_HVX_COMP: return "HVX_COMP";
case HTP_TRACE_EVT_HVX_A_QUANT: return "HVX_A_QUANT";
case HTP_TRACE_EVT_HVX_A_PREP: return "HVX_A_PREP";
case HTP_TRACE_EVT_HVX_W_DEQUANT: return "HVX_W_DEQUANT";
case HTP_TRACE_EVT_HVX_W_PREP: return "HVX_W_PREP";
case HTP_TRACE_EVT_HVX_O_PROC: return "HVX_O_PROC";
case HTP_TRACE_EVT_HMX_COMP: return "HMX_COMP";
default: return "UNKNOWN";
}
}
static void ggml_hexagon_dump_op_prof(const std::string &sess_name, const htp_opnode & node,
uint32_t op_usec, uint32_t op_cycles, const uint32_t pmu[]) {
const htp_prof_desc & pd) {
if (!opt_profile) return;
uint32_t op_usec = pd.usecs;
uint32_t op_cycles = pd.cycles_stop - pd.cycles_start;
const uint32_t * pmu = pd.pmu;
char pmu_str[256] = "";
if (opt_profile > 1) {
if (opt_profile == 2) {
static_assert(HTP_PROF_PMU_NCNT == 8, "current implementation assumes 8 PMU counters");
sprintf(pmu_str, " pmu [%u,%u,%u,%u,%u,%u,%u,%u]",
pmu[0], pmu[1], pmu[2], pmu[3], pmu[4], pmu[5], pmu[6], pmu[7]);
}
htp_opformat fmt(node);
GGML_LOG_DEBUG("ggml-hex: %s profile-op %s: %s : %s : %s : %s : usec %u cycles %u%s\n", sess_name.c_str(),
node.op_name().c_str(), fmt.names, fmt.dims, fmt.types, fmt.strides, op_usec, op_cycles, pmu_str);
float mhz = op_usec > 0 ? (float) op_cycles / op_usec : 0.0f;
GGML_LOG_DEBUG("ggml-hex: %s profile-op %s: %s : %s : %s : %s : usec %u cycles %u start %u mhz %.1f%s\n", sess_name.c_str(),
node.op_name().c_str(), fmt.names, fmt.dims, fmt.types, fmt.strides, op_usec, op_cycles, pd.cycles_start, mhz, pmu_str);
}
// ** backend sessions
@@ -1995,10 +2015,16 @@ struct ggml_hexagon_opqueue {
size_t n_ops = batch_size;
size_t n_tensors = n_ops + n_ops * HTP_OP_MAX_INPUTS;
size_t tr_size = 0;
if (opt_profile == 3) {
tr_size = (HTP_MAX_NTHREADS + 1) * opt_optrace * sizeof(htp_trace_desc);
}
shm_blk_size = sizeof(htp_buf_desc) * n_bufs +
sizeof(htp_tensor) * n_tensors +
sizeof(htp_op_desc) * n_ops +
sizeof(htp_prof_desc) * n_ops;
sizeof(htp_prof_desc) * n_ops +
tr_size;
shm_buf = new ggml_hexagon_shared_buffer(sess, shm_blk_size * depth, true /* pinned */);
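The trace region added to each shared-memory block above can be sized with a small helper. The following is only a sketch: `trace_desc_sketch` and `trace_region_bytes` are hypothetical names, and the 8-byte record layout is an assumption based on the `id`, `info` and `cycles` fields written by the device-side tracing code in this diff, not the actual `htp_trace_desc` definition.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout of one trace record (id, info, cycles). The real
// htp_trace_desc is not shown in this diff and may differ.
struct trace_desc_sketch {
    uint16_t id;
    uint16_t info;
    uint32_t cycles;
};

// Bytes reserved for tracing in one shared-memory block: one event array per
// worker thread, plus one extra array (index max_threads).
inline size_t trace_region_bytes(size_t max_threads, size_t events_per_thread) {
    return (max_threads + 1) * events_per_thread * sizeof(trace_desc_sketch);
}
```

For instance, with 10 worker threads (an inference from the 11 counters printed in the `evt [...]` log line) and 1024 events per thread, this comes to 11 * 1024 * 8 = 90112 bytes per block under the assumed record size.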
@@ -2042,11 +2068,19 @@ struct ggml_hexagon_opqueue {
const size_t o_size = sizeof(htp_op_desc) * req.n_ops;
const size_t p_size = sizeof(htp_prof_desc) * req.n_ops;
size_t tr_size = 0;
if (opt_profile == 3) {
req.n_traces = opt_optrace;
tr_size = (HTP_MAX_NTHREADS + 1) * req.n_traces * sizeof(htp_trace_desc);
} else {
req.n_traces = 0;
}
dbuf.ptr = shm_buf->base + (req.id * shm_blk_size);
dbuf.fd = shm_buf->fd;
dbuf.flags = DSPQUEUE_BUFFER_FLAG_FLUSH_SENDER | DSPQUEUE_BUFFER_FLAG_INVALIDATE_RECIPIENT;
dbuf.offset = (uint8_t*) dbuf.ptr - (uint8_t*) shm_buf->base;
dbuf.size = b_size + t_size + o_size + p_size;
dbuf.size = b_size + t_size + o_size + p_size + tr_size;
GGML_ASSERT(dbuf.size <= shm_blk_size);
@@ -2092,7 +2126,14 @@ struct ggml_hexagon_opqueue {
const size_t o_size = sizeof(htp_op_desc) * rsp.n_ops;
const size_t p_size = sizeof(htp_prof_desc) * rsp.n_ops;
const size_t m_size = b_size + t_size + o_size + p_size;
size_t tr_size = 0;
uint32_t n_traces = 0;
if (opt_profile == 3) {
n_traces = opt_optrace;
tr_size = (HTP_MAX_NTHREADS + 1) * n_traces * sizeof(htp_trace_desc);
}
const size_t m_size = b_size + t_size + o_size + p_size + tr_size;
GGML_ASSERT(m_size <= shm_blk_size);
HEX_VERBOSE("ggml-hex: %s op-queue pop batch #%u : n-bufs %u n-tensors %u n-ops %u : m-size %zu b-size %zu t-size %zu o-size %zu\n",
@@ -2111,13 +2152,62 @@ struct ggml_hexagon_opqueue {
GGML_ASSERT(rsp.n_ops <= ops.size());
const htp_prof_desc * pd = (const htp_prof_desc *) p_ptr;
for (uint32_t i = 0; i < rsp.n_ops; i++) {
htp_usec += pd[i].usecs;
ggml_hexagon_dump_op_prof(shm_buf->sess->name, ops[i], pd[i].usecs, pd[i].cycles, pd[i].pmu);
const htp_trace_desc * trace_events = nullptr;
if (opt_profile == 3) {
trace_events = (const htp_trace_desc *) (p_ptr + p_size);
}
GGML_LOG_DEBUG("ggml-hex: %s profile-batch n-ops %u batch-dur-usec %lld htp-ops-usec %u\n",
shm_buf->sess->c_name(), rsp.n_ops, (long long) batch_usec, htp_usec);
uint32_t trace_idx[HTP_MAX_NTHREADS + 1] = {0};
uint32_t valid_cnt[HTP_MAX_NTHREADS + 1] = {0};
if (opt_profile == 3) {
for (uint32_t t = 0; t <= HTP_MAX_NTHREADS; t++) {
uint32_t count = rsp.n_traces[t];
valid_cnt[t] = count > n_traces ? n_traces : count;
}
}
for (uint32_t i = 0; i < rsp.n_ops; i++) {
htp_usec += pd[i].usecs;
ggml_hexagon_dump_op_prof(shm_buf->sess->name, ops[i], pd[i]);
if (opt_profile == 3) {
uint32_t op_duration = pd[i].cycles_stop - pd[i].cycles_start;
for (uint32_t t = 0; t <= HTP_MAX_NTHREADS; t++) {
while (trace_idx[t] < valid_cnt[t]) {
const auto & e = trace_events[t * n_traces + trace_idx[t]];
uint32_t offset = e.cycles - pd[i].cycles_start;
if (offset >= 0x80000000) {
trace_idx[t]++;
continue;
}
if (offset > op_duration) {
break;
}
bool is_stop = (e.info & 0x8000) != 0;
uint16_t info = e.info & 0x7FFF;
GGML_LOG_DEBUG("ggml-hex: %s trace-op %s: thread %u event %s info %u %s %u\n",
shm_buf->sess->c_name(), ops[i].op_name().c_str(), t, htp_event_name(e.id), info, is_stop ? "stop" : "start", e.cycles);
trace_idx[t]++;
}
}
}
}
char evt_str[256] = "";
if (opt_profile == 3) {
sprintf(evt_str, " evt [%u,%u,%u,%u,%u,%u,%u,%u,%u,%u,%u]",
rsp.n_traces[0], rsp.n_traces[1], rsp.n_traces[2], rsp.n_traces[3],
rsp.n_traces[4], rsp.n_traces[5], rsp.n_traces[6], rsp.n_traces[7],
rsp.n_traces[8], rsp.n_traces[9], rsp.n_traces[10]);
}
GGML_LOG_DEBUG("ggml-hex: %s profile-batch n-ops %u batch-dur-usec %lld htp-ops-usec %u%s\n",
shm_buf->sess->c_name(), rsp.n_ops, (long long) batch_usec, htp_usec, evt_str);
}
}
};
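The per-op trace matching above relies on unsigned 32-bit subtraction to place an event relative to an op's cycle window. A minimal sketch of that comparison, with the hypothetical names `TracePos` and `classify_event` introduced purely for illustration:

```cpp
#include <cstdint>

enum class TracePos { BeforeOp, InsideOp, AfterOp };

// Classify a 32-bit cycle timestamp relative to an op's [start, stop] window.
// Unsigned subtraction wraps modulo 2^32, so the offsets stay meaningful even
// if the cycle counter rolled over between op_start and the event.
inline TracePos classify_event(uint32_t event_cycles, uint32_t op_start, uint32_t op_stop) {
    const uint32_t duration = op_stop - op_start;
    const uint32_t offset   = event_cycles - op_start;
    if (offset >= 0x80000000u) {
        return TracePos::BeforeOp;  // wrapped "negative" distance: event precedes the op
    }
    if (offset > duration) {
        return TracePos::AfterOp;   // event belongs to a later op
    }
    return TracePos::InsideOp;
}
```

In the loop above, the `BeforeOp` case corresponds to skipping the event (`trace_idx[t]++; continue;`), and the `AfterOp` case to the `break` that leaves the event for the next op.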
@@ -3901,6 +3991,7 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
const char * str_opbatch = getenv("GGML_HEXAGON_OPBATCH");
const char * str_opqueue = getenv("GGML_HEXAGON_OPQUEUE");
const char * str_oppoll = getenv("GGML_HEXAGON_OPPOLL");
const char * str_optrace = getenv("GGML_HEXAGON_OPTRACE");
const char * str_opfilter = getenv("GGML_HEXAGON_OPFILTER");
const char * str_profile = getenv("GGML_HEXAGON_PROFILE");
const char * str_etm = getenv("GGML_HEXAGON_ETM");
@@ -3939,6 +4030,7 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
opt_opbatch = str_opbatch ? strtoul(str_opbatch, NULL, 0) : opt_opbatch;
opt_opqueue = str_opqueue ? strtoul(str_opqueue, NULL, 0) : opt_opqueue;
opt_oppoll = str_oppoll ? strtoul(str_oppoll, NULL, 0) : opt_oppoll;
opt_optrace = str_optrace ? strtoul(str_optrace, NULL, 0) : (opt_opbatch * 128);
opt_profile = str_profile ? atoi(str_profile) : 0;
opt_etm = str_etm ? atoi(str_etm) : 0;
opt_nhvx = str_nhvx ? strtoul(str_nhvx, NULL, 0) : opt_nhvx;
+1 -1
@@ -37,8 +37,8 @@ list(FIND HTP_HMX_VERSIONS ${DSP_VERSION} _hmx_idx)
if (_hmx_idx GREATER_EQUAL 0)
target_sources(${HTP_LIB} PRIVATE
hmx-matmul-ops.c
hmx-flash-attn-ops.c
hmx-matmul-ops.c
hmx-queue.c
)
@@ -339,6 +339,9 @@ static void flash_attn_ext_f16_thread(unsigned int nth, unsigned int ith, void *
if (ir0 >= ir1) return;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
dma_queue * dma = octx->ctx->dma[ith];
const uint32_t DK = nek0;
@@ -615,6 +618,7 @@ static void flash_attn_ext_f16_thread(unsigned int nth, unsigned int ith, void *
hvx_copy_f16_f32_ua(dst_ptr, (uint8_t *) VKQ32, DV);
}
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
}
int op_flash_attn_ext(struct htp_ops_context * octx) {
+10 -3
@@ -6,6 +6,8 @@
#include <stdbool.h>
#include <stdint.h>
#include "hex-profile.h"
#ifdef __cplusplus
extern "C" {
#endif
@@ -88,6 +90,7 @@ typedef struct {
uint32_t pop_idx;
uint32_t capacity;
uint32_t idx_mask;
struct htp_thread_trace * trace;
} dma_queue;
dma_queue * dma_queue_create(size_t capacity);
@@ -152,6 +155,7 @@ static inline bool dma_queue_push_single_1d(dma_queue * q, dma_ptr dptr, size_t
q->dptr[q->push_idx] = dptr;
if (size) {
htp_trace_event_start(q->trace, HTP_TRACE_EVT_DMA, q->push_idx);
dmlink(q->tail, desc);
q->tail = (dma_descriptor_2d *) desc;
} else {
@@ -202,6 +206,7 @@ static inline bool dma_queue_push_single_2d(dma_queue * q, dma_ptr dptr, size_t
q->dptr[q->push_idx] = dptr;
if (nrows) {
htp_trace_event_start(q->trace, HTP_TRACE_EVT_DMA, q->push_idx);
dmlink(q->tail, desc);
q->tail = desc;
} else {
@@ -223,10 +228,12 @@ static inline dma_ptr dma_queue_pop(dma_queue * q) {
dma_descriptor_2d * desc = &q->desc[q->pop_idx];
// Wait for desc to complete
while (!desc->done) {
// FARF(ERROR, "dma-pop: waiting for DMA : %u\n", q->pop_idx);
dmpoll();
if (!desc->done) {
while (!desc->done) {
dmpoll();
}
}
htp_trace_event_stop(q->trace, HTP_TRACE_EVT_DMA, q->pop_idx);
dptr = q->dptr[q->pop_idx];
+64
@@ -0,0 +1,64 @@
#ifndef HEX_PROFILE_H
#define HEX_PROFILE_H
#include <stdbool.h>
#include <stdint.h>
#include <qurt.h>
#include "hex-utils.h"
#include "htp-ops.h"
#define HTP_TRACE_EVT_START 0
#define HTP_TRACE_EVT_STOP 1
#ifndef HEX_NUM_PMU_COUNTERS
#define HEX_NUM_PMU_COUNTERS 8
#endif
static inline void hex_get_pmu(uint32_t counters[]) {
#if __HVX_ARCH__ >= 79
asm volatile("%0 = upmucnt0" : "=r"(counters[0]));
asm volatile("%0 = upmucnt1" : "=r"(counters[1]));
asm volatile("%0 = upmucnt2" : "=r"(counters[2]));
asm volatile("%0 = upmucnt3" : "=r"(counters[3]));
asm volatile("%0 = upmucnt4" : "=r"(counters[4]));
asm volatile("%0 = upmucnt5" : "=r"(counters[5]));
asm volatile("%0 = upmucnt6" : "=r"(counters[6]));
asm volatile("%0 = upmucnt7" : "=r"(counters[7]));
#else
counters[0] = qurt_pmu_get(QURT_PMUCNT0);
counters[1] = qurt_pmu_get(QURT_PMUCNT1);
counters[2] = qurt_pmu_get(QURT_PMUCNT2);
counters[3] = qurt_pmu_get(QURT_PMUCNT3);
counters[4] = qurt_pmu_get(QURT_PMUCNT4);
counters[5] = qurt_pmu_get(QURT_PMUCNT5);
counters[6] = qurt_pmu_get(QURT_PMUCNT6);
counters[7] = qurt_pmu_get(QURT_PMUCNT7);
#endif
}
struct htp_thread_trace {
uint32_t count;
uint32_t max_events;
struct htp_trace_desc * events;
};
static inline void htp_trace_event(struct htp_thread_trace * tr, uint16_t id, uint16_t info, uint32_t type) {
if (tr && tr->events && tr->count < tr->max_events) {
uint32_t idx = tr->count;
tr->events[idx].id = id;
tr->events[idx].info = info | (type == HTP_TRACE_EVT_STOP ? 0x8000 : 0);
tr->events[idx].cycles = (uint32_t) hex_get_cycles();
tr->count++;
}
}
static inline void htp_trace_event_start(struct htp_thread_trace * tr, uint16_t id, uint16_t info) {
htp_trace_event(tr, id, info, HTP_TRACE_EVT_START);
}
static inline void htp_trace_event_stop(struct htp_thread_trace * tr, uint16_t id, uint16_t info) {
htp_trace_event(tr, id, info, HTP_TRACE_EVT_STOP);
}
#endif /* HEX_PROFILE_H */
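The `info` field in this header packs the start/stop marker into bit 15, and the host side recovers it with `0x8000` / `0x7FFF` masks. A small round-trip sketch follows. The names `pack_info`, `unpack_info` and `DecodedInfo` are hypothetical, and the `assert` is an added guard that is not part of the header.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint16_t kStopFlag = 0x8000;  // bit 15: set for stop events
constexpr uint16_t kInfoMask = 0x7FFF;  // bits 0..14: caller-supplied payload

struct DecodedInfo {
    bool     is_stop;
    uint16_t info;
};

// Pack a payload and a start/stop marker into one 16-bit value.
// The assert is an illustrative guard: a payload that already has bit 15 set
// would be indistinguishable from a stop event after packing.
inline uint16_t pack_info(uint16_t info, bool is_stop) {
    assert((info & kStopFlag) == 0 && "payload must fit in 15 bits");
    return static_cast<uint16_t>(info | (is_stop ? kStopFlag : 0));
}

// Split a packed value back into its marker and payload.
inline DecodedInfo unpack_info(uint16_t packed) {
    return DecodedInfo{ (packed & kStopFlag) != 0,
                        static_cast<uint16_t>(packed & kInfoMask) };
}
```

Because the header ORs the flag into `info` without masking, payload values of 0x8000 or above would decode as stop events on the host, so callers likely need to keep the payload within 15 bits.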
-27
@@ -107,31 +107,4 @@ static inline void hex_pause() {
asm volatile(" pause(#255)\n");
}
#ifndef HEX_NUM_PMU_COUNTERS
#define HEX_NUM_PMU_COUNTERS 8
#endif
static inline void hex_get_pmu(uint32_t counters[]) {
#if __HVX_ARCH__ >= 79
asm volatile("%0 = upmucnt0" : "=r"(counters[0]));
asm volatile("%0 = upmucnt1" : "=r"(counters[1]));
asm volatile("%0 = upmucnt2" : "=r"(counters[2]));
asm volatile("%0 = upmucnt3" : "=r"(counters[3]));
asm volatile("%0 = upmucnt4" : "=r"(counters[4]));
asm volatile("%0 = upmucnt5" : "=r"(counters[5]));
asm volatile("%0 = upmucnt6" : "=r"(counters[6]));
asm volatile("%0 = upmucnt7" : "=r"(counters[7]));
#else
counters[0] = qurt_pmu_get(QURT_PMUCNT0);
counters[1] = qurt_pmu_get(QURT_PMUCNT1);
counters[2] = qurt_pmu_get(QURT_PMUCNT2);
counters[3] = qurt_pmu_get(QURT_PMUCNT3);
counters[4] = qurt_pmu_get(QURT_PMUCNT4);
counters[5] = qurt_pmu_get(QURT_PMUCNT5);
counters[6] = qurt_pmu_get(QURT_PMUCNT6);
counters[7] = qurt_pmu_get(QURT_PMUCNT7);
// qurt_pmu_get_pmucnt(counters);
#endif
}
#endif /* HEX_UTILS_H */
+26 -66
@@ -18,7 +18,7 @@
#include "ggml-common.h"
#include "hex-dma.h"
#include "hex-fastdiv.h"
#include "hmx-profile.h"
#include "hex-profile.h"
#include "hmx-queue.h"
#include "hmx-utils.h"
#include "htp-ctx.h"
@@ -367,8 +367,11 @@ static void fa_k_interleave_thread(unsigned int n, unsigned int i, void * data)
return;
}
struct htp_thread_trace * tr = factx->octx->ctx ? &factx->octx->ctx->trace[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, start);
hmx_interleave_rows_to_tiles(factx->vtcm_k_tiles, factx->vtcm_k_fp16[args->buf_idx], total_rows, (int) factx->DK,
(int) args->src_stride, start, end);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, start);
}
static void fa_phase_k_interleave(struct hmx_fa_context * factx, int kv_rows, size_t src_stride, size_t buf_idx) {
@@ -408,8 +411,11 @@ static void fa_v_interleave_thread(unsigned int n, unsigned int i, void * data)
__fp16 * v_tiles_dest = factx->use_pipeline ? factx->vtcm_v_tiles[args->buf_idx] : factx->vtcm_v_tiles[0];
struct htp_thread_trace * tr = factx->octx->ctx ? &factx->octx->ctx->trace[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, start);
hmx_interleave_cols_to_tiles(v_tiles_dest, factx->vtcm_v_fp16[args->buf_idx], total_rows, (int) factx->DV,
(int) args->src_stride, (int) args->n_col_tiles, start, end);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, start);
}
static void fa_phase_v_interleave(struct hmx_fa_context * factx,
@@ -462,6 +468,9 @@ static void fa_q_load_thread(unsigned int n, unsigned int i, void * data) {
return;
}
struct htp_thread_trace * tr = factx->octx->ctx ? &factx->octx->ctx->trace[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, start);
const struct htp_tensor * q = args->q;
const uint32_t q_start = args->q_start;
const uint32_t kv_head = args->kv_head;
@@ -515,6 +524,7 @@ static void fa_q_load_thread(unsigned int n, unsigned int i, void * data) {
}
}
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, start);
}
static void fa_phase_q_load(struct hmx_fa_context * factx,
@@ -566,6 +576,9 @@ static void fa_o_store_thread(unsigned int n, unsigned int i, void * data) {
return;
}
struct htp_thread_trace * tr = factx->octx->ctx ? &factx->octx->ctx->trace[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, start);
const struct htp_tensor * dst = args->dst;
const __fp16 * o_tile_src = args->o_tile_src;
const uint32_t q_start = args->q_start;
@@ -611,6 +624,7 @@ static void fa_o_store_thread(unsigned int n, unsigned int i, void * data) {
}
}
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, start);
}
static void fa_phase_o_store(struct hmx_fa_context * factx,
@@ -680,6 +694,9 @@ static void fa_softmax_thread(unsigned int n, unsigned int i, void * data) {
return;
}
struct htp_thread_trace * tr = factx->octx->ctx ? &factx->octx->ctx->trace[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, vec_start);
// Per-thread row scratch: thread i uses bufs at offset i * 2 * stride
const size_t row_buf_stride = factx->row_buf_stride;
HVX_Vector * my_row_buf0 = factx->vtcm_row_bufs + i * 2 * row_buf_stride;
@@ -950,6 +967,7 @@ static void fa_softmax_thread(unsigned int n, unsigned int i, void * data) {
factx->vtcm_s_rowmax[r_vec_idx] = rowmax_acc_v;
factx->vtcm_p_rowsum[r_vec_idx] = rowsum_acc_v;
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, vec_start);
}
// Serial m/l update + build_D. Must run after softmax barrier (s_rowmax written by all threads).
@@ -1245,6 +1263,7 @@ static __attribute__((noinline)) void fa_compute_slopes(
// ============================================================================
int hmx_flash_attn_ext(struct htp_ops_context * octx) {
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[HTP_MAX_NTHREADS] : NULL;
const struct htp_tensor * q = octx->src[0];
const struct htp_tensor * k = octx->src[1];
const struct htp_tensor * v = octx->src[2];
@@ -1422,19 +1441,6 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
return HTP_STATUS_OK;
}
// Profiling timers
TIMER_DEFINE(total);
TIMER_DEFINE(q_load);
TIMER_DEFINE(kv_dma);
TIMER_DEFINE(k_interleave);
TIMER_DEFINE(v_interleave);
TIMER_DEFINE(qk_dot);
TIMER_DEFINE(softmax);
TIMER_DEFINE(o_update);
TIMER_DEFINE(o_norm);
TIMER_DEFINE(o_store);
TIMER_START(total);
// ======== DMA setup ========
dma_queue * const dma = ctx->dma[0];
@@ -1474,12 +1480,10 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
const size_t n_row_tiles = g_br_actual / HMX_FP16_TILE_N_ROWS;
// ---- Load Q block [g_br, D] -> tiles, interleaving G heads ----
TIMER_START(q_load);
if (n_rows_g < g_br) {
hvx_splat_u8_a(factx.vtcm_q_tiles, 0, q_tile_bytes);
}
fa_phase_q_load(&factx, q, q_start, kv_head, ib3, n_rows_g);
TIMER_STOP(q_load);
// ---- Initialize per-block state ----
hvx_splat_u8_a(factx.vtcm_l_vec, 0, col_vec_bytes);
@@ -1558,10 +1562,8 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
const size_t n_col_tiles = hmx_ceil_div(kv_rows, HMX_FP16_TILE_N_COLS);
// Wait for current KV DMA
TIMER_START(kv_dma);
dma_queue_pop(dma); // K
dma_queue_pop(dma); // V
TIMER_STOP(kv_dma);
// Push mask DMA for this block (single 2D DMA when broadcast)
bool has_mask_dma = false;
@@ -1583,10 +1585,7 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
ou_job.DV = DV;
hmx_queue_push(hmx_q, hmx_queue_make_desc(hmx_fa_o_update_worker, &ou_job));
}
TIMER_START(k_interleave);
fa_phase_k_interleave(&factx, kv_rows, k_src_stride, buf_idx);
TIMER_STOP(k_interleave);
// ---- Phase 2: qk_dot(blk) on HMX ‖ V_int(blk) + DMA prefetch on HVX ----
qk_job.q_tiles = factx.vtcm_q_tiles;
@@ -1597,15 +1596,11 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
qk_job.n_dot_tiles = DK / 32;
qk_job.n_tiles_per_bc = n_tiles_per_bc;
qk_job.hmx_scales = factx.vtcm_hmx_scales_qk;
TIMER_START(qk_dot);
hmx_queue_push(hmx_q, hmx_queue_make_desc(hmx_fa_qk_dot_worker, &qk_job));
// DMA push next block (non-blocking, before worker_pool)
DMA_PREFETCH_KV(kv_blk + 1);
TIMER_START(v_interleave);
fa_phase_v_interleave(&factx, kv_rows, v_src_stride, buf_idx, n_tiles_per_bc);
TIMER_STOP(v_interleave);
// Pop and swap previous block's output update (deferred HMX pop)
if (kv_blk > 0) {
@@ -1615,7 +1610,6 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
// Pop current block's dot product job
hmx_queue_pop(hmx_q);
TIMER_STOP(qk_dot);
// ---- Phase 3: softmax(blk) + build_D(blk) | HMX idle ----
// Pop mask DMA before softmax (ensures VTCM buffer is ready)
@@ -1641,10 +1635,7 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
sargs.mask_vtcm = has_mask_dma ? (const __fp16 *) factx.vtcm_mask_buf : NULL;
sargs.mask_vtcm_row_stride = factx.mask_buf_row_stride;
sargs.slopes = factx.vtcm_slopes;
TIMER_START(softmax);
fa_phase_softmax_and_build_d(&factx, &sargs, n_row_tiles, n_row_tiles_g_br);
TIMER_STOP(softmax);
buf_idx = 1 - buf_idx;
} // end KV block loop (pipeline)
@@ -1664,11 +1655,8 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
ou_job.n_row_tiles_g_br = n_row_tiles_g_br;
ou_job.n_tiles_per_bc = n_tiles_per_bc;
ou_job.DV = DV;
TIMER_START(o_update);
hmx_queue_push(hmx_q, hmx_queue_make_desc(hmx_fa_o_update_worker, &ou_job));
hmx_queue_pop(hmx_q);
TIMER_STOP(o_update);
hex_swap_ptr((void **) &o_tile_curr, (void **) &o_tile_prev);
}
@@ -1683,23 +1671,14 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
const uint32_t kv_start = kv_blk * Bc;
const uint32_t kv_rows = hex_smin(Bc, nek1 - kv_start);
const size_t n_col_tiles = hmx_ceil_div(kv_rows, HMX_FP16_TILE_N_COLS);
TIMER_START(kv_dma);
dma_queue_pop(dma); // K
dma_queue_pop(dma); // V
TIMER_STOP(kv_dma);
bool has_mask_dma = false;
MASK_DMA_PUSH(kv_start, kv_rows, has_mask_dma);
DMA_PREFETCH_KV(kv_blk + 1);
// K interleave (multi-thread HVX)
TIMER_START(k_interleave);
fa_phase_k_interleave(&factx, kv_rows, k_src_stride, buf_idx);
TIMER_STOP(k_interleave);
// QK dot (inline HMX on main thread)
TIMER_START(qk_dot);
{
const size_t n_dot_tiles = (size_t) (DK / 32);
const __fp16 * restrict q_base = factx.vtcm_q_tiles;
@@ -1709,6 +1688,7 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
__builtin_assume(n_col_tiles > 0);
__builtin_assume(n_dot_tiles > 0);
htp_trace_event_start(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
Q6_bias_mxmem2_A((void *) factx.vtcm_hmx_scales_qk);
for (size_t r = 0; r < n_row_tiles; ++r) {
for (size_t c = 0; c < n_col_tiles; ++c) {
@@ -1724,8 +1704,8 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
Q6_mxmem_AR_after_hf(out_tile, 0);
}
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
}
TIMER_STOP(qk_dot);
// Pop mask DMA
MASK_DMA_POP(has_mask_dma);
@@ -1751,21 +1731,9 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
sargs.mask_vtcm = has_mask_dma ? (const __fp16 *) factx.vtcm_mask_buf : NULL;
sargs.mask_vtcm_row_stride = factx.mask_buf_row_stride;
sargs.slopes = factx.vtcm_slopes;
TIMER_START(softmax);
fa_phase_softmax_and_build_d(&factx, &sargs, n_row_tiles, n_row_tiles_g_br);
TIMER_STOP(softmax);
// V interleave (multi-thread HVX)
TIMER_START(v_interleave);
// FIX(v-stride): use n_tiles_per_bc (block-invariant) as V tile layout
// stride to match o_update's v_tile access. Using per-block n_col_tiles
// misplaces DV_tile 1..3 in the last partial KV block.
fa_phase_v_interleave(&factx, kv_rows, v_src_stride, buf_idx, n_tiles_per_bc);
TIMER_STOP(v_interleave);
// O update (inline HMX on main thread)
TIMER_START(o_update);
{
const size_t DV_tiles = (size_t) (DV / 32);
const __fp16 * restrict d_base = factx.vtcm_d_tiles;
@@ -1777,6 +1745,7 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
__builtin_assume(n_col_tiles > 0);
__builtin_assume(DV_tiles > 0);
htp_trace_event_start(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
Q6_bias_mxmem2_A((void *) factx.vtcm_hmx_scales_id);
for (size_t r = 0; r < n_row_tiles; ++r) {
for (size_t c = 0; c < DV_tiles; ++c) {
@@ -1798,16 +1767,15 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
Q6_mxmem_AR_after_hf(o_tile_out, 0);
}
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
hex_swap_ptr((void **) &o_tile_curr, (void **) &o_tile_prev);
}
TIMER_STOP(o_update);
buf_idx = 1 - buf_idx;
} // end KV block loop (fallback)
}
// ---- Final normalization: O = diag(1/l) @ O ----
TIMER_START(o_norm);
{
fa_build_d_diag_inv_l(&factx, n_row_tiles, n_row_tiles_g_br);
@@ -1830,6 +1798,7 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
__builtin_assume(n_row_tiles > 0);
__builtin_assume(DV_tiles > 0);
htp_trace_event_start(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
Q6_bias_mxmem2_A((void *) factx.vtcm_hmx_scales_id);
for (size_t r = 0; r < n_row_tiles; ++r) {
for (size_t c = 0; c < DV_tiles; ++c) {
@@ -1842,14 +1811,12 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
Q6_mxmem_AR_after_hf(o_out, 0);
}
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
}
}
TIMER_STOP(o_norm);
// ---- Store O block ----
TIMER_START(o_store);
fa_phase_o_store(&factx, dst, o_tile_curr, q_start, kv_head, ib3, n_rows_g);
TIMER_STOP(o_store);
#undef MASK_DMA_PUSH
#undef MASK_DMA_POP
@@ -1865,14 +1832,7 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
HAP_compute_res_hmx_unlock(ctx->vtcm_rctx);
}
TIMER_STOP(total);
#if defined(ENABLE_PROFILE_TIMERS)
FARF(HIGH, "hmx-fa: %lld us, q_load=%lld kv_dma=%lld k_interleave=%lld v_interleave=%lld", TIMER_US(total),
TIMER_US(q_load), TIMER_US(kv_dma), TIMER_US(k_interleave), TIMER_US(v_interleave));
FARF(HIGH, " qk_dot=%lld softmax=%lld o_update=%lld o_norm=%lld o_store=%lld", TIMER_US(qk_dot), TIMER_US(softmax),
TIMER_US(o_update), TIMER_US(o_norm), TIMER_US(o_store));
#endif
return HTP_STATUS_OK;
}
+55 -41
@@ -27,7 +27,7 @@
#include "hmx-ops.h"
#include "hmx-utils.h"
#include "hmx-queue.h"
#include "hmx-profile.h"
#include "hex-profile.h"
#include "vtcm-utils.h"
@@ -430,6 +430,7 @@ typedef struct {
int n_tasks;
int n_k_tiles;
struct fastdiv_values n_k_tiles_div;
struct htp_thread_trace * traces;
} x4x2_dequantize_state_t;
// Dequantize a tile range from x4x2 weight data (already in VTCM) to tile-major FP16.
@@ -533,11 +534,14 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task_##suffix(
\
static void dequantize_x4x2_worker_loop_##suffix(unsigned int n, unsigned int i, void *data) { \
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data; \
struct htp_thread_trace * tr = state->traces ? &state->traces[i] : NULL; \
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_W_DEQUANT, i); \
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) { \
int start = task_id * state->n_tiles_per_task; \
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles); \
dequantize_x4x2_weight_to_fp16_tiles_task_##suffix(state, start, end); \
} \
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_W_DEQUANT, i); \
}
DEFINE_DEQUANTIZE_Q4_TASK(q4_0, q4_0_to_fp16_lut, q4_0, HMX_X4X2_DBLK_SIZE, (int)sizeof(__fp16))
@@ -657,11 +661,14 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task_mxfp4(
static void dequantize_x4x2_worker_loop_mxfp4(unsigned int n, unsigned int i, void *data) {
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data;
struct htp_thread_trace * tr = state->traces ? &state->traces[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_W_DEQUANT, i);
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) {
int start = task_id * state->n_tiles_per_task;
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles);
dequantize_x4x2_weight_to_fp16_tiles_task_mxfp4(state, start, end);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_W_DEQUANT, i);
}
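The worker loops in these hunks all share one pattern: worker `i` of `n` takes tasks `i, i + n, i + 2n, ...`, and each task covers a tile range clamped to the total tile count. The sketch below reproduces only that index arithmetic. `smin` and `tiles_for_worker` are hypothetical names introduced for illustration (the diff uses `hex_smin` and calls a per-format task function instead of counting tiles).

```c
#include <assert.h>

// Hypothetical stand-in for hex_smin from the diff.
static int smin(int a, int b) { return a < b ? a : b; }

// Walks the same strided task loop as the worker functions in the diff and
// returns how many tiles worker `i` (out of `n` workers) would process.
static int tiles_for_worker(unsigned int n, unsigned int i,
                            int n_tasks, int n_tiles_per_task, int n_tot_tiles) {
    int visited = 0;
    for (unsigned int task_id = i; task_id < (unsigned int) n_tasks; task_id += n) {
        int start = (int) task_id * n_tiles_per_task;
        int end   = smin(start + n_tiles_per_task, n_tot_tiles);  // last task may be partial
        assert(start < end);  // holds when n_tasks == ceil(n_tot_tiles / n_tiles_per_task)
        visited += end - start;
    }
    return visited;
}
```

Because the trace start/stop calls sit outside this loop, each worker would record one event pair per invocation rather than one per task.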
static void dequantize_x4x2_weight_to_fp16_tiles_task_q8_0(
@@ -717,11 +724,14 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task_q8_0(
static void dequantize_x4x2_worker_loop_q8_0(unsigned int n, unsigned int i, void *data) {
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data;
struct htp_thread_trace * tr = state->traces ? &state->traces[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_W_DEQUANT, i);
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) {
int start = task_id * state->n_tiles_per_task;
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles);
dequantize_x4x2_weight_to_fp16_tiles_task_q8_0(state, start, end);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_W_DEQUANT, i);
}
static void convert_f16_weight_to_fp16_tiles_task(
@@ -773,11 +783,14 @@ static void convert_f16_weight_to_fp16_tiles_task(
static void convert_f16_worker_loop(unsigned int n, unsigned int i, void *data) {
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data;
struct htp_thread_trace * tr = state->traces ? &state->traces[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_W_DEQUANT, i);
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) {
int start = task_id * state->n_tiles_per_task;
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles);
convert_f16_weight_to_fp16_tiles_task(state, start, end);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_W_DEQUANT, i);
}
static void quantize_f32_weight_to_fp16_tiles_task(
@@ -833,11 +846,14 @@ static void quantize_f32_weight_to_fp16_tiles_task(
static void quantize_f32_worker_loop(unsigned int n, unsigned int i, void *data) {
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data;
struct htp_thread_trace * tr = state->traces ? &state->traces[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_W_DEQUANT, i);
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) {
int start = task_id * state->n_tiles_per_task;
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles);
quantize_f32_weight_to_fp16_tiles_task(state, start, end);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_W_DEQUANT, i);
}
@@ -868,6 +884,7 @@ static void dequantize_x4x2_weight_chunk_to_fp16_tiles(
state.weight_type = weight_type;
state.n_k_tiles = n_k_tiles;
state.n_k_tiles_div = n_k_tiles_div;
state.traces = ctx ? ctx->trace : NULL;
if (state.n_tasks == 1 || n_threads == 1) {
dequant_worker_fn(1, 0, &state);
@@ -985,10 +1002,13 @@ typedef struct {
int n_chunks_per_task;
int n_cols;
int n; // DDR row stride (total output columns)
struct htp_thread_trace * traces;
} output_transfer_task_state_t;
static void transfer_output_chunk_worker_fn(unsigned int n, unsigned int i, void *data) {
output_transfer_task_state_t *st = (output_transfer_task_state_t *) data;
struct htp_thread_trace * tr = st->traces ? &st->traces[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_O_PROC, i);
for (unsigned int task_id = i; task_id < (unsigned int)st->n_tasks; task_id += n) {
int chunk_idx = task_id * st->n_chunks_per_task;
@@ -998,6 +1018,7 @@ static void transfer_output_chunk_worker_fn(unsigned int n, unsigned int i, void
const __fp16 *vtcm_src = st->vtcm_src + chunk_idx * st->n_cols;
transfer_output_chunk_fp16_to_fp32(dst, vtcm_src, chunk_size, st->n_cols, st->n);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_O_PROC, i);
}
static void transfer_output_chunk_threaded(struct htp_context *ctx, float *dst, const __fp16 *vtcm_src,
@@ -1015,6 +1036,7 @@ static void transfer_output_chunk_threaded(struct htp_context *ctx, float *dst,
state.vtcm_src = vtcm_src;
state.n_cols = n_cols;
state.n = n;
state.traces = ctx ? ctx->trace : NULL;
if (state.n_tasks == 1 || n_threads == 1) {
transfer_output_chunk_worker_fn(1, 0, &state);
@@ -1086,10 +1108,13 @@ typedef struct {
int n_chunks_per_task;
int k_block;
int k_stride;
struct htp_thread_trace * traces;
} activation_transfer_task_state_t;
static void transfer_activation_chunk_worker_fn(unsigned int n, unsigned int i, void *data) {
activation_transfer_task_state_t *st = (activation_transfer_task_state_t *) data;
struct htp_thread_trace * tr = st->traces ? &st->traces[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_A_PREP, i);
for (unsigned int task_id = i; task_id < (unsigned int)st->n_tasks; task_id += n) {
// one chunk: one row
@@ -1100,6 +1125,7 @@ static void transfer_activation_chunk_worker_fn(unsigned int n, unsigned int i,
const float *src = st->src + chunk_idx * st->k_stride;
transfer_activation_chunk_fp32_to_fp16(dst, src, chunk_size, st->k_block, st->k_stride);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_A_PREP, i);
}
static void transfer_activation_chunk_threaded(struct htp_context *ctx, __fp16 *dst, const float *src, int n_rows, int k_block, int k_stride, int n_threads) {
@@ -1117,6 +1143,7 @@ static void transfer_activation_chunk_threaded(struct htp_context *ctx, __fp16 *
state.src = src;
state.k_block = k_block;
state.k_stride = k_stride;
state.traces = ctx ? ctx->trace : NULL;
if (state.n_tasks == 1 || n_threads == 1) {
transfer_activation_chunk_worker_fn(1, 0, &state);
@@ -1245,13 +1272,7 @@ int hmx_matmul_2d_f32(struct htp_context *ctx, float *restrict dst, const float
FARF(HIGH, "hmx-mm-2d: standard : m %d k %d n %d wtype %d mc %zu nc %zu vtcm %zu/%zu",
m, k, n, weight_type, m_chunk_n_rows, n_chunk_n_cols, vtcm_used, vtcm_budget);
TIMER_DEFINE(activation_load);
TIMER_DEFINE(weight_load);
TIMER_DEFINE(hmx_core);
TIMER_DEFINE(output_store);
TIMER_DEFINE(total);
TIMER_START(total);
int n_chunk_cnt = hmx_ceil_div(n, n_chunk_n_cols);
@@ -1370,7 +1391,12 @@ int hmx_matmul_2d_f32(struct htp_context *ctx, float *restrict dst, const float
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_scratch0, vtcm_weight, n_cols, k, row_stride, weight_type, n_k_tiles, n_k_tiles_div, dequant_worker_fn, num_threads);
// C: HMX Compute (Synchronous)
core_dot_chunk_fp16(vtcm_output, vtcm_activation, vtcm_scratch0, vtcm_scales, n_row_tiles, n_col_tiles, k / HMX_FP16_TILE_N_ROWS);
{
struct htp_thread_trace * tr = ctx ? &ctx->trace[HTP_MAX_NTHREADS] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
core_dot_chunk_fp16(vtcm_output, vtcm_activation, vtcm_scratch0, vtcm_scales, n_row_tiles, n_col_tiles, k / HMX_FP16_TILE_N_ROWS);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
}
// D: Output Store
float *output_chunk = dst + (mr * n + nc);
@@ -1380,18 +1406,7 @@ int hmx_matmul_2d_f32(struct htp_context *ctx, float *restrict dst, const float
HAP_compute_res_hmx_unlock(ctx->vtcm_rctx);
}
TIMER_STOP(total);
#if defined(ENABLE_PROFILE_TIMERS)
FARF(HIGH, "hex-mm-2d: %lld us : m %d k %d n %d", TIMER_US(total), m, k, n);
if (!use_pipeline) {
FARF(HIGH, " activation_load: %lld us, weight_load: %lld us, hmx_core: %lld us, output_store: %lld us",
TIMER_US(activation_load), TIMER_US(weight_load), TIMER_US(hmx_core), TIMER_US(output_store));
size_t weight_size = (size_t)n * row_stride;
float bandwidth = 1e-3f * weight_size / (float)TIMER_US(weight_load);
FARF(HIGH, " weight load bandwidth: %.2f GB/s", bandwidth);
}
#endif
return 0;
}
@@ -1523,13 +1538,7 @@ int hmx_matmul_f16_f32_batched(struct htp_context *ctx, const hmx_matmul_f16_f32
m_chunk_n_rows, n_chunk_n_cols,
(size_t) (vtcm_ptr - (uint8_t *) ctx->vtcm_base), vtcm_budget);
TIMER_DEFINE(activation_load);
TIMER_DEFINE(weight_load);
TIMER_DEFINE(hmx_core);
TIMER_DEFINE(output_store);
TIMER_DEFINE(total);
TIMER_START(total);
const size_t fp16_row_bytes = (size_t) params->k * sizeof(__fp16);
const size_t weight_row_bytes = (size_t) params->weight_stride * sizeof(__fp16);
@@ -1549,7 +1558,6 @@ int hmx_matmul_f16_f32_batched(struct htp_context *ctx, const hmx_matmul_f16_f32
// contiguous rows into a VTCM scratch buffer first, then HVX
// converts from the contiguous VTCM buffer. This avoids L2 cache
// thrashing from HVX loads at large strides.
TIMER_START(activation_load);
for (int g = 0; g < group_size; ++g) {
const float *activation_chunk = hmx_matmul_activation_batch_ptr(params, b2_base + g, b3) + mr * params->act_stride;
__fp16 *vtcm_act_g = vtcm_activation + (size_t) g * act_head_stride;
@@ -1569,7 +1577,6 @@ int hmx_matmul_f16_f32_batched(struct htp_context *ctx, const hmx_matmul_f16_f32
params->k, params->act_stride, ctx->n_threads);
}
}
TIMER_STOP(activation_load);
void *buf_curr = vtcm_scratch0;
void *buf_next = vtcm_scratch1;
@@ -1584,7 +1591,6 @@ int hmx_matmul_f16_f32_batched(struct htp_context *ctx, const hmx_matmul_f16_f32
const size_t n_cols = hex_smin((size_t) params->n - nc, n_chunk_n_cols);
const size_t n_col_tiles = hmx_ceil_div((int) n_cols, HMX_FP16_TILE_N_COLS);
TIMER_START(weight_load);
{
dma_queue_pop(ctx->dma[0]);
@@ -1601,24 +1607,22 @@ int hmx_matmul_f16_f32_batched(struct htp_context *ctx, const hmx_matmul_f16_f32
0, n_cols);
hex_swap_ptr(&buf_curr, &buf_next);
}
TIMER_STOP(weight_load);
// Reuse the interleaved weight for every q_head in this GQA group
for (int g = 0; g < group_size; ++g) {
TIMER_START(hmx_core);
{
const __fp16 * vtcm_act_g = vtcm_activation + (size_t) g * act_head_stride;
struct htp_thread_trace * tr = ctx ? &ctx->trace[HTP_MAX_NTHREADS] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
core_dot_chunk_fp16(vtcm_output, vtcm_act_g, vtcm_weight, vtcm_scales, n_row_tiles, n_col_tiles,
params->k / 32);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
}
TIMER_STOP(hmx_core);
TIMER_START(output_store);
{
float *output = hmx_matmul_dst_batch_ptr(params, b2_base + g, b3) + mr * params->dst_stride + nc;
transfer_output_chunk_threaded(ctx, output, vtcm_output, (int) n_rows, (int) n_cols, params->dst_stride, ctx->n_threads);
}
TIMER_STOP(output_store);
}
}
}
@@ -1627,14 +1631,7 @@ int hmx_matmul_f16_f32_batched(struct htp_context *ctx, const hmx_matmul_f16_f32
HAP_compute_res_hmx_unlock(ctx->vtcm_rctx);
TIMER_STOP(total);
#if defined(ENABLE_PROFILE_TIMERS)
FARF(HIGH, "%s: %lld us, m=%d k=%d n=%d group=%d", __func__, TIMER_US(total),
params->m, params->k, params->n, group_size);
FARF(HIGH, " activation_load: %lld us, weight_load: %lld us, hmx_core: %lld us, output_store: %lld us",
TIMER_US(activation_load), TIMER_US(weight_load), TIMER_US(hmx_core), TIMER_US(output_store));
#endif
return 0;
}
@@ -1668,6 +1665,7 @@ typedef struct {
size_t nb12;
int start_row;
int cne1;
struct htp_thread_trace *traces;
} activation_transfer_gathered_task_state_t;
typedef struct {
@@ -1684,6 +1682,7 @@ typedef struct {
size_t dst_nb2;
int start_row;
int cne1;
struct htp_thread_trace *traces;
} output_transfer_scattered_task_state_t;
static void transfer_activation_chunk_fp32_to_fp16_gathered(
@@ -1780,6 +1779,9 @@ static void transfer_activation_chunk_fp32_to_fp16_gathered(
static void transfer_activation_chunk_gathered_worker_fn(unsigned int n, unsigned int i, void *data) {
activation_transfer_gathered_task_state_t *st = data;
struct htp_thread_trace * tr = st->traces ? &st->traces[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_A_PREP, i);
int chunk_idx = i;
int chunk_size = st->n_chunks_per_task;
int start_row = st->start_row + chunk_idx * chunk_size;
@@ -1791,6 +1793,7 @@ static void transfer_activation_chunk_gathered_worker_fn(unsigned int n, unsigne
st->matrix_rows, st->cur_a, st->mapping_stride,
st->ne11, &st->ne11_div, st->nb11, st->nb12, st->cne1);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_A_PREP, i);
}
static void transfer_activation_chunk_gathered_threaded(
@@ -1830,6 +1833,7 @@ static void transfer_activation_chunk_gathered_threaded(
.nb12 = nb12,
.start_row = start_row,
.cne1 = cne1,
.traces = ctx ? ctx->trace : NULL,
};
if (actual_threads <= 1) {
@@ -1895,6 +1899,9 @@ static void transfer_output_chunk_fp16_to_fp32_scattered(
static void transfer_output_chunk_scattered_worker_fn(unsigned int n, unsigned int i, void *data) {
output_transfer_scattered_task_state_t *st = data;
struct htp_thread_trace * tr = st->traces ? &st->traces[i] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_O_PROC, i);
int chunk_idx = i;
int chunk_size = st->n_chunks_per_task;
int start_row = st->start_row + chunk_idx * chunk_size;
@@ -1906,6 +1913,7 @@ static void transfer_output_chunk_scattered_worker_fn(unsigned int n, unsigned i
st->matrix_rows, st->cur_a, st->mapping_stride,
st->dst_nb1, st->dst_nb2, st->cne1);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_O_PROC, i);
}
static void transfer_output_chunk_scattered_threaded(
@@ -1942,6 +1950,7 @@ static void transfer_output_chunk_scattered_threaded(
.dst_nb2 = dst_nb2,
.start_row = start_row,
.cne1 = cne1,
.traces = ctx ? ctx->trace : NULL,
};
if (actual_threads <= 1) {
@@ -2053,7 +2062,12 @@ int hmx_matmul_id_2d_f32(struct htp_context *ctx,
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_scratch0, vtcm_weight, n_cols, k, row_stride, weight_type, n_k_tiles, n_k_tiles_div, dequant_worker_fn, num_threads);
core_dot_chunk_fp16(vtcm_output, vtcm_activation, vtcm_scratch0, vtcm_scales, n_row_tiles, n_col_tiles, k / HMX_FP16_TILE_N_ROWS);
{
struct htp_thread_trace * tr = ctx ? &ctx->trace[HTP_MAX_NTHREADS] : NULL;
htp_trace_event_start(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
core_dot_chunk_fp16(vtcm_output, vtcm_activation, vtcm_scratch0, vtcm_scales, n_row_tiles, n_col_tiles, k / HMX_FP16_TILE_N_ROWS);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HMX_COMP, HTP_MAX_NTHREADS);
}
transfer_output_chunk_scattered_threaded(
ctx, dst, vtcm_output, (int) mr, (int) n_rows, (int) n_cols,
-34
@@ -1,34 +0,0 @@
// Conditional fine-grained profiling macros for HMX operations.
//
// Define ENABLE_PROFILE_TIMERS (via compiler flag or before including this
// header) to instrument sub-operation latencies with HAP qtimer. When the
// macro is not defined the TIMER_* helpers expand to nothing so there is zero
// overhead.
//
// Usage:
// TIMER_DEFINE(my_phase); // declare accumulator variable
// TIMER_START(my_phase); // snapshot start time
// ... work ...
// TIMER_STOP(my_phase); // accumulate elapsed ticks
// FARF(ALWAYS, "my_phase: %lld us", TIMER_US(my_phase));
#ifndef HMX_PROFILE_H
#define HMX_PROFILE_H
#include <HAP_perf.h>
// #define ENABLE_PROFILE_TIMERS
#if defined(ENABLE_PROFILE_TIMERS)
# define TIMER_DEFINE(name) int64_t name##_ticks = 0
# define TIMER_START(name) int64_t name##_t0 = HAP_perf_get_qtimer_count()
# define TIMER_STOP(name) name##_ticks += HAP_perf_get_qtimer_count() - name##_t0
# define TIMER_US(name) HAP_perf_qtimer_count_to_us(name##_ticks)
#else
# define TIMER_DEFINE(name)
# define TIMER_START(name)
# define TIMER_STOP(name)
# define TIMER_US(name) 0LL
#endif
#endif // HMX_PROFILE_H
+2
@@ -44,7 +44,9 @@ static inline void hmx_queue_process(struct hmx_queue *q, bool* killed) {
case HMX_QUEUE_SUSPEND: hmx_unlock(q); break;
default:
hmx_lock(q);
htp_trace_event_start(q->trace, HTP_TRACE_EVT_HMX_COMP, ir);
d->func(d->data);
htp_trace_event_stop(q->trace, HTP_TRACE_EVT_HMX_COMP, ir);
break;
}
+2
@@ -11,6 +11,7 @@
#include <HAP_farf.h>
#include "hex-utils.h"
#include "hex-profile.h"
#ifdef __cplusplus
extern "C" {
@@ -47,6 +48,7 @@ struct hmx_queue {
void * stack;
uint32_t hap_rctx;
bool hmx_locked;
struct htp_thread_trace * trace;
};
struct hmx_queue * hmx_queue_create(size_t capacity, uint32_t hap_rctx);
+2
@@ -4,6 +4,7 @@
#include "hex-dma.h"
#include "hmx-queue.h"
#include "htp-ops.h"
#include "hex-profile.h"
#include "worker-pool.h"
#include <assert.h>
@@ -70,6 +71,7 @@ struct htp_context {
bool hmx_enabled;
bool etm;
uint32_t profiler;
struct htp_thread_trace trace[HTP_MAX_NTHREADS + 1];
uint8_t * vtcm_base;
size_t vtcm_size;
+31 -4
@@ -146,10 +146,36 @@ struct htp_op_desc {
uint16_t dst; // Output tensor index
};
#ifndef HTP_MAX_NTHREADS
#define HTP_MAX_NTHREADS 10
#endif
#define HTP_TRACE_MAX_EVENTS 256
enum htp_profiler_mode {
HTP_PROF_DISABLED = 0,
HTP_PROF_BASIC = 1,
HTP_PROF_PMU = 2,
HTP_PROF_TRACE = 3,
};
enum htp_trace_event_id {
HTP_TRACE_EVT_DMA = 0,
HTP_TRACE_EVT_HVX_COMP = 20,
HTP_TRACE_EVT_HVX_A_QUANT = 21,
HTP_TRACE_EVT_HVX_A_PREP = 22,
HTP_TRACE_EVT_HVX_W_DEQUANT = 23,
HTP_TRACE_EVT_HVX_W_PREP = 24,
HTP_TRACE_EVT_HVX_O_PROC = 25,
HTP_TRACE_EVT_HMX_COMP = 40,
};
struct htp_trace_desc {
uint32_t cycles; // lower 32-bits of cycle counter
uint16_t id; // Event ID
uint16_t info; // bit 15: is_stop. bits 14-0: tile/chunk index or other metadata.
};
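The comment on `info` describes a packed 16-bit field: bit 15 flags a stop event and bits 14-0 carry a tile/chunk index or other metadata. A minimal sketch of that packing follows. The struct is a local mirror of `htp_trace_desc`, and the macro and helper names are assumptions made for illustration, not identifiers from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Local mirror of struct htp_trace_desc from the diff, for illustration only.
struct trace_desc {
    uint32_t cycles;  // lower 32 bits of the cycle counter
    uint16_t id;      // event ID
    uint16_t info;    // bit 15: is_stop, bits 14-0: metadata
};

// Three fixed-width fields with no padding: 8 bytes per descriptor.
static_assert(sizeof(struct trace_desc) == 8, "trace descriptor should be 8 bytes");

#define TRACE_INFO_STOP 0x8000u  // hypothetical name for bit 15
#define TRACE_INFO_META 0x7FFFu  // hypothetical mask for bits 14-0

// Pack the stop flag and metadata; metadata wider than 15 bits is truncated.
static inline uint16_t trace_info_pack(bool is_stop, uint32_t meta) {
    return (uint16_t) ((is_stop ? TRACE_INFO_STOP : 0u) | (meta & TRACE_INFO_META));
}

// True when the descriptor marks the end of an event.
static inline bool trace_info_is_stop(uint16_t info) {
    return (info & TRACE_INFO_STOP) != 0;
}

// Extract the 15-bit metadata (tile/chunk index or similar).
static inline uint16_t trace_info_meta(uint16_t info) {
    return (uint16_t) (info & TRACE_INFO_META);
}
```

Truncating in the sketch is a design choice: row indices passed as metadata elsewhere in the diff can exceed 15 bits, and how the real helpers treat such values is not shown here.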
#define HTP_PROF_PMU_NCNT 8
@@ -158,8 +184,8 @@ enum htp_profiler_mode {
struct htp_prof_desc {
uint32_t opcode; // GGML/HTP Op
uint32_t usecs; // Number of usec
uint32_t cycles; // Number of cycles
uint32_t pad; // Unused
uint32_t cycles_start; // Start cycle counter
uint32_t cycles_stop; // Stop cycle counter
uint32_t pmu[HTP_PROF_PMU_NCNT]; // PMU counters
};
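With `cycles_start` and `cycles_stop` stored as 32-bit counters, a consumer would presumably recover the per-op duration by subtraction. The sketch below shows that unsigned subtraction stays correct across a single counter wraparound. `cycles_elapsed` is a hypothetical helper, and the assumption is that fewer than 2^32 cycles pass between start and stop.

```c
#include <stdint.h>

// Elapsed cycles between two 32-bit counter snapshots.
// Unsigned arithmetic is modulo 2^32, so the result is still correct when the
// counter wrapped once between the two reads.
static inline uint32_t cycles_elapsed(uint32_t cycles_start, uint32_t cycles_stop) {
    return (uint32_t) (cycles_stop - cycles_start);
}
```

For example, a start of `0xFFFFFFF0` and a stop of `0x00000010` yields `0x20` cycles.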
@@ -168,7 +194,7 @@ struct htp_opbatch_req {
uint32_t n_bufs; // Number of buffers
uint32_t n_tensors; // Number of tensors
uint32_t n_ops; // Number of ops
uint32_t flags; // unused
uint32_t n_traces; // Number of trace descriptors per thread
uint32_t pad; // unused
// struct htp_buf_desc bufs[]; -- dspqueue buf 0
// struct htp_tensor tensors[]; -- dspqueue buf 0
@@ -181,7 +207,8 @@ struct htp_opbatch_rsp {
uint32_t n_bufs; // Number of buffers
uint32_t n_tensors; // Number of tensors
uint32_t n_ops; // Number of op profile descriptors
uint32_t pad; // unused
uint32_t n_traces[HTP_MAX_NTHREADS + 1];
uint8_t pad[8]; // align to 8 bytes
// struct htp_prof_desc profs[]; -- dspqueue buf 0
};
+41 -9
@@ -400,7 +400,9 @@ AEEResult htp_iface_start(remote_handle64 handle, uint32 sess_id, uint64 dsp_que
ctx->hmx_queue = NULL;
if (use_hmx) {
ctx->hmx_queue = hmx_queue_create(16, ctx->vtcm_rctx);
if (!ctx->hmx_queue) {
if (ctx->hmx_queue) {
ctx->hmx_queue->trace = &ctx->trace[HTP_MAX_NTHREADS];
} else {
FARF(ERROR, "hmx-queue-create failed");
ctx->hmx_enabled = false;
}
@@ -425,6 +427,9 @@ AEEResult htp_iface_start(remote_handle64 handle, uint32 sess_id, uint64 dsp_que
ctx->n_threads = n_hvx;
for (int i = 0; i < ctx->n_threads; i++) {
ctx->dma[i] = dma_queue_create(256); // queue depth
if (ctx->dma[i]) {
ctx->dma[i]->trace = &ctx->trace[i];
}
}
ctx->ddr_spad_size = 512 * 1024; // 512 KB
@@ -502,7 +507,8 @@ static void htp_error_callback(dspqueue_t queue, int error, void * context) {
struct profile_data {
uint64_t usecs;
uint64_t cycles;
uint64_t cycles_start;
uint64_t cycles_stop;
uint32_t pmu_counters[HEX_NUM_PMU_COUNTERS];
};
@@ -512,8 +518,9 @@ static inline void profile_start(uint32_t mode, struct profile_data * d) {
hex_get_pmu(d->pmu_counters);
// fallthrough
case HTP_PROF_BASIC:
case HTP_PROF_TRACE:
d->usecs = HAP_perf_get_qtimer_count();
d->cycles = hex_get_cycles();
d->cycles_start = hex_get_cycles();
break;
default:
break;
@@ -530,8 +537,9 @@ static inline void profile_stop(uint32_t mode, struct profile_data * d) {
}
// fallthrough
case HTP_PROF_BASIC:
case HTP_PROF_TRACE:
d->usecs = HAP_perf_qtimer_count_to_us(HAP_perf_get_qtimer_count() - d->usecs);
d->cycles = hex_get_cycles() - d->cycles;
d->cycles_stop = hex_get_cycles();
break;
default:
break;
@@ -845,14 +853,15 @@ static void htp_packet_callback(dspqueue_t queue, int error, void * context) {
const uint32_t t_size = sizeof(struct htp_tensor) * n_tens;
const uint32_t o_size = sizeof(struct htp_op_desc) * n_ops;
const uint32_t p_size = sizeof(struct htp_prof_desc) * n_ops;
const uint32_t tr_size = (HTP_MAX_NTHREADS + 1) * req.n_traces * sizeof(struct htp_trace_desc);
if (dbuf.size < b_size + t_size + o_size + p_size) {
FARF(ERROR, "invalid opbatch memory block size %u", dbuf.size);
if (dbuf.size < b_size + t_size + o_size + p_size + tr_size) {
FARF(ERROR, "invalid opbatch memory block size %u (req %u)", dbuf.size, b_size + t_size + o_size + p_size + tr_size);
break;
}
FARF(HIGH, "processing opbatch #%u: n-bufs %u n-tensors %u n-ops %u : m-size %u b-size %u t-size %u o-size %u", req.id,
n_bufs, n_tens, n_ops, dbuf.size, b_size, t_size, o_size);
FARF(HIGH, "processing opbatch #%u: n-bufs %u n-tensors %u n-ops %u n-traces %u : m-size %u b-size %u t-size %u o-size %u", req.id,
n_bufs, n_tens, n_ops, req.n_traces, dbuf.size, b_size, t_size, o_size);
// Setup descriptor pointers
uint8_t * m_ptr = dbuf.ptr;
@@ -869,6 +878,20 @@ static void htp_packet_callback(dspqueue_t queue, int error, void * context) {
octx->n_threads = ctx->n_threads;
octx->ctx = ctx;
if (ctx->profiler == HTP_PROF_TRACE) {
memset(ctx->trace, 0, sizeof(ctx->trace));
struct htp_trace_desc * trace_events = (struct htp_trace_desc *) (m_ptr + p_size);
for (int t = 0; t <= HTP_MAX_NTHREADS; t++) {
ctx->trace[t].events = &trace_events[t * req.n_traces];
ctx->trace[t].max_events = req.n_traces;
}
} else {
for (int t = 0; t <= HTP_MAX_NTHREADS; t++) {
ctx->trace[t].events = NULL;
ctx->trace[t].max_events = 0;
}
}
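The size check and the pointer setup above imply a simple layout: one contiguous array of trace descriptors, split into `HTP_MAX_NTHREADS + 1` equal slices of `n_traces` entries (one per HVX thread plus one extra slot that the diff uses for HMX). The following sketch reproduces that arithmetic. The struct mirror, `MAX_NTHREADS`, and both helper names are assumptions for illustration, and the sketch widens the byte count to `size_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NTHREADS 10  // mirrors the default HTP_MAX_NTHREADS in the diff

// Local mirror of struct htp_trace_desc (8 bytes).
struct trace_desc {
    uint32_t cycles;
    uint16_t id;
    uint16_t info;
};

// Bytes needed for all per-thread trace slices, matching the tr_size formula.
static size_t trace_region_bytes(uint32_t n_traces) {
    return (size_t) (MAX_NTHREADS + 1) * n_traces * sizeof(struct trace_desc);
}

// Start of the slice owned by thread slot `t`; slot MAX_NTHREADS is the extra one.
static struct trace_desc * trace_slice(struct trace_desc * base, uint32_t t, uint32_t n_traces) {
    assert(t <= MAX_NTHREADS);
    return base + (size_t) t * n_traces;
}
```

With the diff's `HTP_TRACE_MAX_EVENTS` value of 256 as `n_traces`, the region would occupy 11 * 256 * 8 = 22528 bytes.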
for (uint32_t i=0; i < n_ops; i++) {
struct profile_data prof;
@@ -886,7 +909,8 @@ static void htp_packet_callback(dspqueue_t queue, int error, void * context) {
if (ctx->profiler) {
pds[i].opcode = ops[i].opcode;
pds[i].usecs = prof.usecs;
pds[i].cycles = prof.cycles;
pds[i].cycles_start = prof.cycles_start;
pds[i].cycles_stop = prof.cycles_stop;
for (int j = 0; j < HEX_NUM_PMU_COUNTERS; j++) {
pds[i].pmu[j] = prof.pmu_counters[j];
}
@@ -899,6 +923,14 @@ static void htp_packet_callback(dspqueue_t queue, int error, void * context) {
rsp.n_bufs = n_bufs;
rsp.n_tensors = n_tens;
rsp.n_ops = n_ops;
memset(rsp.pad, 0, sizeof(rsp.pad));
if (ctx->profiler == HTP_PROF_TRACE) {
for (int t = 0; t <= HTP_MAX_NTHREADS; t++) {
rsp.n_traces[t] = ctx->trace[t].count;
}
} else {
memset(rsp.n_traces, 0, sizeof(rsp.n_traces));
}
dbuf.flags = DSPQUEUE_BUFFER_FLAG_FLUSH_SENDER | DSPQUEUE_BUFFER_FLAG_INVALIDATE_RECIPIENT;
+46
@@ -3350,6 +3350,7 @@ static void vec_dot_f16_f32_uu_1x1(const int n, float * restrict s, const void *
static void matmul_4d(unsigned int nth, unsigned int ith, void * data) {
htp_matmul_preamble;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
uint64_t t1, t2;
t1 = HAP_perf_get_qtimer_count();
@@ -3411,10 +3412,12 @@ static void matmul_4d(unsigned int nth, unsigned int ith, void * data) {
float * dst_col = (float *) ((uint8_t * restrict) dst->data + (i1 * nb1 + i2 * nb2 + i3 * nb3));
const uint32_t ir0_block_end = MIN(iir0 + blck_0, ir0_end);
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, iir0);
for (uint32_t ir0 = iir0; ir0 < ir0_block_end; ir0++) {
const uint8_t * restrict src0_row = src0_base + ir0 * nb01;
mmctx->vec_dot_1x1(ne00, &dst_col[ir0], src0_row, src1_col);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, iir0);
}
}
}
@@ -3430,6 +3433,7 @@ static void matmul_4d(unsigned int nth, unsigned int ith, void * data) {
// src1 tensor is already in VTCM spad
static void matmul_2d(unsigned int nth, unsigned int ith, void * data) {
htp_matmul_preamble;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
const uint32_t src0_nrows = ne01 * ne02 * ne03; // src0 rows
const uint32_t src1_nrows = ne11 * ne12 * ne13; // src1 rows
@@ -3477,6 +3481,8 @@ static void matmul_2d(unsigned int nth, unsigned int ith, void * data) {
for (uint32_t ir0 = src0_start_row; ir0 < src0_end_row_x2; ir0 += 2) {
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
// Process src1 columns in pairs (2×2 tiling)
uint32_t ir1 = 0;
for (; ir1 + 1 < src1_nrows; ir1 += 2) {
@@ -3494,6 +3500,8 @@ static void matmul_2d(unsigned int nth, unsigned int ith, void * data) {
mmctx->vec_dot_2x1(ne00, &dst_row[ir0], ss0, ss0 + src0_stride, src1_col);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
// Prefetch next (n + spad_nrows) row
const int pr0 = (ir0 + MM_SPAD_SRC0_NROWS);
const int is0 = (pr0 - src0_start_row) % MM_SPAD_SRC0_NROWS;
@@ -3511,12 +3519,14 @@ static void matmul_2d(unsigned int nth, unsigned int ith, void * data) {
src0_stride, src0_row_size, 1);
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
#pragma unroll(2)
for (uint32_t ir1 = 0; ir1 < src1_nrows; ++ir1) {
const uint8_t * restrict src1_col = (const uint8_t *) (src1_data + ir1 * src1_stride);
float * restrict dst_row = (float *) (dst->data + (ir1 * dst_row_size));
mmctx->vec_dot_1x1(ne00, &dst_row[ir0], ss0, src1_col);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
}
t2 = HAP_perf_get_qtimer_count();
@@ -3530,6 +3540,7 @@ static void matmul_2d(unsigned int nth, unsigned int ith, void * data) {
// q8x4x2 src1 tensor is already in VTCM spad
static void matvec_2d(unsigned int nth, unsigned int ith, void * data) {
htp_matmul_preamble;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
const uint32_t src0_nrows = ne01;
@@ -3581,7 +3592,9 @@ static void matvec_2d(unsigned int nth, unsigned int ith, void * data) {
// Process src0 rows
for (uint32_t ir0 = src0_start_row; ir0 < src0_end_row_x4; ir0 += 4) {
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
mmctx->vec_dot_4x1(ne00, &tmp[ir0 - src0_start_row], ss0, ss0 + src0_stride, ss0 + 2 * src0_stride, ss0 + 3 * src0_stride, src1_col);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
// Prefetch next (n + spad_nrows) row
const uint32_t pr0 = (ir0 + MM_SPAD_SRC0_NROWS);
@@ -3599,7 +3612,9 @@ static void matvec_2d(unsigned int nth, unsigned int ith, void * data) {
dma_queue_push_ddr_to_vtcm(dma_queue, dma_make_ptr(spad_src0 + is0 * src0_stride, src0_row + ir0 * src0_row_size),
src0_stride, src0_row_size, 2);
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
mmctx->vec_dot_2x1(ne00, &tmp[ir0 - src0_start_row], ss0, ss0 + src0_stride, src1_col);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
ir0 += 2;
}
if (ir0 < src0_end_row) {
@@ -3607,7 +3622,9 @@ static void matvec_2d(unsigned int nth, unsigned int ith, void * data) {
dma_queue_push_ddr_to_vtcm(dma_queue, dma_make_ptr(spad_src0 + is0 * src0_stride, src0_row + ir0 * src0_row_size),
src0_stride, src0_row_size, 1);
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
mmctx->vec_dot_1x1(ne00, &tmp[ir0 - src0_start_row], ss0, src1_col);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
ir0 += 1;
}
} else {
@@ -3627,7 +3644,9 @@ static void matvec_2d(unsigned int nth, unsigned int ith, void * data) {
// Process src0 rows
for (uint32_t ir0 = src0_start_row; ir0 < src0_end_row_x2; ir0 += 2) {
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
mmctx->vec_dot_2x1(ne00, &tmp[ir0 - src0_start_row], ss0, ss0 + src0_stride, src1_col);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
// Prefetch next (n + spad_nrows) row
const uint32_t pr0 = (ir0 + MM_SPAD_SRC0_NROWS);
@@ -3645,7 +3664,9 @@ static void matvec_2d(unsigned int nth, unsigned int ith, void * data) {
dma_queue_push_ddr_to_vtcm(dma_queue, dma_make_ptr(spad_src0 + is0 * src0_stride, src0_row + ir0 * src0_row_size),
src0_stride, src0_row_size, 1);
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
mmctx->vec_dot_1x1(ne00, &tmp[ir0 - src0_start_row], ss0, src1_col);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
}
}
@@ -3669,6 +3690,7 @@ struct mmid_row_mapping {
// src1 tensor is already in VTCM spad
static void matmul_id(unsigned int nth, unsigned int ith, void * data) {
htp_matmul_preamble;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
const struct htp_tensor * restrict ids = octx->src[2];
struct htp_spad * restrict src2_spad = &octx->src2_spad;
@@ -3735,6 +3757,7 @@ static void matmul_id(unsigned int nth, unsigned int ith, void * data) {
for (uint32_t ir0 = src0_start_row; ir0 < src0_end_row_x2; ir0 += 2) {
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
for (uint32_t cid = 0; cid < cne1; ++cid) {
struct mmid_row_mapping row_mapping = MMID_MATRIX_ROW(cur_a, cid);
const int rm1 = row_mapping.i1; // expert idx
@@ -3746,6 +3769,7 @@ static void matmul_id(unsigned int nth, unsigned int ith, void * data) {
mmctx->vec_dot_2x1(ne00, &dst_row[ir0], ss0, ss0 + src0_row_size_padded, src1_col);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
// Prefetch next (n + spad_nrows) row
const int pr0 = (ir0 + MM_SPAD_SRC0_NROWS);
@@ -3764,6 +3788,7 @@ static void matmul_id(unsigned int nth, unsigned int ith, void * data) {
src0_row_size_padded, src0_row_size, 1);
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
for (uint32_t cid = 0; cid < cne1; ++cid) {
struct mmid_row_mapping row_mapping = MMID_MATRIX_ROW(cur_a, cid);
const int rm1 = row_mapping.i1; // expert idx
@@ -3775,6 +3800,7 @@ static void matmul_id(unsigned int nth, unsigned int ith, void * data) {
mmctx->vec_dot_1x1(ne00, &dst_row[ir0], ss0, src1_col);
}
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
}
}
@@ -3789,6 +3815,7 @@ static void matmul_id(unsigned int nth, unsigned int ith, void * data) {
// src1 tensor is already in VTCM spad
static void matvec_id(unsigned int nth, unsigned int ith, void * data) {
htp_matmul_preamble;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
const struct htp_tensor * restrict ids = octx->src[2];
struct htp_spad * restrict src2_spad = &octx->src2_spad;
@@ -3847,7 +3874,9 @@ static void matvec_id(unsigned int nth, unsigned int ith, void * data) {
// Process src0 rows
for (uint32_t ir0 = src0_start_row; ir0 < src0_end_row_x2; ir0 += 2) {
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
mmctx->vec_dot_2x1(ne00, &dst_row[ir0], ss0, ss0 + src0_row_size_padded, src1_col);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
// Prefetch next (n + spad_nrows) row
const int pr0 = (ir0 + MM_SPAD_SRC0_NROWS);
@@ -3865,7 +3894,9 @@ static void matvec_id(unsigned int nth, unsigned int ith, void * data) {
dma_queue_push_ddr_to_vtcm(dma_queue, dma_make_ptr(spad_src0 + is0 * src0_row_size_padded, src0_row + ir0 * src0_row_size),
src0_row_size_padded, src0_row_size, 1);
const uint8_t * ss0 = dma_queue_pop(dma_queue).dst;
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
mmctx->vec_dot_1x1(ne00, &dst_row[ir0], ss0, src1_col);
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_COMP, ir0);
}
}
@@ -4147,6 +4178,7 @@ static void quantize_row_f32_q8x4x2(float * restrict x, uint8_t * restrict y, ui
static void quantize_f32_q8x4x2(unsigned int nth, unsigned int ith, void * data) {
struct htp_matmul_context * mmctx = data;
struct htp_ops_context * octx = mmctx->octx;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
const struct htp_tensor * src = octx->src[1];
uint8_t * restrict dst = octx->src1_spad.data;
@@ -4163,6 +4195,7 @@ static void quantize_f32_q8x4x2(unsigned int nth, unsigned int ith, void * data)
const uint32_t nrows = ne1 * ne2 * ne3; // total n_rows
const uint32_t ir_first = nrows_per_thread * ith; // first row
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_A_QUANT, ir_first);
const uint32_t ir_last = MIN(ir_first + nrows_per_thread, nrows); // last row
const size_t src_row_size = src->nb[1];
@@ -4189,6 +4222,7 @@ static void quantize_f32_q8x4x2(unsigned int nth, unsigned int ith, void * data)
FARF(HIGH, "quantize-f32-q8x4: %u/%u : n-rows %u (%u:%u) row-size %u -> %u usec %u\n", ith, nth, nrows, ir_first,
ir_last, src_row_size, dst_row_size, (unsigned) HAP_perf_qtimer_count_to_us(t2 - t1));
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_A_QUANT, ir_first);
}
static void quantize_row_f32_q8_1x4x2(float * restrict x, uint8_t * restrict y, uint32_t k) {
@@ -4219,6 +4253,7 @@ static void quantize_row_f32_q8_1x4x2(float * restrict x, uint8_t * restrict y,
static void quantize_f32_q8_1x4x2(unsigned int nth, unsigned int ith, void * data) {
struct htp_matmul_context * mmctx = data;
struct htp_ops_context * octx = mmctx->octx;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
const struct htp_tensor * src = octx->src[1];
uint8_t * restrict dst = octx->src1_spad.data;
@@ -4235,6 +4270,7 @@ static void quantize_f32_q8_1x4x2(unsigned int nth, unsigned int ith, void * dat
const uint32_t nrows = ne1 * ne2 * ne3; // total n_rows
const uint32_t ir_first = nrows_per_thread * ith; // first row
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_A_QUANT, ir_first);
const uint32_t ir_last = MIN(ir_first + nrows_per_thread, nrows); // last row
const size_t src_row_size = src->nb[1];
@@ -4260,11 +4296,13 @@ static void quantize_f32_q8_1x4x2(unsigned int nth, unsigned int ith, void * dat
FARF(HIGH, "quantize-f32-q8_1x4: %u/%u : n-rows %u (%u:%u) row-size %u -> %u usec %u\n", ith, nth, nrows, ir_first,
ir_last, src_row_size, dst_row_size, (unsigned) HAP_perf_qtimer_count_to_us(t2 - t1));
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_A_QUANT, ir_first);
}
static void quantize_f32_f32(unsigned int nth, unsigned int ith, void * data) {
struct htp_matmul_context * mmctx = data;
struct htp_ops_context * octx = mmctx->octx;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
const struct htp_tensor * src = octx->src[1];
uint8_t * restrict dst = octx->src1_spad.data;
@@ -4281,6 +4319,7 @@ static void quantize_f32_f32(unsigned int nth, unsigned int ith, void * data) {
const uint32_t nrows = ne1 * ne2 * ne3; // total n_rows
const uint32_t ir_first = nrows_per_thread * ith; // first row
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_A_QUANT, ir_first);
const uint32_t ir_last = MIN(ir_first + nrows_per_thread, nrows); // last row
const size_t src_row_size = ne0 * sizeof(float);
@@ -4301,11 +4340,13 @@ static void quantize_f32_f32(unsigned int nth, unsigned int ith, void * data) {
FARF(HIGH, "quantize-f32-f32: %u/%u : n-rows %u (%u:%u) row-size %u (%u) -> %u usec %u\n", ith, nth, nrows, ir_first,
ir_last, src_row_size, src_stride, dst_stride, (unsigned) HAP_perf_qtimer_count_to_us(t2 - t1));
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_A_QUANT, ir_first);
}
static void quantize_f32_f16(unsigned int nth, unsigned int ith, void * data) {
struct htp_matmul_context * mmctx = data;
struct htp_ops_context * octx = mmctx->octx;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
const struct htp_tensor * src = octx->src[1];
uint8_t * restrict dst = octx->src1_spad.data;
@@ -4322,6 +4363,7 @@ static void quantize_f32_f16(unsigned int nth, unsigned int ith, void * data) {
const uint32_t nrows = ne1 * ne2 * ne3; // total n_rows
const uint32_t ir_first = nrows_per_thread * ith; // first row
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_A_QUANT, ir_first);
const uint32_t ir_last = MIN(ir_first + nrows_per_thread, nrows); // last row
const size_t src_row_size = ne0 * sizeof(float);
@@ -4342,12 +4384,14 @@ static void quantize_f32_f16(unsigned int nth, unsigned int ith, void * data) {
FARF(HIGH, "quantize-f32-f16: %u/%u : n-rows %u (%u:%u) row-size %u (%u) -> %u usec %u\n", ith, nth, nrows, ir_first,
ir_last, src_row_size, src_stride, dst_stride, (unsigned) HAP_perf_qtimer_count_to_us(t2 - t1));
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_A_QUANT, ir_first);
}
// TODO just a plain copy that should be done via the DMA during the Op setup
static void quantize_f16_f16(unsigned int nth, unsigned int ith, void * data) {
struct htp_matmul_context * mmctx = data;
struct htp_ops_context * octx = mmctx->octx;
struct htp_thread_trace * tr = octx->ctx ? &octx->ctx->trace[ith] : NULL;
const struct htp_tensor * src = octx->src[1];
uint8_t * restrict dst = octx->src1_spad.data;
@@ -4364,6 +4408,7 @@ static void quantize_f16_f16(unsigned int nth, unsigned int ith, void * data) {
const uint32_t nrows = ne1 * ne2 * ne3; // total n_rows
const uint32_t ir_first = nrows_per_thread * ith; // first row
htp_trace_event_start(tr, HTP_TRACE_EVT_HVX_A_QUANT, ir_first);
const uint32_t ir_last = MIN(ir_first + nrows_per_thread, nrows); // last row
const size_t src_row_size = ne0 * sizeof(float);
@@ -4384,6 +4429,7 @@ static void quantize_f16_f16(unsigned int nth, unsigned int ith, void * data) {
FARF(HIGH, "quantize-f16-f16: %u/%u : n-rows %u (%u:%u) row-size %u (%u) -> %u usec %u\n", ith, nth, nrows, ir_first,
ir_last, src_row_size, src_stride, dst_stride, (unsigned) HAP_perf_qtimer_count_to_us(t2 - t1));
htp_trace_event_stop(tr, HTP_TRACE_EVT_HVX_A_QUANT, ir_first);
}
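The hunks above repeat one instrumentation pattern: resolve the per-thread trace slot once (NULL when there is no context), then bracket each compute block with a start/stop pair tagged with the row index. The following is a minimal sketch of that pattern, not the actual implementation: `demo_thread_trace`, `demo_trace_push`, and `demo_process_rows` are hypothetical names, the real `htp_thread_trace` layout is not shown in this diff, and the NULL-is-a-no-op behaviour is an assumption inferred from the call sites.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-ins; the real trace structures are not part of this diff.
enum { DEMO_TRACE_MAX_EVENTS = 64 };
enum { DEMO_EVT_COMPUTE = 1 };

struct demo_trace_event {
    uint16_t id;      // event kind
    uint8_t  is_stop; // 0 = start, 1 = stop
    uint32_t info;    // caller-supplied tag, here the first row of the block
};

struct demo_thread_trace {
    uint32_t                n_events;
    struct demo_trace_event events[DEMO_TRACE_MAX_EVENTS];
};

// Assumption: a NULL tracer (tracing disabled) or a full buffer drops the event.
static void demo_trace_push(struct demo_thread_trace * tr, uint16_t id, uint8_t is_stop, uint32_t info) {
    if (tr == NULL || tr->n_events >= DEMO_TRACE_MAX_EVENTS) {
        return;
    }
    struct demo_trace_event * ev = &tr->events[tr->n_events++];
    ev->id      = id;
    ev->is_stop = is_stop;
    ev->info    = info;
}

// Walk rows [first, last) in blocks of `step`, bracketing each block with a
// start/stop pair. Returns the number of blocks processed.
static int demo_process_rows(struct demo_thread_trace * tr, uint32_t first, uint32_t last, uint32_t step) {
    assert(step > 0);
    int n_blocks = 0;
    for (uint32_t ir0 = first; ir0 < last; ir0 += step) {
        demo_trace_push(tr, DEMO_EVT_COMPUTE, 0, ir0);
        n_blocks++; // stand-in for the vec_dot work on this block
        demo_trace_push(tr, DEMO_EVT_COMPUTE, 1, ir0);
    }
    return n_blocks;
}
```

Keeping the NULL check inside the push helper means the call sites stay unconditional, which matches how the hunks above add the calls without touching the surrounding control flow.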
+20 -3
@@ -66,7 +66,6 @@ struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_base(ggml
const char * op_str = "undefined";
switch (op) {
case GGML_OP_ADD_ID: op_str = "add_id"; break;
case GGML_OP_CONCAT: op_str = "concat"; break;
default: GGML_ABORT("fatal error");
};
@@ -211,6 +210,21 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_repeat(ggml_meta
return res;
}
ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_concat(ggml_metal_library_t lib, ggml_type tsrc) {
char base[256];
char name[256];
snprintf(base, 256, "kernel_concat_%s", ggml_type_name(tsrc));
snprintf(name, 256, "%s", base);
ggml_metal_pipeline_with_params res = ggml_metal_library_get_pipeline(lib, name);
if (!res.pipeline) {
res = ggml_metal_library_compile_pipeline(lib, base, name, nullptr);
}
return res;
}
ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_unary(ggml_metal_library_t lib, const ggml_tensor * op) {
char base[256];
char name[256];
@@ -1689,7 +1703,9 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_norm(ggml_metal_
}
ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_rope(ggml_metal_library_t lib, const ggml_tensor * op) {
assert(op->op == GGML_OP_ROPE);
assert(op->op == GGML_OP_ROPE || op->op == GGML_OP_ROPE_BACK);
const bool is_back = op->op == GGML_OP_ROPE_BACK;
char base[256];
char name[256];
@@ -1713,13 +1729,14 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_rope(ggml_metal_
snprintf(base, 256, "kernel_rope_norm_%s", ggml_type_name(op->src[0]->type));
}
snprintf(name, 256, "%s_imrope=%d", base, is_imrope ? 1 : 0);
snprintf(name, 256, "%s_imrope=%d_is_back=%d", base, is_imrope ? 1 : 0, is_back ? 1 : 0);
ggml_metal_pipeline_with_params res = ggml_metal_library_get_pipeline(lib, name);
if (!res.pipeline) {
ggml_metal_cv_t cv = ggml_metal_cv_init();
ggml_metal_cv_set_bool(cv, is_imrope, FC_ROPE + 0);
ggml_metal_cv_set_bool(cv, is_back, FC_ROPE + 1);
res = ggml_metal_library_compile_pipeline(lib, base, name, cv);
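The `is_back` function constant set here is consumed in the kernel by negating the sine term. Assuming the usual 2-D rotation form of RoPE, flipping the sign of the sine turns a rotation by theta into a rotation by -theta, which undoes the forward pass. The sketch below illustrates that identity only; `demo_rope_pair` is a hypothetical helper, not the Metal kernel, and it takes precomputed cosine and sine values so it needs no math library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Rotate the pair (x0, x1) by the angle whose cosine/sine are given.
// With is_back set, the sine is negated, which rotates by the opposite angle.
static void demo_rope_pair(float * x0, float * x1, float cos_theta, float sin_theta, bool is_back) {
    assert(x0 != NULL && x1 != NULL);
    if (is_back) {
        sin_theta = -sin_theta;
    }
    const float a = *x0;
    const float b = *x1;
    *x0 = a * cos_theta - b * sin_theta;
    *x1 = a * sin_theta + b * cos_theta;
}

// Small absolute-value helper so the example has no libm dependency.
static float demo_absf(float v) {
    return v < 0.0f ? -v : v;
}
```

Applying the helper forward and then with `is_back` set should return the input up to rounding, which is why a single kernel with one extra boolean constant can serve both ops.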
+1
@@ -115,6 +115,7 @@ struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_get_rows
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_set_rows (ggml_metal_library_t lib, enum ggml_type tidx, enum ggml_type tdst);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_diag (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_repeat (ggml_metal_library_t lib, enum ggml_type tsrc);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_concat (ggml_metal_library_t lib, enum ggml_type tsrc);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_unary (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_glu (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_sum (ggml_metal_library_t lib, const struct ggml_tensor * op);
+17 -5
@@ -1123,13 +1123,24 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
return true;
case GGML_OP_CONCAT:
{
// kernel_concat copies one float-sized value per element.
// Other scalar types need a type-generic copy kernel first.
const enum ggml_type src0_type = op->src[0]->type;
const enum ggml_type src1_type = op->src[1]->type;
return src0_type == src1_type &&
src0_type == op->type &&
(src0_type == GGML_TYPE_F32 || src0_type == GGML_TYPE_I32);
if (src0_type != src1_type || src0_type != op->type) {
return false;
}
switch (src0_type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
case GGML_TYPE_I8:
case GGML_TYPE_I16:
case GGML_TYPE_I32:
case GGML_TYPE_I64:
return true;
case GGML_TYPE_BF16:
return has_bfloat;
default:
return false;
}
}
case GGML_OP_ADD:
case GGML_OP_SUB:
@@ -1173,6 +1184,7 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
case GGML_OP_RMS_NORM:
return has_simdgroup_reduction && (ggml_is_contiguous_rows(op->src[0]));
case GGML_OP_ROPE:
case GGML_OP_ROPE_BACK:
return true;
case GGML_OP_IM2COL:
return ggml_is_contiguous(op->src[1]) && op->src[1]->type == GGML_TYPE_F32 && (op->type == GGML_TYPE_F16 || op->type == GGML_TYPE_F32);
+2 -1
@@ -375,6 +375,7 @@ static int ggml_metal_op_encode_impl(ggml_metal_op_t ctx, int idx) {
n_fuse = ggml_metal_op_norm(ctx, idx);
} break;
case GGML_OP_ROPE:
case GGML_OP_ROPE_BACK:
{
n_fuse = ggml_metal_op_rope(ctx, idx);
} break;
@@ -556,7 +557,7 @@ int ggml_metal_op_concat(ggml_metal_op_t ctx, int idx) {
/*.dim =*/ dim,
};
auto pipeline = ggml_metal_library_get_pipeline_base(lib, GGML_OP_CONCAT);
auto pipeline = ggml_metal_library_get_pipeline_concat(lib, op->type);
ggml_metal_encoder_set_pipeline(enc, pipeline);
ggml_metal_encoder_set_bytes (enc, &args, sizeof(args), 0);
+29 -12
@@ -4358,6 +4358,7 @@ template [[host_name("kernel_mul_mv_bf16_bf16_short")]] kernel mul_mv_t_t_short_
#endif
constant bool FC_rope_is_imrope [[function_constant(FC_ROPE + 0)]];
constant bool FC_rope_is_back [[function_constant(FC_ROPE + 1)]];
static float rope_yarn_ramp(const float low, const float high, const int i0) {
const float y = (i0 / 2 - low) / max(0.001f, high - low);
@@ -4381,6 +4382,9 @@ static void rope_yarn(
}
*cos_theta = cos(theta) * mscale;
*sin_theta = sin(theta) * mscale;
if (FC_rope_is_back) {
*sin_theta *= -1.0f;
}
}
// Apparently solving `n_rot = 2pi * x * base^((2 * max_pos_emb) / n_dims)` for x, we get
@@ -7513,14 +7517,15 @@ template [[host_name("kernel_cpy_q5_0_f16")]] kernel cpy_q_f_t kernel_cpy_q_f32<
template [[host_name("kernel_cpy_q5_1_f16")]] kernel cpy_q_f_t kernel_cpy_q_f32<half4x4, block_q5_1, 2, dequantize_q5_1>;
template [[host_name("kernel_cpy_q8_0_f16")]] kernel cpy_q_f_t kernel_cpy_q_f32<half4x4, block_q8_0, 2, dequantize_q8_0>;
template<typename T>
kernel void kernel_concat(
constant ggml_metal_kargs_concat & args,
device const char * src0,
device const char * src1,
device char * dst,
uint3 tgpig[[threadgroup_position_in_grid]],
ushort3 tpitg[[thread_position_in_threadgroup]],
ushort3 ntg[[threads_per_threadgroup]]) {
constant ggml_metal_kargs_concat & args,
device const char * src0,
device const char * src1,
device char * dst,
uint3 tgpig[[threadgroup_position_in_grid]],
ushort3 tpitg[[thread_position_in_threadgroup]],
ushort3 ntg[[threads_per_threadgroup]]) {
const int i3 = tgpig.z;
const int i2 = tgpig.y;
@@ -7533,21 +7538,33 @@ kernel void kernel_concat(
int o[4] = {0, 0, 0, 0};
o[args.dim] = args.dim == 0 ? args.ne00 : (args.dim == 1 ? args.ne01 : (args.dim == 2 ? args.ne02 : args.ne03));
device const float * x;
for (int i0 = tpitg.x; i0 < args.ne0; i0 += ntg.x) {
device const T * x;
if (i0 < args.ne00 && i1 < args.ne01 && i2 < args.ne02 && i3 < args.ne03) {
x = (device const float *)(src0 + (i3 )*args.nb03 + (i2 )*args.nb02 + (i1 )*args.nb01 + (i0 )*args.nb00);
x = (device const T *)(src0 + (i3 )*args.nb03 + (i2 )*args.nb02 + (i1 )*args.nb01 + (i0 )*args.nb00);
} else {
x = (device const float *)(src1 + (i3 - o[3])*args.nb13 + (i2 - o[2])*args.nb12 + (i1 - o[1])*args.nb11 + (i0 - o[0])*args.nb10);
x = (device const T *)(src1 + (i3 - o[3])*args.nb13 + (i2 - o[2])*args.nb12 + (i1 - o[1])*args.nb11 + (i0 - o[0])*args.nb10);
}
device float * y = (device float *)(dst + i3*args.nb3 + i2*args.nb2 + i1*args.nb1 + i0*args.nb0);
device T * y = (device T *)(dst + i3*args.nb3 + i2*args.nb2 + i1*args.nb1 + i0*args.nb0);
*y = *x;
}
}
typedef decltype(kernel_concat<float>) kernel_concat_t;
template [[host_name("kernel_concat_f32")]] kernel kernel_concat_t kernel_concat<float>;
template [[host_name("kernel_concat_f16")]] kernel kernel_concat_t kernel_concat<half>;
#if defined(GGML_METAL_HAS_BF16)
template [[host_name("kernel_concat_bf16")]] kernel kernel_concat_t kernel_concat<bfloat>;
#endif
template [[host_name("kernel_concat_i8")]] kernel kernel_concat_t kernel_concat<char>;
template [[host_name("kernel_concat_i16")]] kernel kernel_concat_t kernel_concat<short>;
template [[host_name("kernel_concat_i32")]] kernel kernel_concat_t kernel_concat<int>;
template [[host_name("kernel_concat_i64")]] kernel kernel_concat_t kernel_concat<long>;
template<int nr0, typename args_t>
void kernel_mul_mv_q2_K_f32_impl(
args_t args,
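The templated `kernel_concat` above picks its source per element: indices that fall inside `src0` read from `src0`, everything else reads from `src1` after subtracting an offset along the concat dimension. A host-side sketch of the same selection rule, reduced to two contiguous dimensions and made type-generic through an element size, might look like this. `demo_concat_2d` is a hypothetical name and the contiguous layout is an assumption; the kernel itself works with explicit byte strides in four dimensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Concatenate two contiguous 2-D arrays along `dim` (0 or 1).
// Shapes are (ne00 x ne01) and (ne10 x ne11), with index 0 varying fastest.
static void demo_concat_2d(const void * src0, const void * src1, void * dst, size_t elem_size,
                           int ne00, int ne01, int ne10, int ne11, int dim) {
    assert(dim == 0 || dim == 1);
    const int ne0 = dim == 0 ? ne00 + ne10 : ne00;
    const int ne1 = dim == 1 ? ne01 + ne11 : ne01;

    // Offset applied to src1 indices along the concat dimension only.
    int o[2] = {0, 0};
    o[dim] = dim == 0 ? ne00 : ne01;

    const uint8_t * s0 = (const uint8_t *) src0;
    const uint8_t * s1 = (const uint8_t *) src1;
    uint8_t       * d  = (uint8_t *) dst;

    for (int i1 = 0; i1 < ne1; ++i1) {
        for (int i0 = 0; i0 < ne0; ++i0) {
            const uint8_t * x;
            if (i0 < ne00 && i1 < ne01) {
                x = s0 + ((size_t) i1 * ne00 + i0) * elem_size;
            } else {
                x = s1 + ((size_t) (i1 - o[1]) * ne10 + (i0 - o[0])) * elem_size;
            }
            memcpy(d + ((size_t) i1 * ne0 + i0) * elem_size, x, elem_size);
        }
    }
}
```

Copying by element size is the host-side analogue of instantiating the kernel per type: the index logic never depends on what the element means, only on how wide it is.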
+52 -6
@@ -564,6 +564,9 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_mul_mat_f16_f32_1row;
cl_kernel kernel_mul_mat_f16_f32;
cl_kernel kernel_mul_mat_f16_f32_l4;
cl_kernel kernel_mul_mat_f16_f32_l4_dr;
cl_kernel kernel_mul_mat_f16_f32_l4_dr_ls;
cl_kernel kernel_mul_mat_f16_f32_l4_dr_lq;
cl_kernel kernel_mul_mat_f16_f32_tiled;
cl_kernel kernel_adreno_xmem_pack_src_f32;
cl_kernel kernel_adreno_xmem_prepack_weight_f16;
@@ -1787,6 +1790,11 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_mul_mat_f16_f32_l4 = clCreateKernel(backend_ctx->program_mul_mv_f16_f32_l4, "kernel_mul_mat_f16_f32_l4", &err), err));
CL_CHECK((backend_ctx->kernel_mul_mat_f16_f32_l4_dr = clCreateKernel(backend_ctx->program_mul_mv_f16_f32_l4, "kernel_mul_mat_f16_f32_l4_dr", &err), err));
if (backend_ctx->gpu_family == ADRENO) {
CL_CHECK((backend_ctx->kernel_mul_mat_f16_f32_l4_dr_ls = clCreateKernel(backend_ctx->program_mul_mv_f16_f32_l4, "kernel_mul_mat_f16_f32_l4_dr_ls", &err), err));
CL_CHECK((backend_ctx->kernel_mul_mat_f16_f32_l4_dr_lq = clCreateKernel(backend_ctx->program_mul_mv_f16_f32_l4, "kernel_mul_mat_f16_f32_l4_dr_lq", &err), err));
}
GGML_LOG_CONT(".");
}
@@ -14570,11 +14578,31 @@ static void ggml_cl_mul_mat(ggml_backend_t backend, const ggml_tensor * src0, co
}
if (src1t == GGML_TYPE_F32) {
// heuristic for packing more work for Adreno
const bool adreno_use_lane_split =
backend_ctx->gpu_family == ADRENO &&
ne11 == 1 &&
ne01 >= 8 &&
ne00 % 4 == 0 &&
r3 == 1 && r2 >= 1 && r2 <= 8 &&
(ne12 % r2) == 0;
if (ne11 * ne12 < 4) {
kernel = backend_ctx->kernel_mul_mat_f16_f32_1row;
} else if (adreno_use_lane_split && ne00 >= 64 && ne00 <= 128) {
kernel = backend_ctx->kernel_mul_mat_f16_f32_l4_dr_lq;
nrows = 1;
} else if (adreno_use_lane_split && r2 >= 2 && ne00 > 128 && ne00 <= 256) {
kernel = backend_ctx->kernel_mul_mat_f16_f32_l4_dr_ls;
nrows = 1;
} else if (ne00 >= 128 && ne01 >= 8 && ne00%4 == 0) {
kernel = backend_ctx->kernel_mul_mat_f16_f32_l4;
nrows = ne11;
if (ne11 == 1) {
kernel = backend_ctx->kernel_mul_mat_f16_f32_l4_dr;
nrows = 1; // not used by this kernel
} else {
kernel = backend_ctx->kernel_mul_mat_f16_f32_l4;
nrows = ne11;
}
} else {
kernel = backend_ctx->kernel_mul_mat_f16_f32;
nrows = 4;
@@ -15353,12 +15381,30 @@ static void ggml_cl_mul_mat(ggml_backend_t backend, const ggml_tensor * src0, co
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
} else {
int64_t ny = (ne11 + nrows - 1)/nrows;
if (kernel == backend_ctx->kernel_mul_mat_f16_f32_l4_dr) {
const int NDST_DR = 4;
size_t global_work_size[] = {(size_t)CEIL_DIV(ne01, NDST_DR)*nth0, (size_t)nth1, (size_t)ne12*ne13};
size_t local_work_size[] = {(size_t)nth0, (size_t)nth1, 1};
size_t global_work_size[] = {(size_t)ne01*nth0, (size_t)ny*nth1, (size_t)ne12*ne13};
size_t local_work_size[] = {(size_t)nth0, (size_t)nth1, 1};
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
} else if (kernel == backend_ctx->kernel_mul_mat_f16_f32_l4_dr_ls) {
size_t global_work_size[] = {(size_t)CEIL_DIV(ne01, 2)*nth0, (size_t)nth1, (size_t)ne02*ne03};
size_t local_work_size[] = {(size_t)nth0, (size_t)nth1, 1};
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
} else if (kernel == backend_ctx->kernel_mul_mat_f16_f32_l4_dr_lq) {
size_t global_work_size[] = {(size_t)CEIL_DIV(ne01, 4)*nth0, (size_t)nth1, (size_t)ne02*ne03};
size_t local_work_size[] = {(size_t)nth0, (size_t)nth1, 1};
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
} else {
int64_t ny = (ne11 + nrows - 1)/nrows;
size_t global_work_size[] = {(size_t)ne01*nth0, (size_t)ny*nth1, (size_t)ne12*ne13};
size_t local_work_size[] = {(size_t)nth0, (size_t)nth1, 1};
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
}
}
}
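The branch order of the F16 x F32 kernel choice and the matching grid sizing can be restated compactly. The sketch below is a reading of the hunks above with their +/- markers stripped, so treat it as an approximation of the new logic rather than the code itself. The enum, `demo_pick_kernel`, and `demo_grid_x` are hypothetical names, and the rows-per-workgroup values (4, 2, 4) are taken from the `CEIL_DIV` calls shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_kernel {
    DEMO_K_1ROW,
    DEMO_K_L4_DR_LQ, // quarter-wave split, 4 rows per workgroup
    DEMO_K_L4_DR_LS, // half-wave split, 2 rows per workgroup
    DEMO_K_L4_DR,    // 4 rows per subgroup, assumes ne11 == 1
    DEMO_K_L4,
    DEMO_K_GENERIC,
};

static enum demo_kernel demo_pick_kernel(bool is_adreno, int ne00, int ne01, int ne11, int ne12, int r2, int r3) {
    // Lane-split kernels: single src1 row, vectorizable K, bounded broadcast ratio.
    const bool lane_split = is_adreno && ne11 == 1 && ne01 >= 8 && ne00 % 4 == 0 &&
                            r3 == 1 && r2 >= 1 && r2 <= 8 && (ne12 % r2) == 0;
    if (ne11 * ne12 < 4) {
        return DEMO_K_1ROW;
    }
    if (lane_split && ne00 >= 64 && ne00 <= 128) {
        return DEMO_K_L4_DR_LQ;
    }
    if (lane_split && r2 >= 2 && ne00 > 128 && ne00 <= 256) {
        return DEMO_K_L4_DR_LS;
    }
    if (ne00 >= 128 && ne01 >= 8 && ne00 % 4 == 0) {
        return ne11 == 1 ? DEMO_K_L4_DR : DEMO_K_L4;
    }
    return DEMO_K_GENERIC;
}

// First grid dimension: one workgroup of nth0 threads per group of output rows.
static size_t demo_grid_x(int ne01, int rows_per_group, int nth0) {
    assert(rows_per_group > 0);
    const int n_groups = (ne01 + rows_per_group - 1) / rows_per_group; // ceiling division
    return (size_t) n_groups * (size_t) nth0;
}
```

Note that the lane-split kernels are launched over `ne02*ne03` in the third dimension instead of `ne12*ne13`: each workgroup covers one KV head and loops over its `r2` query heads internally.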
@@ -82,3 +82,299 @@ kernel void kernel_mul_mat_f16_f32_l4(
}
}
}
// Each subgroup produces DR_NDST outputs, assumes ne11 == 1
#define MUL_MAT_F16_F32_L4_DR_NDST 4
#ifdef ADRENO_GPU
REQD_SUBGROUP_SIZE_64
#endif
kernel void kernel_mul_mat_f16_f32_l4_dr(
global char * src0,
ulong offset0,
global char * src1,
ulong offset1,
global float * dst,
ulong offsetd,
int ne00,
int ne01,
int ne02,
ulong nb00,
ulong nb01,
ulong nb02,
ulong nb03,
int ne10,
int ne11,
int ne12,
ulong nb10,
ulong nb11,
ulong nb12,
ulong nb13,
int ne0,
int ne1,
int r2,
int r3
) {
src0 = (global char*)((global char*)src0 + offset0);
src1 = (global char*)((global char*)src1 + offset1);
dst = (global float*)((global char*)dst + offsetd);
const int r0_base = get_group_id(0) * MUL_MAT_F16_F32_L4_DR_NDST;
const int im = get_group_id(2);
const int i12 = im % ne12;
const int i13 = im / ne12;
// assume ne11 == 1
const ulong offset_src1 = i12*nb12 + i13*nb13;
global float4 * y4 = (global float4 *)(src1 + offset_src1);
global half4 * x4[MUL_MAT_F16_F32_L4_DR_NDST];
float sumf[MUL_MAT_F16_F32_L4_DR_NDST];
const ulong k_head_off = (i12/r2)*nb02 + (i13/r3)*nb03;
#pragma unroll
for (int n = 0; n < MUL_MAT_F16_F32_L4_DR_NDST; ++n) {
int r0 = r0_base + n;
int r0c = r0 < ne01 ? r0 : 0;
ulong off = (ulong)r0c*nb01 + k_head_off;
x4[n] = (global half4 *)(src0 + off);
sumf[n] = 0.0f;
}
const int n_chunks = ne00 / 4;
const int sg_size = get_max_sub_group_size();
const int lid = get_sub_group_local_id();
for (int i = lid; i < n_chunks; i += sg_size) {
float4 q = y4[i];
#pragma unroll
for (int n = 0; n < MUL_MAT_F16_F32_L4_DR_NDST; ++n) {
float4 k = convert_float4(x4[n][i]);
sumf[n] = mad(k.s0, q.s0, sumf[n]);
sumf[n] = mad(k.s1, q.s1, sumf[n]);
sumf[n] = mad(k.s2, q.s2, sumf[n]);
sumf[n] = mad(k.s3, q.s3, sumf[n]);
}
}
#pragma unroll
for (int n = 0; n < MUL_MAT_F16_F32_L4_DR_NDST; ++n) {
float reduced = sub_group_reduce_add(sumf[n]);
int r0 = r0_base + n;
if (lid == 0 && r0 < ne01) {
dst[im*ne1*ne0 + r0] = reduced;
}
}
}
// Kernels for decoding, Adreno only for now
#define MUL_MAT_F16_F32_L4_DR_LS_R2_MAX 8
#ifdef ADRENO_GPU
#pragma OPENCL EXTENSION cl_qcom_subgroup_shuffle : enable
#define sub_group_shuffle_xor(val, mask) qcom_sub_group_shuffle_xor((val), (mask), CLK_SUB_GROUP_SHUFFLE_WIDTH_WAVE_SIZE_QCOM, 0.0f)
REQD_SUBGROUP_SIZE_64
kernel void kernel_mul_mat_f16_f32_l4_dr_ls(
global char * src0,
ulong offset0,
global char * src1,
ulong offset1,
global float * dst,
ulong offsetd,
int ne00,
int ne01,
int ne02,
ulong nb00,
ulong nb01,
ulong nb02,
ulong nb03,
int ne10,
int ne11,
int ne12,
ulong nb10,
ulong nb11,
ulong nb12,
ulong nb13,
int ne0,
int ne1,
int r2,
int r3
) {
src0 = (global char*)((global char*)src0 + offset0);
src1 = (global char*)((global char*)src1 + offset1);
dst = (global float*)((global char*)dst + offsetd);
const int r0_base = get_group_id(0) * 2;
const int kv_grp = get_group_id(2); // KV head group; im = kv_grp*r2 + q
const int i12_kv = kv_grp % ne02;
const int i13_kv = kv_grp / ne02;
const int lid = get_sub_group_local_id();
const int subhalf = lid >> 5; // 0 or 1 (which K row in the WG)
const int intra = lid & 31; // 0..31 (lane within the half)
const int r0 = r0_base + subhalf;
const int r0c = r0 < ne01 ? r0 : 0; // clamp OOB to row 0; skip write below
// K row pointer for this lane (one K row per half-wave).
const ulong k_off = (ulong)r0c*nb01 + (ulong)i12_kv*nb02 + (ulong)i13_kv*nb03;
global half4 * x4 = (global half4 *)(src0 + k_off);
global float4 * y4[MUL_MAT_F16_F32_L4_DR_LS_R2_MAX];
#pragma unroll
for (int q = 0; q < MUL_MAT_F16_F32_L4_DR_LS_R2_MAX; ++q) {
const int i12_q = i12_kv*r2 + q;
const ulong q_off = (ulong)i12_q*nb12 + (ulong)i13_kv*nb13;
y4[q] = (global float4 *)(src1 + q_off);
}
float partial[MUL_MAT_F16_F32_L4_DR_LS_R2_MAX];
#pragma unroll
for (int q = 0; q < MUL_MAT_F16_F32_L4_DR_LS_R2_MAX; ++q) {
partial[q] = 0.0f;
}
const int n_chunks = ne00 / 4;
for (int i = intra; i < n_chunks; i += 32) {
float4 k = convert_float4(x4[i]);
#pragma unroll
for (int q = 0; q < MUL_MAT_F16_F32_L4_DR_LS_R2_MAX; ++q) {
if (q < r2) {
float4 v = y4[q][i];
partial[q] = mad(k.s0, v.s0, partial[q]);
partial[q] = mad(k.s1, v.s1, partial[q]);
partial[q] = mad(k.s2, v.s2, partial[q]);
partial[q] = mad(k.s3, v.s3, partial[q]);
}
}
}
// half-wave reduction
#pragma unroll
for (int q = 0; q < MUL_MAT_F16_F32_L4_DR_LS_R2_MAX; ++q) {
if (q < r2) {
partial[q] += sub_group_shuffle_xor(partial[q], 1u);
partial[q] += sub_group_shuffle_xor(partial[q], 2u);
partial[q] += sub_group_shuffle_xor(partial[q], 4u);
partial[q] += sub_group_shuffle_xor(partial[q], 8u);
partial[q] += sub_group_shuffle_xor(partial[q], 16u);
}
}
if (intra == 0 && r0 < ne01) {
#pragma unroll
for (int q = 0; q < MUL_MAT_F16_F32_L4_DR_LS_R2_MAX; ++q) {
if (q < r2) {
const int im = i12_kv*r2 + q + i13_kv*ne12;
dst[im*ne1*ne0 + r0] = partial[q];
}
}
}
}
REQD_SUBGROUP_SIZE_64
kernel void kernel_mul_mat_f16_f32_l4_dr_lq(
global char * src0,
ulong offset0,
global char * src1,
ulong offset1,
global float * dst,
ulong offsetd,
int ne00,
int ne01,
int ne02,
ulong nb00,
ulong nb01,
ulong nb02,
ulong nb03,
int ne10,
int ne11,
int ne12,
ulong nb10,
ulong nb11,
ulong nb12,
ulong nb13,
int ne0,
int ne1,
int r2,
int r3
) {
src0 = (global char*)((global char*)src0 + offset0);
src1 = (global char*)((global char*)src1 + offset1);
dst = (global float*)((global char*)dst + offsetd);
const int r0_base = get_group_id(0) * 4;
const int kv_grp = get_group_id(2);
const int i12_kv = kv_grp % ne02;
const int i13_kv = kv_grp / ne02;
const int lid = get_sub_group_local_id();
const int subq = lid >> 4; // 0..3 (which K row)
const int intra = lid & 15; // 0..15 (lane within quarter)
const int r0 = r0_base + subq;
const int r0c = r0 < ne01 ? r0 : 0;
const ulong k_off = (ulong)r0c*nb01 + (ulong)i12_kv*nb02 + (ulong)i13_kv*nb03;
global half4 * x4 = (global half4 *)(src0 + k_off);
global float4 * y4[MUL_MAT_F16_F32_L4_DR_LS_R2_MAX];
#pragma unroll
for (int q = 0; q < MUL_MAT_F16_F32_L4_DR_LS_R2_MAX; ++q) {
const int i12_q = i12_kv*r2 + q;
const ulong q_off = (ulong)i12_q*nb12 + (ulong)i13_kv*nb13;
y4[q] = (global float4 *)(src1 + q_off);
}
float partial[MUL_MAT_F16_F32_L4_DR_LS_R2_MAX];
#pragma unroll
for (int q = 0; q < MUL_MAT_F16_F32_L4_DR_LS_R2_MAX; ++q) {
partial[q] = 0.0f;
}
const int n_chunks = ne00 / 4;
for (int i = intra; i < n_chunks; i += 16) {
float4 k = convert_float4(x4[i]);
#pragma unroll
for (int q = 0; q < MUL_MAT_F16_F32_L4_DR_LS_R2_MAX; ++q) {
if (q < r2) {
float4 v = y4[q][i];
partial[q] = mad(k.s0, v.s0, partial[q]);
partial[q] = mad(k.s1, v.s1, partial[q]);
partial[q] = mad(k.s2, v.s2, partial[q]);
partial[q] = mad(k.s3, v.s3, partial[q]);
}
}
}
// quarter-wave reduction
#pragma unroll
for (int q = 0; q < MUL_MAT_F16_F32_L4_DR_LS_R2_MAX; ++q) {
if (q < r2) {
partial[q] += sub_group_shuffle_xor(partial[q], 1u);
partial[q] += sub_group_shuffle_xor(partial[q], 2u);
partial[q] += sub_group_shuffle_xor(partial[q], 4u);
partial[q] += sub_group_shuffle_xor(partial[q], 8u);
}
}
if (intra == 0 && r0 < ne01) {
#pragma unroll
for (int q = 0; q < MUL_MAT_F16_F32_L4_DR_LS_R2_MAX; ++q) {
if (q < r2) {
const int im = i12_kv*r2 + q + i13_kv*ne12;
dst[im*ne1*ne0 + r0] = partial[q];
}
}
}
}
#endif // ADRENO_GPU
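The quarter-wave reduction at the end of the kernel above can be checked on the host. The sketch below is plain C++, not OpenCL. It assumes a 64-lane subgroup (implied by `lid >> 4` taking the values 0..3) and emulates `sub_group_shuffle_xor` with a hypothetical `shuffle_xor` helper. Every mask is below 16, so `lane ^ mask` never leaves its 16-lane quarter, and after four steps each lane holds the sum of its own quarter.

```cpp
#include <array>
#include <cassert>

// Assumed subgroup width; the kernel's lid >> 4 and lid & 15 split it into four quarters.
constexpr int kLanes = 64;

// Hypothetical host stand-in for sub_group_shuffle_xor: lane i reads lane (i ^ mask).
std::array<float, kLanes> shuffle_xor(const std::array<float, kLanes> & v, unsigned mask) {
    std::array<float, kLanes> out{};
    for (int lane = 0; lane < kLanes; ++lane) {
        out[lane] = v[lane ^ mask];
    }
    return out;
}

// Butterfly reduction with masks 1, 2, 4, 8, mirroring the four shuffle steps in the kernel.
std::array<float, kLanes> quarter_wave_reduce(std::array<float, kLanes> partial) {
    for (unsigned mask = 1; mask <= 8; mask <<= 1) {
        const std::array<float, kLanes> other = shuffle_xor(partial, mask);
        for (int lane = 0; lane < kLanes; ++lane) {
            partial[lane] += other[lane];
        }
    }
    return partial;
}
```

Seeding lane i with the value i leaves lanes 0, 16, 32 and 48 holding 120, 376, 632 and 888, the sums of 0..15, 16..31, 32..47 and 48..63. This is why the kernel only lets `intra == 0` write the result.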
+5 -5
@@ -39,8 +39,8 @@ if (WIN32)
set(CMAKE_CXX_COMPILER "icx")
set(CMAKE_CXX_COMPILER_ID "IntelLLVM")
endif()
-# Level Zero SDK path for Windows (only when GGML_SYCL_SUPPORT_LEVEL_ZERO is enabled)
-if(GGML_SYCL_SUPPORT_LEVEL_ZERO)
+# Level Zero SDK path for Windows (only when GGML_SYCL_SUPPORT_LEVEL_ZERO_API is enabled)
+if(GGML_SYCL_SUPPORT_LEVEL_ZERO_API)
if(DEFINED ENV{LEVEL_ZERO_V1_SDK_PATH})
set(LEVEL_ZERO_V1_SDK_PATH $ENV{LEVEL_ZERO_V1_SDK_PATH})
if(EXISTS "${LEVEL_ZERO_V1_SDK_PATH}")
@@ -105,8 +105,8 @@ endif()
target_compile_options(ggml-sycl PRIVATE "-Wno-narrowing")
-message(STATUS "GGML_SYCL_SUPPORT_LEVEL_ZERO ${GGML_SYCL_SUPPORT_LEVEL_ZERO}")
-if (GGML_SYCL_SUPPORT_LEVEL_ZERO)
+message(STATUS "GGML_SYCL_SUPPORT_LEVEL_ZERO_API ${GGML_SYCL_SUPPORT_LEVEL_ZERO_API}")
+if (GGML_SYCL_SUPPORT_LEVEL_ZERO_API)
# Link against Level Zero loader for direct device memory allocation.
# Avoids sycl::malloc_device triggering DMA-buf/TTM system RAM staging
# in the xe kernel driver during multi-GPU inference.
@@ -114,7 +114,7 @@ if (GGML_SYCL_SUPPORT_LEVEL_ZERO)
find_library(ZE_LOADER_LIB ze_loader HINTS ${ONEAPI_ROOT}/lib ${LEVEL_ZERO_V1_SDK_LIB_PATH} ENV LD_LIBRARY_PATH)
if(ZE_LOADER_LIB AND LEVEL_ZERO_INCLUDE_DIR)
target_link_libraries(ggml-sycl PRIVATE ${ZE_LOADER_LIB})
-target_compile_definitions(ggml-sycl PRIVATE GGML_SYCL_SUPPORT_LEVEL_ZERO)
+target_compile_definitions(ggml-sycl PRIVATE GGML_SYCL_SUPPORT_LEVEL_ZERO_API)
message(STATUS "Level Zero loader found: ${ZE_LOADER_LIB}")
message(STATUS "Level Zero headers found: ${LEVEL_ZERO_INCLUDE_DIR}")
else()
+1
@@ -17,6 +17,7 @@
#include "common.hpp"
#include "concat.hpp"
#include "conv.hpp"
#include "conv3d.hpp"
#include "convert.hpp"
#include "count-equal.hpp"
#include "cpy.hpp"
+5 -5
@@ -12,7 +12,7 @@
#include "common.hpp"
#include <sycl/backend.hpp>
-#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
+#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO_API
#include <level_zero/ze_api.h>
#endif
@@ -84,9 +84,9 @@ int64_t downsample_sycl_global_range(int64_t accumulate_block_num, int64_t block
return sycl_down_blk_size;
}
-#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
+#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO_API
static bool ggml_sycl_use_level_zero_device_alloc(sycl::queue &q) {
-return g_ggml_sycl_enable_level_zero &&
+return g_ggml_sycl_use_level_zero_api &&
q.get_device().is_gpu() &&
q.get_backend() == sycl::backend::ext_oneapi_level_zero;
}
@@ -95,7 +95,7 @@ static bool ggml_sycl_use_level_zero_device_alloc(sycl::queue &q) {
// Use Level Zero zeMemAllocDevice to avoid sycl::malloc_device triggering
// DMA-buf/TTM system RAM staging in the xe kernel driver during multi-GPU inference.
void * ggml_sycl_malloc_device(size_t size, sycl::queue &q) {
-#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
+#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO_API
if (ggml_sycl_use_level_zero_device_alloc(q)) {
void *ptr = nullptr;
auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
@@ -127,7 +127,7 @@ void * ggml_sycl_malloc_device(size_t size, sycl::queue &q) {
void ggml_sycl_free_device(void *ptr, sycl::queue &q) {
if (!ptr) return;
-#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
+#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO_API
if (ggml_sycl_use_level_zero_device_alloc(q)) {
auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
zeMemFree(ze_ctx, ptr);
+7 -1
@@ -62,6 +62,7 @@ extern int g_ggml_sycl_debug;
extern int g_ggml_sycl_disable_optimize;
extern int g_ggml_sycl_prioritize_dmmv;
extern int g_ggml_sycl_enable_flash_attention;
extern int g_ggml_sycl_dev2dev_memcpy;
#if defined(__clang__) && __has_builtin(__builtin_expect)
@@ -126,6 +127,11 @@ enum ggml_sycl_backend_gpu_mode {
SYCL_MUL_GPU_MODE
};
enum ggml_sycl_dev2dev_memcpy_mode {
DEV2DEV_MEMCPY_SYCL = 0,
DEV2DEV_MEMCPY_L0 = 1,
};
static_assert(sizeof(sycl::half) == sizeof(ggml_fp16_t), "wrong fp16 size");
static void crash() {
@@ -318,7 +324,7 @@ struct ggml_tensor_extra_gpu {
optimize_feature optimized_feature;
};
-extern int g_ggml_sycl_enable_level_zero;
+extern int g_ggml_sycl_use_level_zero_api;
void * ggml_sycl_malloc_device(size_t size, sycl::queue &q);
void ggml_sycl_free_device(void *ptr, sycl::queue &q);
+158
@@ -0,0 +1,158 @@
#include "conv2d-dw.hpp"
struct conv2d_dw_params {
int in_w, in_h;
int out_w, out_h;
int kernel_w, kernel_h;
int stride_x, stride_y;
int padding_x, padding_y;
int dilation_x, dilation_y;
int channels, batches;
};
struct conv2d_dw_kernel_bounds {
int y_min, y_max;
int x_min, x_max;
};
static inline conv2d_dw_kernel_bounds dw_calculate_kernel_bounds(int out_x, int out_y,
const conv2d_dw_params & p) {
conv2d_dw_kernel_bounds bounds;
bounds.y_min = sycl::max(0, (p.padding_y - out_y * p.stride_y + p.dilation_y - 1) / p.dilation_y);
bounds.y_max = sycl::min(p.kernel_h,
(p.in_h + p.padding_y - out_y * p.stride_y + p.dilation_y - 1) / p.dilation_y);
bounds.x_min = sycl::max(0, (p.padding_x - out_x * p.stride_x + p.dilation_x - 1) / p.dilation_x);
bounds.x_max = sycl::min(p.kernel_w,
(p.in_w + p.padding_x - out_x * p.stride_x + p.dilation_x - 1) / p.dilation_x);
return bounds;
}
static inline int dw_calculate_input_coord(int out_coord, int kern_coord, int stride, int dilation, int padding) {
return out_coord * stride + kern_coord * dilation - padding;
}
// whcn layout: input/output stored as [N, C, H, W]
struct dw_whcn_layout {
static int input_index(int n, int c, int y, int x, const conv2d_dw_params & p) {
return n * (p.channels * p.in_w * p.in_h) + c * p.in_w * p.in_h + y * p.in_w + x;
}
static int kernel_index(int c, int ky, int kx, const conv2d_dw_params & p) {
return c * p.kernel_h * p.kernel_w + ky * p.kernel_w + kx;
}
static int output_index(int n, int c, int y, int x, const conv2d_dw_params & p) {
return n * (p.channels * p.out_w * p.out_h) + c * p.out_w * p.out_h + y * p.out_w + x;
}
static void unpack_indices(int global_idx, const conv2d_dw_params & p,
int & n, int & c, int & out_y, int & out_x) {
out_x = global_idx % p.out_w;
out_y = (global_idx / p.out_w) % p.out_h;
c = (global_idx / (p.out_w * p.out_h)) % p.channels;
n = global_idx / (p.out_w * p.out_h * p.channels);
}
};
// cwhn layout: input/output stored as [N, H, W, C]
struct dw_cwhn_layout {
static int input_index(int n, int c, int y, int x, const conv2d_dw_params & p) {
return n * (p.channels * p.in_w * p.in_h) + (y * p.in_w + x) * p.channels + c;
}
static int kernel_index(int c, int ky, int kx, const conv2d_dw_params & p) {
return (ky * p.kernel_w + kx) * p.channels + c;
}
static int output_index(int n, int c, int y, int x, const conv2d_dw_params & p) {
return n * (p.channels * p.out_w * p.out_h) + y * (p.out_w * p.channels) + x * p.channels + c;
}
static void unpack_indices(int global_idx, const conv2d_dw_params & p,
int & n, int & c, int & out_y, int & out_x) {
c = global_idx % p.channels;
out_x = (global_idx / p.channels) % p.out_w;
out_y = (global_idx / (p.channels * p.out_w)) % p.out_h;
n = global_idx / (p.channels * p.out_w * p.out_h);
}
};
template <typename Layout>
static void conv2d_dw_kernel(const float * input, const float * kernel, float * output,
const conv2d_dw_params p, const sycl::nd_item<3> & item_ct1) {
const int global_idx = item_ct1.get_local_id(2) +
item_ct1.get_group(2) * item_ct1.get_local_range(2);
const int total_elements = p.batches * p.channels * p.out_h * p.out_w;
if (global_idx >= total_elements) {
return;
}
int n, c, out_y, out_x;
Layout::unpack_indices(global_idx, p, n, c, out_y, out_x);
float acc = 0.0f;
const conv2d_dw_kernel_bounds bounds = dw_calculate_kernel_bounds(out_x, out_y, p);
for (int ky = bounds.y_min; ky < bounds.y_max; ++ky) {
const int in_y = dw_calculate_input_coord(out_y, ky, p.stride_y, p.dilation_y, p.padding_y);
for (int kx = bounds.x_min; kx < bounds.x_max; ++kx) {
const int in_x = dw_calculate_input_coord(out_x, kx, p.stride_x, p.dilation_x, p.padding_x);
acc += input[Layout::input_index(n, c, in_y, in_x, p)] *
kernel[Layout::kernel_index(c, ky, kx, p)];
}
}
output[Layout::output_index(n, c, out_y, out_x, p)] = acc;
}
template <typename Layout>
static void conv2d_dw_sycl(const float * x_d, const float * w_d, float * y_d,
const conv2d_dw_params p, const queue_ptr & stream) {
const int total = p.batches * p.channels * p.out_h * p.out_w;
const int num_blocks = (total + SYCL_CONV2D_DW_BLOCK_SIZE - 1) / SYCL_CONV2D_DW_BLOCK_SIZE;
const sycl::range<3> block_dims(1, 1, SYCL_CONV2D_DW_BLOCK_SIZE);
const sycl::range<3> block_nums(1, 1, num_blocks);
stream->parallel_for(sycl::nd_range<3>(block_nums * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) {
conv2d_dw_kernel<Layout>(x_d, w_d, y_d, p, item_ct1);
});
}
void ggml_sycl_op_conv2d_dw(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
scope_op_debug_print scope_dbg_print(__func__, dst, /*num_src=*/2);
const ggml_tensor * kernel = dst->src[0];
const ggml_tensor * input = dst->src[1];
GGML_ASSERT(kernel->type == GGML_TYPE_F32 && input->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
const float * w_d = (const float *) kernel->data;
const float * x_d = (const float *) input->data;
float * y_d = (float *) dst->data;
const int32_t * p = (const int32_t *) dst->op_params;
const int stride_x = p[0];
const int stride_y = p[1];
const int padding_x = p[2];
const int padding_y = p[3];
const int dilation_x = p[4];
const int dilation_y = p[5];
const int in_w = input->ne[0];
const int in_h = input->ne[1];
const int kernel_w = kernel->ne[0];
const int kernel_h = kernel->ne[1];
const int out_w = dst->ne[0];
const int out_h = dst->ne[1];
const int channels = dst->ne[2];
const int batches = dst->ne[3];
const conv2d_dw_params params = { in_w, in_h, out_w, out_h, kernel_w, kernel_h,
stride_x, stride_y, padding_x, padding_y,
dilation_x, dilation_y, channels, batches };
const queue_ptr stream = ctx.stream();
if (ggml_is_contiguous(input)) {
conv2d_dw_sycl<dw_whcn_layout>(x_d, w_d, y_d, params, stream);
} else if (ggml_is_contiguous_channels(input)) {
conv2d_dw_sycl<dw_cwhn_layout>(x_d, w_d, y_d, params, stream);
} else {
GGML_ABORT("Unsupported memory layout for conv2d_dw");
}
}
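The two layout structs in the file above differ only in which coordinate varies fastest. The following host-side sketch restates their output indexing with a hypothetical `out_dims` struct in place of `conv2d_dw_params`, and checks that unpacking a flat index and re-indexing it is the identity in both layouts. It is an illustration of the index arithmetic, not code from the backend.

```cpp
#include <cassert>

// Hypothetical stand-in for the output extents stored in conv2d_dw_params.
struct out_dims { int w, h, c, n; };

// whcn: x varies fastest, then y, then channel, then batch.
int whcn_index(int n, int c, int y, int x, const out_dims & d) {
    return n * (d.c * d.w * d.h) + c * d.w * d.h + y * d.w + x;
}

// cwhn: channel varies fastest, then x, then y, then batch.
int cwhn_index(int n, int c, int y, int x, const out_dims & d) {
    return n * (d.c * d.w * d.h) + (y * d.w + x) * d.c + c;
}

void whcn_unpack(int idx, const out_dims & d, int & n, int & c, int & y, int & x) {
    x = idx % d.w;
    y = (idx / d.w) % d.h;
    c = (idx / (d.w * d.h)) % d.c;
    n = idx / (d.w * d.h * d.c);
}

void cwhn_unpack(int idx, const out_dims & d, int & n, int & c, int & y, int & x) {
    c = idx % d.c;
    x = (idx / d.c) % d.w;
    y = (idx / (d.c * d.w)) % d.h;
    n = idx / (d.c * d.w * d.h);
}

// True when unpack followed by index reproduces every flat index in both layouts.
bool layouts_round_trip(const out_dims & d) {
    const int total = d.w * d.h * d.c * d.n;
    for (int idx = 0; idx < total; ++idx) {
        int n, c, y, x;
        whcn_unpack(idx, d, n, c, y, x);
        if (whcn_index(n, c, y, x, d) != idx) return false;
        cwhn_unpack(idx, d, n, c, y, x);
        if (cwhn_index(n, c, y, x, d) != idx) return false;
    }
    return true;
}
```

Because the round trip holds, consecutive work-items write consecutive output addresses in either layout, which is presumably why the kernel unpacks `global_idx` through the `Layout` type instead of fixing one order.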
+10
@@ -0,0 +1,10 @@
#ifndef GGML_SYCL_CONV2D_DW_HPP
#define GGML_SYCL_CONV2D_DW_HPP
#include "common.hpp"
#define SYCL_CONV2D_DW_BLOCK_SIZE 256
void ggml_sycl_op_conv2d_dw(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
#endif // GGML_SYCL_CONV2D_DW_HPP
+125
@@ -0,0 +1,125 @@
#include "conv2d-transpose.hpp"
#include "convert.hpp"
template <typename kernel_t>
static void conv2d_transpose_kernel(const float * input, const kernel_t * kernel, float * output,
const int in_w, const int in_h,
const int out_w, const int out_h,
const int kernel_w, const int kernel_h,
const int stride,
const int c_in, const int c_out, const int batches,
const sycl::nd_item<3> & item_ct1) {
const int global_idx = item_ct1.get_local_id(2) +
item_ct1.get_group(2) * item_ct1.get_local_range(2);
const int total_elements = out_w * out_h * c_out * batches;
if (global_idx >= total_elements) {
return;
}
const int out_x = global_idx % out_w;
const int out_y = (global_idx / out_w) % out_h;
const int c_idx = (global_idx / (out_w * out_h)) % c_out;
const int n_idx = global_idx / (out_w * out_h * c_out);
float acc = 0.0f;
for (int c_in_idx = 0; c_in_idx < c_in; ++c_in_idx) {
for (int kh = 0; kh < kernel_h; ++kh) {
int in_y = out_y - kh;
if (in_y < 0 || in_y % stride) {
continue;
}
in_y /= stride;
if (in_y >= in_h) {
continue;
}
for (int kw = 0; kw < kernel_w; ++kw) {
int in_x = out_x - kw;
if (in_x < 0 || in_x % stride) {
continue;
}
in_x /= stride;
if (in_x >= in_w) {
continue;
}
const int input_idx = (in_w * in_h * c_in) * n_idx + (in_w * in_h) * c_in_idx + in_w * in_y + in_x;
const int kernel_idx = (kernel_h * kernel_w * c_out) * c_in_idx + (kernel_h * kernel_w) * c_idx +
kernel_w * kh + kw;
acc += input[input_idx] * ggml_sycl_cast<float>(kernel[kernel_idx]);
}
}
}
output[(out_w * out_h * c_out) * n_idx + (out_w * out_h) * c_idx + out_w * out_y + out_x] = acc;
}
template <typename kernel_t>
static void conv2d_transpose_sycl(const float * input_d, const kernel_t * kernel_d, float * output_d,
const int in_w, const int in_h,
const int out_w, const int out_h,
const int kernel_w, const int kernel_h,
const int stride,
const int c_in, const int c_out, const int batches,
const queue_ptr & stream) {
const int total = out_w * out_h * c_out * batches;
const int num_blocks = (total + SYCL_CONV2D_TRANSPOSE_BLOCK_SIZE - 1) / SYCL_CONV2D_TRANSPOSE_BLOCK_SIZE;
const sycl::range<3> block_dims(1, 1, SYCL_CONV2D_TRANSPOSE_BLOCK_SIZE);
const sycl::range<3> block_nums(1, 1, num_blocks);
stream->parallel_for(sycl::nd_range<3>(block_nums * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) {
conv2d_transpose_kernel<kernel_t>(input_d, kernel_d, output_d,
in_w, in_h, out_w, out_h, kernel_w, kernel_h,
stride, c_in, c_out, batches, item_ct1);
});
}
// input: (W, H, C_in, N)
// kernel: (W, H, C_out, C_in)
// output: (W, H, C_out, N)
void ggml_sycl_op_conv2d_transpose(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
scope_op_debug_print scope_dbg_print(__func__, dst, /*num_src=*/2);
const ggml_tensor * kernel = dst->src[0];
const ggml_tensor * input = dst->src[1];
GGML_ASSERT(kernel->type == GGML_TYPE_F16 || kernel->type == GGML_TYPE_F32);
GGML_ASSERT(input->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
GGML_ASSERT(ggml_is_contiguous(input));
GGML_ASSERT(ggml_is_contiguous(kernel));
GGML_ASSERT(ggml_is_contiguous(dst));
const float * input_d = (const float *) input->data;
float * output_d = (float *) dst->data;
const void * kernel_d = kernel->data;
const int input_w = input->ne[0];
const int input_h = input->ne[1];
const int channels_in = input->ne[2];
const int batches = input->ne[3];
const int output_w = dst->ne[0];
const int output_h = dst->ne[1];
const int channels_out = kernel->ne[2];
const int kernel_w = kernel->ne[0];
const int kernel_h = kernel->ne[1];
const int stride = dst->op_params[0];
GGML_ASSERT(channels_in == kernel->ne[3]);
GGML_ASSERT(stride > 0);
const queue_ptr stream = ctx.stream();
if (kernel->type == GGML_TYPE_F16) {
conv2d_transpose_sycl<sycl::half>(input_d, (const sycl::half *) kernel_d, output_d,
input_w, input_h, output_w, output_h, kernel_w, kernel_h,
stride, channels_in, channels_out, batches, stream);
} else {
conv2d_transpose_sycl<float>(input_d, (const float *) kernel_d, output_d,
input_w, input_h, output_w, output_h, kernel_w, kernel_h,
stride, channels_in, channels_out, batches, stream);
}
}
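The transposed-convolution kernel above uses a gather formulation: each output element looks backwards for the input positions that could have contributed to it, skipping offsets that are not a multiple of the stride. A one-axis, one-channel host sketch (function names are hypothetical, and the output length `(in - 1) * stride + kernel` assumes zero padding) shows that this matches the more familiar scatter definition.

```cpp
#include <cassert>
#include <vector>

// Gather form: same skip conditions as the kernel's inner loops, restricted to one axis.
std::vector<float> conv_transpose_1d_gather(const std::vector<float> & in,
                                            const std::vector<float> & w, int stride) {
    const int in_len  = (int) in.size();
    const int k_len   = (int) w.size();
    const int out_len = (in_len - 1) * stride + k_len;  // assumes no padding
    std::vector<float> out(out_len, 0.0f);
    for (int o = 0; o < out_len; ++o) {
        for (int k = 0; k < k_len; ++k) {
            int i = o - k;
            if (i < 0 || i % stride) continue;  // not reachable from any input position
            i /= stride;
            if (i >= in_len) continue;
            out[o] += in[i] * w[k];
        }
    }
    return out;
}

// Scatter form: each input element is spread over the output through the kernel taps.
std::vector<float> conv_transpose_1d_scatter(const std::vector<float> & in,
                                             const std::vector<float> & w, int stride) {
    const int in_len  = (int) in.size();
    const int k_len   = (int) w.size();
    std::vector<float> out((in_len - 1) * stride + k_len, 0.0f);
    for (int i = 0; i < in_len; ++i) {
        for (int k = 0; k < k_len; ++k) {
            out[i * stride + k] += in[i] * w[k];
        }
    }
    return out;
}
```

The gather form suits a GPU kernel because each work-item owns exactly one output element, so no atomic adds are needed.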
+10
@@ -0,0 +1,10 @@
#ifndef GGML_SYCL_CONV2D_TRANSPOSE_HPP
#define GGML_SYCL_CONV2D_TRANSPOSE_HPP
#include "common.hpp"
#define SYCL_CONV2D_TRANSPOSE_BLOCK_SIZE 256
void ggml_sycl_op_conv2d_transpose(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
#endif // GGML_SYCL_CONV2D_TRANSPOSE_HPP
+150
@@ -0,0 +1,150 @@
#include "conv2d.hpp"
#include "convert.hpp"
struct conv2d_params {
const int64_t IW, IH;
const int64_t OW, OH;
const int64_t KW, KH;
const int64_t ST_X, ST_Y;
const int64_t PD_X, PD_Y;
const int64_t DL_X, DL_Y;
const int64_t IC, OC;
const int64_t B;
const int64_t TOTAL;
};
struct conv2d_kernel_bounds {
int64_t y_min, y_max;
int64_t x_min, x_max;
};
static inline int64_t conv2d_max64(int64_t a, int64_t b) {
return (a > b) ? a : b;
}
static inline int64_t conv2d_min64(int64_t a, int64_t b) {
return (a < b) ? a : b;
}
static inline conv2d_kernel_bounds calculate_kernel_bounds(int64_t out_x, int64_t out_y, const conv2d_params & P) {
conv2d_kernel_bounds bounds;
bounds.y_min = conv2d_max64(0, (P.PD_Y - out_y * P.ST_Y + P.DL_Y - 1) / P.DL_Y);
bounds.y_max = conv2d_min64(P.KH, (P.IH + P.PD_Y - out_y * P.ST_Y + P.DL_Y - 1) / P.DL_Y);
bounds.x_min = conv2d_max64(0, (P.PD_X - out_x * P.ST_X + P.DL_X - 1) / P.DL_X);
bounds.x_max = conv2d_min64(P.KW, (P.IW + P.PD_X - out_x * P.ST_X + P.DL_X - 1) / P.DL_X);
return bounds;
}
static inline int64_t calculate_input_coord(int64_t out_coord, int64_t kern_coord, int64_t stride,
int64_t dilation, int64_t padding) {
return out_coord * stride + kern_coord * dilation - padding;
}
// whcn layout helpers (matching ggml tensor memory order)
static inline int64_t whcn_input_index(int64_t n, int64_t c, int64_t y, int64_t x, const conv2d_params & P) {
return n * (P.IC * P.IW * P.IH) + c * P.IW * P.IH + y * P.IW + x;
}
static inline int64_t whcn_kernel_index(int64_t c_out, int64_t c_in, int64_t ky, int64_t kx, const conv2d_params & P) {
return c_out * (P.IC * P.KH * P.KW) + c_in * (P.KH * P.KW) + ky * P.KW + kx;
}
static inline int64_t whcn_output_index(int64_t n, int64_t c, int64_t y, int64_t x, const conv2d_params & P) {
return n * (P.OC * P.OW * P.OH) + c * P.OW * P.OH + y * P.OW + x;
}
template <typename T>
static void conv2d_kernel(const float * input, const T * kernel, float * output,
const conv2d_params P, const sycl::nd_item<3> & item_ct1) {
const int64_t global_idx = item_ct1.get_local_id(2) +
item_ct1.get_group(2) * item_ct1.get_local_range(2);
if (global_idx >= P.TOTAL) {
return;
}
const int64_t out_x = global_idx % P.OW;
const int64_t out_y = (global_idx / P.OW) % P.OH;
const int64_t c_out = (global_idx / (P.OW * P.OH)) % P.OC;
const int64_t n = global_idx / (P.OW * P.OH * P.OC);
float acc = 0.0f;
const conv2d_kernel_bounds bounds = calculate_kernel_bounds(out_x, out_y, P);
for (int64_t c_in = 0; c_in < P.IC; ++c_in) {
for (int64_t ky = bounds.y_min; ky < bounds.y_max; ++ky) {
const int64_t in_y = calculate_input_coord(out_y, ky, P.ST_Y, P.DL_Y, P.PD_Y);
for (int64_t kx = bounds.x_min; kx < bounds.x_max; ++kx) {
const int64_t in_x = calculate_input_coord(out_x, kx, P.ST_X, P.DL_X, P.PD_X);
const float input_val = input[whcn_input_index(n, c_in, in_y, in_x, P)];
const T kernel_val = kernel[whcn_kernel_index(c_out, c_in, ky, kx, P)];
acc += input_val * ggml_sycl_cast<float>(kernel_val);
}
}
}
output[whcn_output_index(n, c_out, out_y, out_x, P)] = acc;
}
template <typename T>
static void conv2d_sycl(const float * X_D, const T * K_D, float * Y_D,
const conv2d_params P, const queue_ptr & stream) {
const int num_blocks = (P.TOTAL + SYCL_CONV2D_BLOCK_SIZE - 1) / SYCL_CONV2D_BLOCK_SIZE;
const sycl::range<3> block_dims(1, 1, SYCL_CONV2D_BLOCK_SIZE);
const sycl::range<3> block_nums(1, 1, num_blocks);
stream->parallel_for(sycl::nd_range<3>(block_nums * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) {
conv2d_kernel<T>(X_D, K_D, Y_D, P, item_ct1);
});
}
void ggml_sycl_op_conv2d(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
scope_op_debug_print scope_dbg_print(__func__, dst, /*num_src=*/2);
const ggml_tensor * kernel = dst->src[0];
const ggml_tensor * input = dst->src[1];
const float * K_D = (const float *) kernel->data;
const float * X_D = (const float *) input->data;
float * Y_D = (float *) dst->data;
GGML_ASSERT(ggml_is_contiguous(kernel));
GGML_ASSERT(kernel->type == GGML_TYPE_F16 || kernel->type == GGML_TYPE_F32);
GGML_ASSERT(input->type == GGML_TYPE_F32);
GGML_ASSERT(dst->type == GGML_TYPE_F32);
// same number of input channels
GGML_ASSERT(input->ne[2] == kernel->ne[2]);
const queue_ptr stream = ctx.stream();
const int32_t * p = (const int32_t *) dst->op_params;
const int ST_X = p[0];
const int ST_Y = p[1];
const int PD_X = p[2];
const int PD_Y = p[3];
const int DL_X = p[4];
const int DL_Y = p[5];
// no cwhn layout support
GGML_ASSERT(p[6] == 0);
const int IW = input->ne[0];
const int IH = input->ne[1];
const int OW = dst->ne[0];
const int OH = dst->ne[1];
const int KW = kernel->ne[0];
const int KH = kernel->ne[1];
const int IC = input->ne[2];
const int OC = kernel->ne[3];
const int B = input->ne[3];
const int64_t total = (int64_t) B * OC * OH * OW;
const conv2d_params params = { IW, IH, OW, OH, KW, KH, ST_X, ST_Y, PD_X, PD_Y, DL_X, DL_Y, IC, OC, B, total };
if (kernel->type == GGML_TYPE_F16) {
conv2d_sycl<sycl::half>(X_D, (const sycl::half *) K_D, Y_D, params, stream);
} else {
conv2d_sycl<float>(X_D, K_D, Y_D, params, stream);
}
}
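The bounds helper in the file above restricts the tap loops to kernel positions whose input coordinate is in range, so the inner loop needs no per-tap bounds check. A one-axis host sketch of the same arithmetic, compared against a brute-force test of every tap, is given below. The names `tap_range`, `valid_taps` and `agrees_with_brute_force` are illustrative, and the output-size formula is the usual convolution one, assumed here instead of taken from the diff.

```cpp
#include <algorithm>
#include <cassert>

struct tap_range { int lo, hi; };  // half-open range [lo, hi) of kernel taps

// Same arithmetic as calculate_kernel_bounds, for a single axis.
tap_range valid_taps(int out, int in_size, int kernel, int stride, int padding, int dilation) {
    tap_range r;
    r.lo = std::max(0, (padding - out * stride + dilation - 1) / dilation);
    r.hi = std::min(kernel, (in_size + padding - out * stride + dilation - 1) / dilation);
    return r;
}

// Checks that a tap lies in [lo, hi) exactly when its input coordinate is inside the input.
bool agrees_with_brute_force(int in_size, int kernel, int stride, int padding, int dilation) {
    // Assumed output size: standard convolution formula.
    const int out_size = (in_size + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1;
    for (int out = 0; out < out_size; ++out) {
        const tap_range r = valid_taps(out, in_size, kernel, stride, padding, dilation);
        for (int k = 0; k < kernel; ++k) {
            const int  coord    = out * stride + k * dilation - padding;
            const bool inside   = coord >= 0 && coord < in_size;
            const bool in_range = k >= r.lo && k < r.hi;
            if (inside != in_range) return false;
        }
    }
    return true;
}
```

For a 3-tap kernel with stride 1 and padding 1 on a 5-element input, output 0 uses taps 1..2 and output 4 uses taps 0..1. Integer division truncates toward zero, so a negative numerator can round up to 0, but the `max` and `min` clamps make that harmless.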
+10
@@ -0,0 +1,10 @@
#ifndef GGML_SYCL_CONV2D_HPP
#define GGML_SYCL_CONV2D_HPP
#include "common.hpp"
#define SYCL_CONV2D_BLOCK_SIZE 256
void ggml_sycl_op_conv2d(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
#endif // GGML_SYCL_CONV2D_HPP
+218
@@ -0,0 +1,218 @@
#include "conv3d.hpp"
static inline int64_t ggml_sycl_conv3d_calc_patch_total(const ggml_tensor * dst, int32_t n) {
return (int64_t) n * dst->ne[0] * dst->ne[1] * dst->ne[2];
}
static inline int64_t ggml_sycl_conv3d_calc_knl_n_total(const ggml_tensor * src0, int32_t c) {
return (int64_t) src0->ne[0] * src0->ne[1] * src0->ne[2] * c;
}
static inline void ggml_sycl_conv3d_write_output(
const ggml_tensor * dst,
const float * src, float * dst_data,
int64_t patch_total, int64_t oc,
int64_t dst_w, int64_t dst_h, int64_t dst_d,
dpct::queue_ptr stream) {
const int64_t dst_nb0 = dst->nb[0];
const int64_t dst_nb1 = dst->nb[1];
const int64_t dst_nb2 = dst->nb[2];
const int64_t dst_nb3 = dst->nb[3];
const int64_t total = patch_total * oc;
const int64_t block_size = 256;
const int64_t num_work_items = ((total + block_size - 1) / block_size) * block_size;
stream->parallel_for(sycl::range<1>(num_work_items), [=](sycl::id<1> id) {
const int64_t i = id[0];
if (i >= total) {
return;
}
const int64_t patch_idx = i / oc;
const int64_t out_ch = i % oc;
const int64_t p_in_batch = patch_idx % (dst_w * dst_h * dst_d);
const int64_t batch_idx = patch_idx / (dst_w * dst_h * dst_d);
const int64_t dst_z = p_in_batch / (dst_w * dst_h);
const int64_t dst_y = (p_in_batch % (dst_w * dst_h)) / dst_w;
const int64_t dst_x = p_in_batch % dst_w;
const int64_t ocn_idx = batch_idx * oc + out_ch;
const int64_t dst_offset = dst_x * dst_nb0 + dst_y * dst_nb1 + dst_z * dst_nb2 + ocn_idx * dst_nb3;
// `src` is a column-major (m x n) GEMM output where m == patch_total, n == oc.
// GEMM stores element (row, col) at index `row + col*m`, so compute index accordingly.
const int64_t src_index = patch_idx + out_ch * patch_total;
const float value = src[src_index];
*(float *)((char *)dst_data + dst_offset) = value;
});
}
void ggml_sycl_op_conv_3d(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
scope_op_debug_print scope_dbg_print(__func__, dst, /*num_src=*/2);
const ggml_tensor * src0 = dst->src[0];
const ggml_tensor * src1 = dst->src[1];
GGML_ASSERT(src0->type == GGML_TYPE_F16 || src0->type == GGML_TYPE_F32);
GGML_ASSERT(src1->type == GGML_TYPE_F32);
GGML_ASSERT(dst->type == GGML_TYPE_F32);
GGML_ASSERT(ggml_is_contiguous(src0));
GGML_ASSERT(ggml_is_contiguous(src1));
const int32_t * opts = (const int32_t *) dst->op_params;
const int32_t s0 = opts[0];
const int32_t s1 = opts[1];
const int32_t s2 = opts[2];
const int32_t p0 = opts[3];
const int32_t p1 = opts[4];
const int32_t p2 = opts[5];
const int32_t d0 = opts[6];
const int32_t d1 = opts[7];
const int32_t d2 = opts[8];
const int32_t c = opts[9];
const int32_t n = opts[10];
const int32_t oc = opts[11];
const int64_t knl_w = src0->ne[0];
const int64_t knl_h = src0->ne[1];
const int64_t knl_d = src0->ne[2];
const int64_t patch_total = ggml_sycl_conv3d_calc_patch_total(dst, n);
const int64_t knl_n_total = ggml_sycl_conv3d_calc_knl_n_total(src0, c);
const size_t kernel_type_size = ggml_element_size(src0);
ggml_sycl_pool_alloc<float> gemm_output(ctx.pool());
gemm_output.alloc((size_t) patch_total * oc);
ggml_tensor dst_mat = {};
dst_mat.type = GGML_TYPE_F32;
dst_mat.ne[0] = patch_total;
dst_mat.ne[1] = oc;
dst_mat.ne[2] = 1;
dst_mat.ne[3] = 1;
dst_mat.nb[0] = sizeof(float);
dst_mat.nb[1] = dst_mat.nb[0] * dst_mat.ne[0];
dst_mat.nb[2] = dst_mat.nb[1];
dst_mat.nb[3] = dst_mat.nb[2];
dst_mat.data = gemm_output.get();
dst_mat.buffer = dst->buffer;
dst_mat.extra = dst->extra;
dpct::queue_ptr stream = ctx.stream();
// allocate packed arrays: A_packed (k x m), B_packed (k x n)
ggml_sycl_pool_alloc<float> A_packed_alloc(ctx.pool());
ggml_sycl_pool_alloc<float> B_packed_alloc(ctx.pool());
// alloc() takes an element count (as for gemm_output above), not a byte count
A_packed_alloc.alloc((size_t) knl_n_total * patch_total);
B_packed_alloc.alloc((size_t) knl_n_total * oc);
float * A_packed = A_packed_alloc.get();
float * B_packed = B_packed_alloc.get();
const int m = (int) patch_total;
const int n_gemm = (int) oc;
const int k = (int) knl_n_total;
// Two packing kernels follow: the first runs im2col on src1 into A_packed, the second converts the weights into B_packed
const char * src1_base = (const char *) src1->data;
const int64_t src1_nb0 = src1->nb[0];
const int64_t src1_nb1 = src1->nb[1];
const int64_t src1_nb2 = src1->nb[2];
const int64_t src1_nb3 = src1->nb[3];
// Compute correct strides for src0 as (knl_n_total, oc) matrix
const int64_t src0_packed_nb0 = kernel_type_size;
const int64_t src0_packed_nb1 = kernel_type_size * knl_n_total;
const int64_t KW = knl_w;
const int64_t KH = knl_h;
const int64_t KD = knl_d;
const int64_t PW = dst->ne[0];
const int64_t PH = dst->ne[1];
const int64_t PD = dst->ne[2];
// Pack A (with inline im2col): for each (row, col) in k x m matrix
const int64_t A_total = (int64_t)k * m;
const int64_t A_block_size = 256;
const int64_t A_num_work = ((A_total + A_block_size - 1) / A_block_size) * A_block_size;
// Copy the source extents to locals: the device lambda must not dereference the host-side tensor struct
const int64_t src1_ne0 = src1->ne[0];
const int64_t src1_ne1 = src1->ne[1];
const int64_t src1_ne2 = src1->ne[2];
stream->parallel_for(sycl::range<1>(A_num_work), [=](sycl::id<1> id) {
const int64_t t = id[0];
if (t >= A_total) return;
const int64_t row = t % k;
const int64_t col = t / k;
// Inline im2col for this element
const int64_t k_index = row;
const int64_t patch_idx = col;
const int64_t ic = k_index / (KD * KH * KW);
const int64_t rem = k_index - ic * (KD * KH * KW);
const int64_t kz = rem / (KH * KW);
const int64_t rem2 = rem - kz * (KH * KW);
const int64_t ky = rem2 / KW;
const int64_t kx = rem2 % KW;
const int64_t p_in_batch = patch_idx % (PW * PH * PD);
const int64_t batch_idx = patch_idx / (PW * PH * PD);
const int64_t dst_z = p_in_batch / (PW * PH);
const int64_t dst_y = (p_in_batch % (PW * PH)) / PW;
const int64_t dst_x = p_in_batch % PW;
const int64_t sx = dst_x * s0 + kx * d0 - p0;
const int64_t sy = dst_y * s1 + ky * d1 - p1;
const int64_t sz = dst_z * s2 + kz * d2 - p2;
float val = 0.0f;
if (sx >= 0 && sx < src1_ne0 && sy >= 0 && sy < src1_ne1 && sz >= 0 && sz < src1_ne2) {
const int64_t channel_idx = batch_idx * c + ic;
const char * ptr = src1_base + sx * src1_nb0 + sy * src1_nb1 + sz * src1_nb2 + channel_idx * src1_nb3;
val = *(const float *) ptr;
}
A_packed[row + col * (int64_t)k] = val;
});
// Pack B: for each (row, col) in k x n_gemm matrix
const int64_t B_total = (int64_t)k * n_gemm;
const int64_t B_block_size = 256;
const int64_t B_num_work = ((B_total + B_block_size - 1) / B_block_size) * B_block_size;
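// Capture the weight pointer and type by value: the device lambda must not dereference the host-side tensor struct
const char * src0_base = (const char *) src0->data;
const bool src0_is_f32 = src0->type == GGML_TYPE_F32;
stream->parallel_for(sycl::range<1>(B_num_work), [=](sycl::id<1> id) {
const int64_t t = id[0];
if (t >= B_total) return;
const int64_t row = t % k;
const int64_t col = t / k;
const char * src_ptr = src0_base + row * src0_packed_nb0 + col * src0_packed_nb1;
float v;
if (src0_is_f32) {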
v = *(const float *) src_ptr;
} else {
v = sycl::vec<sycl::half, 1>(*(const sycl::half *) src_ptr).convert<float, sycl::rounding_mode::automatic>()[0];
}
B_packed[row + col * (int64_t)k] = v;
});
// GEMM: C = A^T * B where A is (k x m), B is (k x n), C is (m x n)
const float alpha = 1.0f;
const float beta = 0.0f;
const int lda = k;
const int ldb = k;
const int ldc = m;
SYCL_CHECK(CHECK_TRY_ERROR(oneapi::mkl::blas::column_major::gemm(
*stream, oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans,
m, n_gemm, k,
dpct::get_value(&alpha, *stream),
(const float *) A_packed, lda,
(const float *) B_packed, ldb,
dpct::get_value(&beta, *stream),
(float *) dst_mat.data, ldc)));
const float * gemm_data = (const float *) dst_mat.data;
float * dst_data = (float *) dst->data;
ggml_sycl_conv3d_write_output(dst, gemm_data, dst_data, patch_total, oc,
dst->ne[0], dst->ne[1], dst->ne[2], stream);
}
+8
@@ -0,0 +1,8 @@
#ifndef GGML_SYCL_CONV3D_HPP
#define GGML_SYCL_CONV3D_HPP
#include "common.hpp"
void ggml_sycl_op_conv_3d(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
#endif // GGML_SYCL_CONV3D_HPP
+6
@@ -642,6 +642,8 @@ static void convert_unary_sycl(const void * vx, dst_t * y, const int64_t k, dpct
to_fp16_sycl_t ggml_get_to_fp16_sycl(ggml_type type, ggml_tensor * dst) {
switch (type) {
case GGML_TYPE_Q1_0:
return dequantize_block_sycl<QK1_0, QR1_0, dequantize_q1_0>;
case GGML_TYPE_Q4_0:
if (dst->src[0]->extra &&
((ggml_tensor_extra_gpu*)dst->src[0]->extra)->optimized_feature.reorder) {
@@ -724,6 +726,8 @@ to_fp16_sycl_t ggml_get_to_fp16_sycl(ggml_type type, ggml_tensor * dst) {
to_fp32_sycl_t ggml_get_to_fp32_sycl(ggml_type type, ggml_tensor *dst) {
switch (type) {
case GGML_TYPE_Q1_0:
return dequantize_block_sycl<QK1_0, QR1_0, dequantize_q1_0>;
case GGML_TYPE_Q4_0:
if (dst->src[0]->extra &&
((ggml_tensor_extra_gpu*)dst->src[0]->extra)->optimized_feature.reorder) {
@@ -830,6 +834,8 @@ to_fp16_nc_sycl_t ggml_get_to_fp16_nc_sycl(ggml_type type) {
case GGML_TYPE_BF16:
return convert_unary_nc_sycl<sycl::ext::oneapi::bfloat16>;
#endif
case GGML_TYPE_Q1_0:
return dequantize_block_nc_sycl<QK1_0, QR1_0, dequantize_q1_0>;
case GGML_TYPE_Q4_0:
return dequantize_block_nc_sycl<QK4_0, QR4_0, dequantize_q4_0>;
case GGML_TYPE_Q4_1:
+15
@@ -70,6 +70,21 @@ static __dpct_inline__ void dequantize_q4_0_reorder(const void *d_ptr, const int
#endif // GGML_SYCL_F16
}
static __dpct_inline__ void dequantize_q1_0_reorder(const void *d_ptr, const int64_t ib, const void *qs,
const int iqs, dfloat2 &v) {
// Q1_0 reorder layout: scale values followed by quantized bits
const dfloat d = (const dfloat)*((const sycl::half*)d_ptr+ib);
const int bit_index_0 = iqs + 0;
const int bit_index_1 = iqs + 1;
const int bit_0 = (*((const uint8_t *)qs + bit_index_0 / 8) >> (bit_index_0 % 8)) & 1;
const int bit_1 = (*((const uint8_t *)qs + bit_index_1 / 8) >> (bit_index_1 % 8)) & 1;
v.x() = (2 * bit_0 - 1) * d;
v.y() = (2 * bit_1 - 1) * d;
}
static __dpct_inline__ void dequantize_q4_1(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
const block_q4_1 * x = (const block_q4_1 *) vx;
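The `dequantize_q1_0_reorder` function added in the hunk above maps each stored bit to plus or minus the block scale. A host-side sketch of that mapping for a single bit follows. It uses `float` where the backend uses `dfloat`, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bits are packed least-significant first within each byte.
// A set bit decodes to +d, a clear bit to -d.
float dequantize_q1_0_bit(const uint8_t * qs, int bit_index, float d) {
    const int bit = (qs[bit_index / 8] >> (bit_index % 8)) & 1;
    return (2 * bit - 1) * d;
}
```

The SYCL version decodes two neighbouring bits per call (`iqs` and `iqs + 1`) into the two halves of a `dfloat2`, which corresponds to calling this helper twice with consecutive indices.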
+53
@@ -1423,6 +1423,50 @@ static void dequantize_mul_mat_vec_q4_0_sycl(const void *vx, const dfloat *y,
}
}
static void dequantize_mul_mat_vec_q1_0_sycl_reorder(const void *vx, const dfloat *y,
float *dst, const int ncols,
const int nrows,
dpct::queue_ptr stream) {
GGML_ASSERT(ncols % GGML_SYCL_DMMV_X == 0);
const int block_num_y = (nrows + GGML_SYCL_MMV_Y - 1) / GGML_SYCL_MMV_Y;
// the number of rows may exceed maximum grid size in the y or z dimensions, use the x dimension instead
const sycl::range<3> block_nums(1, 1, block_num_y);
const sycl::range<3> block_dims(1, GGML_SYCL_MMV_Y, WARP_SIZE);
{
dpct::has_capability_or_fail(stream->get_device(),
{sycl::aspect::fp16});
stream->parallel_for(
sycl::nd_range<3>(block_nums * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
dequantize_mul_mat_vec_reorder<QK1_0, QR1_0, dequantize_q1_0_reorder>(
vx, y, dst, ncols, nrows, item_ct1);
});
}
}
static void dequantize_mul_mat_vec_q1_0_sycl(const void *vx, const dfloat *y,
float *dst, const int ncols,
const int nrows,
dpct::queue_ptr stream) {
GGML_ASSERT(ncols % GGML_SYCL_DMMV_X == 0);
const int block_num_y = (nrows + GGML_SYCL_MMV_Y - 1) / GGML_SYCL_MMV_Y;
// the number of rows may exceed maximum grid size in the y or z dimensions, use the x dimension instead
const sycl::range<3> block_nums(1, 1, block_num_y);
const sycl::range<3> block_dims(1, GGML_SYCL_MMV_Y, WARP_SIZE);
{
dpct::has_capability_or_fail(stream->get_device(),
{sycl::aspect::fp16});
stream->parallel_for(
sycl::nd_range<3>(block_nums * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
dequantize_mul_mat_vec<QK1_0, QR1_0, dequantize_q1_0>(
vx, y, dst, ncols, nrows, item_ct1);
});
}
}
static void dequantize_mul_mat_vec_q4_1_sycl(const void *vx, const dfloat *y,
float *dst, const int ncols,
const int nrows,
@@ -1759,6 +1803,7 @@ void ggml_sycl_op_dequantize_mul_mat_vec(
sycl::half *src1_dfloat = nullptr; // dfloat == half
bool src1_convert_f16 =
src0->type == GGML_TYPE_Q1_0 ||
src0->type == GGML_TYPE_Q4_0 || src0->type == GGML_TYPE_Q4_1 ||
src0->type == GGML_TYPE_Q5_0 || src0->type == GGML_TYPE_Q5_1 ||
src0->type == GGML_TYPE_Q8_0 || src0->type == GGML_TYPE_F16 ||
@@ -1777,6 +1822,14 @@ void ggml_sycl_op_dequantize_mul_mat_vec(
#endif // GGML_SYCL_F16
switch (src0->type) {
case GGML_TYPE_Q1_0:
if ((ggml_tensor_extra_gpu*)dst->src[0]->extra &&
((ggml_tensor_extra_gpu*)dst->src[0]->extra)->optimized_feature.reorder) {
dequantize_mul_mat_vec_q1_0_sycl_reorder(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
} else {
dequantize_mul_mat_vec_q1_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
}
break;
case GGML_TYPE_Q4_0:
if ((ggml_tensor_extra_gpu*)dst->src[0]->extra &&
((ggml_tensor_extra_gpu*)dst->src[0]->extra)->optimized_feature.reorder) {
+15 -7
@@ -13,14 +13,14 @@
#ifndef GGML_SYCL_DPCT_HELPER_HPP
#define GGML_SYCL_DPCT_HELPER_HPP
#include <cstdlib>
#include <iostream>
#include <map>
#include <sycl/sycl.hpp>
#include <sycl/half_type.hpp>
#include <oneapi/mkl.hpp>
#include <map>
#include "ggml.h"
#if defined(__linux__)
#include <sys/mman.h>
#elif defined(_WIN64)
@@ -43,6 +43,7 @@
#include <windows.h>
#endif
#define DPCT_COMPATIBILITY_TEMP (900)
#if defined(_MSC_VER)
@@ -59,6 +60,13 @@
#define __dpct_noinline__ __attribute__((noinline))
#endif
#define DPCT_UNUSED(x) (void)(x)
inline void _abort(const char * str) {
std::cerr << str << std::endl;
std::abort();
}
inline std::string get_device_type_name(const sycl::device &Device) {
auto DeviceType = Device.get_info<sycl::info::device::device_type>();
switch (DeviceType) {
@@ -1017,7 +1025,7 @@ namespace dpct
if (backend == "opencl:cpu") return 4;
if (backend == "opencl:acc") return 5;
printf("convert_backend_index: can't handle backend=%s\n", backend.c_str());
GGML_ABORT("fatal error");
_abort("fatal error");
}
static bool compare_backend(std::string &backend1, std::string &backend2) {
return convert_backend_index(backend1) < convert_backend_index(backend2);
@@ -1426,7 +1434,7 @@ namespace dpct
if (!size)
return sycl::event{};
return q.memcpy(to_ptr, from_ptr, size, dep_events);
GGML_UNUSED(direction);
DPCT_UNUSED(direction);
}
// Get actual copy range and make sure it will not exceed range.
@@ -2092,7 +2100,7 @@ namespace dpct
if (!size)
return sycl::event{};
return q.memcpy(to_ptr, from_ptr, size, dep_events);
GGML_UNUSED(direction);
DPCT_UNUSED(direction);
}
// Get actual copy range and make sure it will not exceed range.
+102 -38
@@ -32,7 +32,7 @@
#include <sycl/sycl.hpp>
#include <sycl/backend.hpp>
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO_API
#include <level_zero/ze_api.h>
#endif
#if defined(GGML_SYCL_GRAPH) && SYCL_EXT_ONEAPI_ASYNC_MEMORY_ALLOC
@@ -62,6 +62,9 @@
#include "ggml-sycl/repeat_back.hpp"
#include "ggml-sycl/set_rows.hpp"
#include "ggml-sycl/set.hpp"
#include "ggml-sycl/conv2d.hpp"
#include "ggml-sycl/conv2d-dw.hpp"
#include "ggml-sycl/conv2d-transpose.hpp"
#include "ggml-sycl/ssm_conv.hpp"
#include "ggml-sycl/sycl_hw.hpp"
#include "ggml-sycl/ssm_scan.hpp"
@@ -84,8 +87,9 @@ int g_ggml_sycl_enable_vmm = 1;
int g_ggml_sycl_prioritize_dmmv = 0;
int g_ggml_sycl_use_async_mem_op = 0;
int g_ggml_sycl_use_async_mem_op_requested = 1;
int g_ggml_sycl_enable_level_zero = 0;
int g_ggml_sycl_use_level_zero_api = 0;
int g_ggml_sycl_enable_flash_attention = 1;
int g_ggml_sycl_dev2dev_memcpy = DEV2DEV_MEMCPY_SYCL;
int g_ggml_sycl_usm_system = 0;
static ggml_sycl_device_info ggml_sycl_init() {
@@ -153,7 +157,7 @@ static ggml_sycl_device_info ggml_sycl_init() {
info.ext_oneapi_level_zero = false;
}
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO_API
if (info.ext_oneapi_level_zero && device.is_gpu() && device.default_queue().get_backend() == sycl::backend::ext_oneapi_level_zero) {
ze_device_handle_t ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(device.default_queue().get_device());
ze_device_properties_t props = {};
@@ -168,13 +172,13 @@ static ggml_sycl_device_info ggml_sycl_init() {
info.default_tensor_split[id] /= total_vram;
}
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO_API
// Large buffers can be allocated before ggml_check_sycl() initializes other
// g_ggml_sycl_enable_* globals, so initialize this one as early as we can.
g_ggml_sycl_enable_level_zero =
info.ext_oneapi_level_zero && ggml_sycl_get_env("GGML_SYCL_ENABLE_LEVEL_ZERO", 1);
g_ggml_sycl_use_level_zero_api =
info.ext_oneapi_level_zero && ggml_sycl_get_env("GGML_SYCL_USE_LEVEL_ZERO_API", 1);
#else
g_ggml_sycl_enable_level_zero = 0;
g_ggml_sycl_use_level_zero_api = 0;
#endif
return info;
@@ -272,6 +276,11 @@ static void ggml_check_sycl() try {
g_ggml_sycl_enable_vmm = ggml_sycl_get_env("GGML_SYCL_ENABLE_VMM", 1);
g_ggml_sycl_prioritize_dmmv = ggml_sycl_get_env("GGML_SYCL_PRIORITIZE_DMMV", 0);
g_ggml_sycl_dev2dev_memcpy = ggml_sycl_get_env("GGML_SYCL_DEV2DEV_MEMCPY", DEV2DEV_MEMCPY_SYCL);
if (g_ggml_sycl_use_level_zero_api == 0) {
g_ggml_sycl_dev2dev_memcpy = DEV2DEV_MEMCPY_SYCL;
}
#ifdef SYCL_FLASH_ATTN
g_ggml_sycl_enable_flash_attention = ggml_sycl_get_env("GGML_SYCL_ENABLE_FLASH_ATTN", 1);
#else
@@ -303,10 +312,10 @@ static void ggml_check_sycl() try {
#else
GGML_LOG_INFO(" GGML_SYCL_DNNL: no\n");
#endif
#if defined(GGML_SYCL_SUPPORT_LEVEL_ZERO)
GGML_LOG_INFO(" GGML_SYCL_SUPPORT_LEVEL_ZERO: yes\n");
#if defined(GGML_SYCL_SUPPORT_LEVEL_ZERO_API)
GGML_LOG_INFO(" GGML_SYCL_SUPPORT_LEVEL_ZERO_API: yes\n");
#else
GGML_LOG_INFO(" GGML_SYCL_SUPPORT_LEVEL_ZERO: no\n");
GGML_LOG_INFO(" GGML_SYCL_SUPPORT_LEVEL_ZERO_API: no\n");
#endif
#if defined(GGML_SYCL_USE_VMM)
GGML_LOG_INFO(" GGML_SYCL_USE_VMM: yes\n");
@@ -322,10 +331,13 @@ static void ggml_check_sycl() try {
#else
GGML_LOG_INFO(" GGML_SYCL_DISABLE_GRAPH: graph disabled by compile flag\n");
#endif
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
GGML_LOG_INFO(" GGML_SYCL_ENABLE_LEVEL_ZERO: %d\n", g_ggml_sycl_enable_level_zero);
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO_API
GGML_LOG_INFO(" GGML_SYCL_USE_LEVEL_ZERO_API: %d\n", g_ggml_sycl_use_level_zero_api);
GGML_LOG_INFO(" GGML_SYCL_DEV2DEV_MEMCPY: %d\n", g_ggml_sycl_dev2dev_memcpy);
#else
GGML_LOG_INFO(" GGML_SYCL_ENABLE_LEVEL_ZERO: Level Zero disabled by compile flag\n");
GGML_LOG_INFO(" GGML_SYCL_USE_LEVEL_ZERO_API: Level Zero API usage disabled by compile flag\n");
GGML_LOG_INFO(" GGML_SYCL_DEV2DEV_MEMCPY: %d, falls back to the SYCL API because GGML_SYCL_SUPPORT_LEVEL_ZERO_API is not defined\n",
g_ggml_sycl_dev2dev_memcpy);
#endif
#if GGML_SYCL_DNNL
GGML_LOG_INFO(" GGML_SYCL_DISABLE_DNN: %d\n", g_ggml_sycl_disable_dnn);
@@ -590,7 +602,7 @@ catch (sycl::exception const &exc) {
std::exit(1);
}
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO_API
static bool ggml_sycl_is_l0_discrete_gpu(int device) {
return ggml_sycl_info().devices[device].l0_discrete_gpu;
}
@@ -598,27 +610,42 @@ static bool ggml_sycl_is_l0_discrete_gpu(int device) {
static void dev2dev_memcpy(int device_dst, sycl::queue &q_dst, int device_src, sycl::queue &q_src, void *ptr_dst,
const void *ptr_src, size_t size) {
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
// Use Level Zero direct copy for dGPU-to-dGPU transfers.
const bool l0_copy_supported = g_ggml_sycl_enable_level_zero &&
ggml_sycl_is_l0_discrete_gpu(device_dst) && ggml_sycl_is_l0_discrete_gpu(device_src);
if (l0_copy_supported) {
auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q_dst.get_context());
auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q_dst.get_device());
ze_command_queue_desc_t cq_desc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC, nullptr, 0, 0,
0, ZE_COMMAND_QUEUE_MODE_SYNCHRONOUS, ZE_COMMAND_QUEUE_PRIORITY_NORMAL};
ze_command_list_handle_t cl;
ze_result_t r = zeCommandListCreateImmediate(ze_ctx, ze_dev, &cq_desc, &cl);
if (r == ZE_RESULT_SUCCESS) {
r = zeCommandListAppendMemoryCopy(cl, ptr_dst, ptr_src, size, nullptr, 0, nullptr);
zeCommandListDestroy(cl);
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO_API
if (g_ggml_sycl_dev2dev_memcpy == DEV2DEV_MEMCPY_L0) {
// Use Level Zero direct copy for dGPU-to-dGPU transfers.
const bool l0_copy_supported =
ggml_sycl_is_l0_discrete_gpu(device_dst) && ggml_sycl_is_l0_discrete_gpu(device_src);
if (g_ggml_sycl_use_level_zero_api && l0_copy_supported) {
auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q_dst.get_context());
auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q_dst.get_device());
ze_command_queue_desc_t cq_desc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC, nullptr, 0, 0,
0, ZE_COMMAND_QUEUE_MODE_SYNCHRONOUS, ZE_COMMAND_QUEUE_PRIORITY_NORMAL};
ze_command_list_handle_t cl;
ze_result_t r = zeCommandListCreateImmediate(ze_ctx, ze_dev, &cq_desc, &cl);
if (r == ZE_RESULT_SUCCESS) {
return;
GGML_SYCL_DEBUG("[SYCL] dev2dev memcpy by L0\n");
r = zeCommandListAppendMemoryCopy(cl, ptr_dst, ptr_src, size, nullptr, 0, nullptr);
zeCommandListDestroy(cl);
if (r == ZE_RESULT_SUCCESS) {
return;
}
}
}
}
#endif
if (g_ggml_sycl_dev2dev_memcpy == DEV2DEV_MEMCPY_SYCL) {
if (q_dst.get_device().ext_oneapi_can_access_peer(q_src.get_device(),
sycl::ext::oneapi::peer_access::access_supported)) {
GGML_SYCL_DEBUG("[SYCL] dev2dev memcpy by SYCL\n");
SYCL_CHECK(CHECK_TRY_ERROR(q_dst.memcpy(ptr_dst, ptr_src, size).wait()));
return;
}
}
// Host-staged copy
GGML_SYCL_DEBUG("[SYCL] dev2dev memcpy by host forward\n");
char *host_buf = (char *)malloc(size);
q_src.memcpy(host_buf, (const char *)ptr_src, size).wait();
q_dst.memcpy((char *)ptr_dst, host_buf, size).wait();
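Note on the copy strategy: after this change `dev2dev_memcpy` appears to try up to three paths in order: a Level Zero immediate command list, a direct SYCL peer copy, and finally a host-staged copy. The sketch below models only the selection logic as read from the new lines of this hunk. The numeric values of the two mode constants and the `l0_ok` flag (standing for both Level Zero calls returning success) are assumptions, since the enum definition is not part of this diff.

```python
DEV2DEV_MEMCPY_SYCL = 0  # assumed value; the real definition is not shown here
DEV2DEV_MEMCPY_L0 = 1    # assumed value

def choose_copy_path(mode, use_l0_api, both_l0_discrete, l0_ok, peer_access):
    """Return which path a device-to-device copy would take.

    mode             -- value of GGML_SYCL_DEV2DEV_MEMCPY
    use_l0_api       -- g_ggml_sycl_use_level_zero_api
    both_l0_discrete -- source and destination are Level Zero discrete GPUs
    l0_ok            -- hypothetical flag: the Level Zero calls succeeded
    peer_access      -- ext_oneapi_can_access_peer reported support
    """
    if mode == DEV2DEV_MEMCPY_L0 and use_l0_api and both_l0_discrete and l0_ok:
        return "level_zero"
    if mode == DEV2DEV_MEMCPY_SYCL and peer_access:
        return "sycl_peer"
    # Everything else is staged through a host buffer.
    return "host_staged"
```

One consequence visible in this model: when the Level Zero mode is selected but the copy cannot use it, the code goes straight to the host-staged copy rather than trying the SYCL peer path.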
@@ -949,6 +976,7 @@ static int64_t get_row_rounding(ggml_type type, const std::array<float, GGML_SYC
}
switch(type) {
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
return max_compute_capability >= VER_GEN9 ? 128 : 64;
@@ -3480,6 +3508,7 @@ inline bool ggml_sycl_supports_mmq(enum ggml_type type) {
inline bool ggml_sycl_supports_reorder_mul_mat_sycl(enum ggml_type type) {
switch (type) {
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
return true;
@@ -3495,6 +3524,7 @@ inline bool ggml_sycl_supports_reorder_mul_mat_sycl(enum ggml_type type) {
inline bool ggml_sycl_supports_reorder_dmmv(enum ggml_type type) {
switch (type) {
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
return true;
@@ -3505,6 +3535,7 @@ inline bool ggml_sycl_supports_reorder_dmmv(enum ggml_type type) {
inline bool ggml_sycl_supports_reorder_mmvq(enum ggml_type type) {
switch (type) {
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q3_K:
@@ -3519,6 +3550,7 @@ inline bool ggml_sycl_supports_reorder_mmvq(enum ggml_type type) {
static bool ggml_sycl_supports_dmmv(enum ggml_type type) {
switch (type) {
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
@@ -4572,6 +4604,11 @@ static void ggml_sycl_im2col_3d(ggml_backend_sycl_context & ctx, ggml_tensor * d
ggml_sycl_op_im2col_3d(ctx, dst);
}
static void ggml_sycl_conv_3d(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
scope_op_debug_print scope_dbg_print(__func__, dst, /*num_src=*/2);
ggml_sycl_op_conv_3d(ctx, dst);
}
static void ggml_sycl_sum(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
scope_op_debug_print scope_dbg_print(__func__, dst, /*num_src=*/1);
GGML_ASSERT(ggml_is_contiguous(dst->src[0]));
@@ -4635,9 +4672,21 @@ static bool ggml_sycl_compute_forward(ggml_backend_sycl_context & ctx, struct gg
case GGML_OP_ARGMAX:
ggml_sycl_argmax(ctx, dst);
break;
case GGML_OP_CONV_2D:
ggml_sycl_op_conv2d(ctx, dst);
break;
case GGML_OP_CONV_2D_DW:
ggml_sycl_op_conv2d_dw(ctx, dst);
break;
case GGML_OP_CONV_3D:
ggml_sycl_conv_3d(ctx, dst);
break;
case GGML_OP_CONV_TRANSPOSE_1D:
ggml_sycl_op_conv_transpose_1d(ctx, dst);
break;
case GGML_OP_CONV_TRANSPOSE_2D:
ggml_sycl_op_conv2d_transpose(ctx, dst);
break;
case GGML_OP_REPEAT:
ggml_sycl_repeat(ctx, dst);
break;
@@ -5341,7 +5390,7 @@ static ggml_backend_buffer_t ggml_backend_sycl_device_buffer_from_host_ptr(ggml_
return nullptr;
}
static bool ggml_backend_sycl_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
static bool do_ggml_backend_sycl_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
ggml_backend_sycl_device_context *sycl_ctx =
(ggml_backend_sycl_device_context *)dev->context;
int device = sycl_ctx->device;
@@ -5355,6 +5404,10 @@ static bool ggml_backend_sycl_device_supports_op(ggml_backend_dev_t dev, const g
}
return false;
}
case GGML_OP_CONV_2D:
case GGML_OP_CONV_2D_DW:
case GGML_OP_CONV_TRANSPOSE_2D:
return true;
case GGML_OP_UNARY:
switch (ggml_get_unary_op(op)) {
case GGML_UNARY_OP_SGN:
@@ -5402,19 +5455,12 @@ static bool ggml_backend_sycl_device_supports_op(ggml_backend_dev_t dev, const g
struct ggml_tensor * a = op->src[0];
struct ggml_tensor * b = op->src[1];
// disable Q1_0 until implementation
if (a->type == GGML_TYPE_Q1_0 || b->type == GGML_TYPE_Q1_0) {
return false;
}
if (a->ne[3] != b->ne[3]) {
return false;
}
ggml_type src0_type = op->src[0]->type;
// TODO: The configuration below needs more work to be supported with oneDNN
if (ggml_is_permuted(a) && !ggml_is_contiguous(a) &&
a->ne[2] > 1 && a->ne[3] > 1 && src0_type == GGML_TYPE_F16) {
@@ -5424,12 +5470,17 @@ static bool ggml_backend_sycl_device_supports_op(ggml_backend_dev_t dev, const g
// TODO: This specific configuration can fail with oneDNN and needs more debugging
if (!ggml_is_permuted(a) && ggml_is_permuted(b) && b->ne[2] > 1 && b->ne[3] > 1 &&
a->ne[0] > 128 && a->ne[2] == 1 && src0_type == GGML_TYPE_F16) {
return false;
}
return true;
}
case GGML_OP_OUT_PROD:
return op->type == GGML_TYPE_F32 && op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32 && op->ne[2] == 1 && op->ne[3] == 1;
return op->type == GGML_TYPE_F32 &&
(op->src[0]->type == GGML_TYPE_F32 ||
(op->src[0]->type == GGML_TYPE_Q1_0 && op->src[0]->ne[2] == op->src[1]->ne[2] &&
op->src[0]->ne[3] == op->src[1]->ne[3])) &&
op->src[1]->type == GGML_TYPE_F32;
case GGML_OP_GET_ROWS:
{
switch (op->src[0]->type) {
@@ -5615,6 +5666,12 @@ static bool ggml_backend_sycl_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_OP_IM2COL_3D:
case GGML_OP_UPSCALE:
return true;
case GGML_OP_CONV_3D:
return op->type == GGML_TYPE_F32 &&
(op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16) &&
op->src[1]->type == GGML_TYPE_F32 &&
ggml_is_contiguous(op->src[0]) &&
ggml_is_contiguous(op->src[1]);
case GGML_OP_SUM:
case GGML_OP_SUM_ROWS:
case GGML_OP_MEAN:
@@ -5680,6 +5737,13 @@ static bool ggml_backend_sycl_device_supports_op(ggml_backend_dev_t dev, const g
GGML_UNUSED(dev);
}
static bool ggml_backend_sycl_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
bool res = do_ggml_backend_sycl_device_supports_op(dev, op);
GGML_SYCL_DEBUG("[SYCL] call %s op->op=%s op->type=%s -> %s\n", __func__, ggml_op_name(op->op),
ggml_type_name(op->type), res ? "true" : "false");
return res;
}
static bool ggml_backend_sycl_device_supports_buft(ggml_backend_dev_t dev, ggml_backend_buffer_type_t buft) {
if (buft->iface.get_name != ggml_backend_sycl_buffer_type_get_name) {
return false;
+74
@@ -1194,6 +1194,66 @@ static void mul_mat_vec_q8_0_q8_1_sycl_switch_ncols(
}
}
static void mul_mat_vec_q1_0_q8_1_sycl(const void * vx, const void * vy,
float * dst, const int ncols,
const int nrows,
dpct::queue_ptr stream) {
GGML_ASSERT(ncols % QK1_0 == 0);
const int block_num_y = (nrows + GGML_SYCL_MMV_Y - 1) / GGML_SYCL_MMV_Y;
const sycl::range<3> block_nums(1, 1, block_num_y);
const sycl::range<3> block_dims(1, GGML_SYCL_MMV_Y, WARP_SIZE);
stream->submit([&](sycl::handler & cgh) {
cgh.parallel_for(
sycl::nd_range<3>(block_nums * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
mul_mat_vec_q<QK1_0, QI1_0, block_q1_0,
VDR_Q1_0_Q8_1_MMVQ, vec_dot_q1_0_q8_1>(
vx, vy, dst, ncols, nrows, item_ct1);
});
});
}
template <int ncols_dst>
static void mul_mat_vec_q1_0_q8_1_sycl_ncols(
const void * vx, const void * vy, float * dst,
const int ncols, const int nrows,
const int stride_col_y, const int stride_col_dst,
dpct::queue_ptr stream) {
GGML_ASSERT(ncols % QK1_0 == 0);
const int block_num_y = (nrows + GGML_SYCL_MMV_Y - 1) / GGML_SYCL_MMV_Y;
const sycl::range<3> block_nums(1, 1, block_num_y);
const sycl::range<3> block_dims(1, GGML_SYCL_MMV_Y, WARP_SIZE);
stream->submit([&](sycl::handler & cgh) {
cgh.parallel_for(
sycl::nd_range<3>(block_nums * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
mul_mat_vec_q_ncols<QK1_0, QI1_0, block_q1_0,
VDR_Q1_0_Q8_1_MMVQ, vec_dot_q1_0_q8_1, ncols_dst>(
vx, vy, dst, ncols, nrows, stride_col_y, stride_col_dst, item_ct1);
});
});
}
static void mul_mat_vec_q1_0_q8_1_sycl_switch_ncols(
const void * vx, const void * vy, float * dst,
const int ncols, const int nrows, const int ncols_dst,
const int stride_col_y, const int stride_col_dst,
dpct::queue_ptr stream) {
switch (ncols_dst) {
case 1: mul_mat_vec_q1_0_q8_1_sycl(vx, vy, dst, ncols, nrows, stream); break;
case 2: mul_mat_vec_q1_0_q8_1_sycl_ncols<2>(vx, vy, dst, ncols, nrows, stride_col_y, stride_col_dst, stream); break;
case 3: mul_mat_vec_q1_0_q8_1_sycl_ncols<3>(vx, vy, dst, ncols, nrows, stride_col_y, stride_col_dst, stream); break;
case 4: mul_mat_vec_q1_0_q8_1_sycl_ncols<4>(vx, vy, dst, ncols, nrows, stride_col_y, stride_col_dst, stream); break;
case 5: mul_mat_vec_q1_0_q8_1_sycl_ncols<5>(vx, vy, dst, ncols, nrows, stride_col_y, stride_col_dst, stream); break;
case 6: mul_mat_vec_q1_0_q8_1_sycl_ncols<6>(vx, vy, dst, ncols, nrows, stride_col_y, stride_col_dst, stream); break;
case 7: mul_mat_vec_q1_0_q8_1_sycl_ncols<7>(vx, vy, dst, ncols, nrows, stride_col_y, stride_col_dst, stream); break;
case 8: mul_mat_vec_q1_0_q8_1_sycl_ncols<8>(vx, vy, dst, ncols, nrows, stride_col_y, stride_col_dst, stream); break;
default: GGML_ABORT("unsupported ncols_dst=%d for Q1_0 multi-col MMVQ", ncols_dst);
}
}
static void mul_mat_vec_q2_K_q8_1_sycl(const void *vx, const void *vy,
float *dst, const int ncols,
const int nrows,
@@ -2120,6 +2180,20 @@ void ggml_sycl_op_mul_mat_vec_q(ggml_backend_sycl_context & ctx, const ggml_tens
mul_mat_vec_q8_0_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff, stream);
}
break;
case GGML_TYPE_Q1_0:
if (i == 0 && src1_ncols > 1 && src1_ncols <= 8) {
const int stride_col_y = src1_padded_col_size / QK8_1;
const int stride_col_dst = dst->ne[0];
GGML_SYCL_DEBUG("Calling mul_mat_vec_q1_0_q8_1_sycl_switch_ncols ncols=%d\n", (int)src1_ncols);
mul_mat_vec_q1_0_q8_1_sycl_switch_ncols(
src0_dd_i, src1_ddq_i, dst_dd_i, ne00, row_diff,
src1_ncols, stride_col_y, stride_col_dst, stream);
return;
} else if (i == 0 || src1_ncols == 1) {
GGML_SYCL_DEBUG("Calling mul_mat_vec_q1_0_q8_1_sycl\n");
mul_mat_vec_q1_0_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff, stream);
}
break;
case GGML_TYPE_Q2_K:
if (i == 0 && src1_ncols > 1 && src1_ncols <= 8) {
const int stride_col_y = src1_padded_col_size / QK8_1;
+45 -9
@@ -1,11 +1,12 @@
#include "outprod.hpp"
#include "convert.hpp"
void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, ggml_tensor* dst) {
scope_op_debug_print scope_dbg_print(__func__, dst, /*num_src=*/2);
const ggml_tensor *src0 = dst->src[0];
const ggml_tensor *src1 = dst->src[1];
GGML_ASSERT(src0->type == GGML_TYPE_F32);
GGML_ASSERT(src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_Q1_0);
GGML_ASSERT(src1->type == GGML_TYPE_F32);
GGML_ASSERT(dst->type == GGML_TYPE_F32);
GGML_ASSERT(ggml_is_contiguous(src0));
@@ -20,11 +21,31 @@ void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, ggml_tensor* dst) {
GGML_ASSERT(ne01 == ne11); // Inner dimensions must match
GGML_ASSERT(ne0 == ne00); // Output rows match src0 rows
GGML_ASSERT(ne1 == ne10); // Output cols match src1 cols
GGML_ASSERT(ne2 == ne12);
GGML_ASSERT(ne3 == ne13);
GGML_ASSERT(ne2 % ne02 == 0);
GGML_ASSERT(ne3 % ne03 == 0);
// Get data pointers
const float* src0_d = (const float*)src0->data;
const float* src1_d = (const float*)src1->data;
float* dst_d = (float*)dst->data;
const float * src0_d = (const float *) src0->data;
const float * src1_d = (const float *) src1->data;
float * dst_d = (float *) dst->data;
ggml_sycl_pool_alloc<float> src0_as_f32(ctx.pool());
int64_t src0_nb02 = nb02;
int64_t src0_nb03 = nb03;
if (src0->type == GGML_TYPE_Q1_0) {
scope_op_debug_print scope_dbg_print(__func__, "/to_fp32_sycl", dst, /*num_src=*/2,
" : converting src0 Q1_0 to fp32");
src0_d = src0_as_f32.alloc(ne00 * ne01 * ne02 * ne03);
const to_fp32_sycl_t to_fp32_sycl = ggml_get_to_fp32_sycl(src0->type, dst);
GGML_ASSERT(to_fp32_sycl != nullptr);
to_fp32_sycl(src0->data, const_cast<float *>(src0_d), ne00 * ne01 * ne02 * ne03, stream);
// Dequantized src0 buffer is contiguous fp32 [ne00, ne01, ne02, ne03].
src0_nb02 = ne00 * ne01 * (int64_t) sizeof(float);
src0_nb03 = ne00 * ne01 * ne02 * (int64_t) sizeof(float);
}
// GEMM parameters
const float alpha = 1.0f;
@@ -35,12 +56,27 @@ void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, ggml_tensor* dst) {
const oneapi::mkl::transpose src1_op = src1_T ? oneapi::mkl::transpose::nontrans : oneapi::mkl::transpose::trans;
const int64_t ldb = (src1_T ? nb10 : nb11) / sizeof(float);
const int64_t r2 = ne2 / ne02;
const int64_t r3 = ne3 / ne03;
try {
// Perform matrix multiplication using oneMKL GEMM
oneapi::mkl::blas::column_major::gemm(*stream, oneapi::mkl::transpose::nontrans, src1_op,
ne0, ne1, ne01, alpha, src0_d, ne00, src1_d, ldb, beta, dst_d, ne0);
}
catch (sycl::exception const& exc) {
// OUT_PROD applies independently to each (i2, i3) destination plane.
for (int64_t i3 = 0; i3 < ne3; ++i3) {
for (int64_t i2 = 0; i2 < ne2; ++i2) {
const int64_t i03 = i3 / r3;
const int64_t i02 = i2 / r2;
const float * src0_plane = (const float *) ((const char *) src0_d + i02 * src0_nb02 + i03 * src0_nb03);
const float * src1_plane = (const float *) ((const char *) src1_d + i2 * nb12 + i3 * nb13);
float * dst_plane = (float *) ((char *) dst_d + i2 * nb2 + i3 * nb3);
// Perform matrix multiplication using oneMKL GEMM
oneapi::mkl::blas::column_major::gemm(*stream, oneapi::mkl::transpose::nontrans, src1_op,
ne0, ne1, ne01, alpha, src0_plane, ne00,
src1_plane, ldb, beta, dst_plane, ne0);
}
}
} catch (sycl::exception const& exc) {
std::cerr << exc.what() << std::endl;
GGML_ASSERT(false);
}
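Note on the plane broadcast: the new loop runs one GEMM per destination plane `(i2, i3)` and maps it back to a source plane of `src0` through the ratios `r2 = ne2 / ne02` and `r3 = ne3 / ne03`. A minimal sketch of that index mapping, using the same names as the diff; the helper function itself is hypothetical and exists only for illustration.

```python
def out_prod_plane_map(ne2, ne3, ne02, ne03):
    """Map each destination plane (i2, i3) to its src0 plane (i02, i03)."""
    # Same preconditions as the GGML_ASSERTs added in the diff.
    assert ne2 % ne02 == 0
    assert ne3 % ne03 == 0
    r2 = ne2 // ne02  # how many dst planes share one src0 plane along dim 2
    r3 = ne3 // ne03  # same along dim 3
    return {
        (i2, i3): (i2 // r2, i3 // r3)
        for i3 in range(ne3)
        for i2 in range(ne2)
    }
```

For `ne2 = 4` and `ne02 = 2`, destination planes 0 and 1 read `src0` plane 0 while planes 2 and 3 read plane 1, so `src0` is broadcast without being copied.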
+35
@@ -309,6 +309,41 @@ vec_dot_q6_K_q8_1_impl_mmvq(const int &vl, const int &vh,
vl, vh, u[0], u[1], scales[0], scales[4], d, d8[0], d8[1]);
}
#define VDR_Q1_0_Q8_1_MMVQ 1
#define VDR_Q1_0_Q8_1_MMQ 4
static __dpct_inline__ float
vec_dot_q1_0_q8_1(const void *__restrict__ vbq,
const block_q8_1 *__restrict__ bq8_1, const int &iqs) {
const block_q1_0 * bq1_0 = (const block_q1_0 *) vbq;
const block_q8_1 * bq8_1_chunk = bq8_1 + iqs;
const float d1 = bq1_0->d;
const int v = get_int_from_uint8_aligned(bq1_0->qs, iqs);
int vi_bytes[8];
#pragma unroll
for (int j = 0; j < 8; ++j) {
const int shift = j * 4;
const int bits4 = (v >> shift) & 0x0F;
const int b0 = (bits4 & 0x01) ? 1 : -1;
const int b1 = (bits4 & 0x02) ? 1 : -1;
const int b2 = (bits4 & 0x04) ? 1 : -1;
const int b3 = (bits4 & 0x08) ? 1 : -1;
vi_bytes[j] = (b0 & 0xFF) | ((b1 & 0xFF) << 8) | ((b2 & 0xFF) << 16) | ((b3 & 0xFF) << 24);
}
int sumi = 0;
#pragma unroll
for (int j = 0; j < 8; ++j) {
const int u = get_int_from_int8_aligned(bq8_1_chunk->qs, j);
sumi = ggml_sycl_dp4a(vi_bytes[j], u, sumi);
}
return d1 * bq8_1_chunk->ds[0] * sumi;
}
// VDR = vec dot ratio, how many contiguous integers each thread processes when the vec dot kernel is called
// MMVQ = mul_mat_vec_q, MMQ = mul_mat_q
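Note on `vec_dot_q1_0_q8_1`: the kernel expands each weight bit into a signed byte (+1 for a set bit, -1 for a clear bit), packs four of those bytes per 32-bit word, and accumulates with `dp4a` against the int8 activations. The Python model below reproduces that arithmetic and checks it against a direct per-bit sum. It is a sketch of the math only: the helper names are hypothetical, `dp4a` here is a plain reference implementation of a four-lane signed int8 dot product, and the block layout (one 32-bit word of weights against 32 activations) is read from the loop bounds in the diff.

```python
def sign_bytes(bits4):
    """Pack four sign bits into one 32-bit word of int8 lanes (+1 or -1)."""
    word = 0
    for k in range(4):
        s = 1 if (bits4 >> k) & 1 else -1
        word |= (s & 0xFF) << (8 * k)  # -1 becomes 0xFF, as in the kernel
    return word

def pack_int8x4(vals):
    """Pack four signed int8 values into one 32-bit word, lowest lane first."""
    word = 0
    for k, v in enumerate(vals):
        word |= (v & 0xFF) << (8 * k)
    return word

def dp4a(a, b, acc):
    """Reference dp4a: dot product of four signed int8 lanes plus acc."""
    def lane(x, k):
        v = (x >> (8 * k)) & 0xFF
        return v - 256 if v >= 128 else v
    return acc + sum(lane(a, k) * lane(b, k) for k in range(4))

def q1_0_dot(bits32, q8, d1, d8):
    """Mirror of the kernel: 8 nibbles of weights against 32 int8 values."""
    sumi = 0
    for j in range(8):
        bits4 = (bits32 >> (4 * j)) & 0x0F
        u = pack_int8x4(q8[4 * j:4 * j + 4])
        sumi = dp4a(sign_bytes(bits4), u, sumi)
    return d1 * d8 * sumi

def q1_0_dot_reference(bits32, q8, d1, d8):
    """Direct definition: each bit contributes +q8[i] or -q8[i]."""
    sumi = sum((1 if (bits32 >> i) & 1 else -1) * q8[i] for i in range(32))
    return d1 * d8 * sumi
```

Both functions agree for any 32-bit pattern, which is the property the packed formulation relies on.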
+1 -1
@@ -3008,13 +3008,13 @@ static vk_buffer ggml_vk_create_buffer(vk_device& device, size_t size, const std
if (memory_type_indices.empty()) {
continue;
}
buf->memory_property_flags = req_flags;
bool done = false;
for (auto mtype_it = memory_type_indices.begin(); mtype_it != memory_type_indices.end(); mtype_it++) {
try {
buf->device_memory = device->device.allocateMemory({ mem_req.size, *mtype_it, &mem_flags_info });
buf->memory_property_flags = mem_props.memoryTypes[*mtype_it].propertyFlags;
done = true;
break;
} catch (const vk::SystemError& e) {
@@ -407,7 +407,7 @@ std::map<std::string, std::string> merge_maps(const std::map<std::string, std::s
return result;
}
static std::vector<std::future<void>> compiles;
static std::deque<std::future<void>> compiles;
void string_to_spv(std::string name, const std::string& source, const std::map<std::string, std::string>& defines, bool fp16 = true, bool coopmat = false, bool coopmat2 = false, bool f16acc = false, const std::string& suffix = "") {
name = name + (f16acc ? "_f16acc" : "") + (coopmat ? "_cm1" : "") + (coopmat2 ? "_cm2" : (fp16 ? "" : "_fp32")) + suffix;
std::string out_path = join_paths(output_dir, name + ".spv");
@@ -426,6 +426,11 @@ void string_to_spv(std::string name, const std::string& source, const std::map<s
string_to_spv_func, name, input_filepath, out_path, defines, coopmat, generate_dep_file, std::move(slot)));
// Don't write the same dep file from multiple processes
generate_dep_file = false;
// Clean up completed futures - don't accumulate virtual memory for completed threads' stacks.
while (!compiles.empty() && compiles.front().wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
compiles.pop_front();
}
}
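Note on the future cleanup: switching `compiles` to a `std::deque` lets the generator pop finished futures from the front after each submission, stopping at the first one that is still running. The sketch below shows the same pattern in Python with `concurrent.futures`. It is an analogue rather than a translation: `Future.done()` plays the role of `wait_for(0) == std::future_status::ready`, and the function names are hypothetical.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def drop_finished_front(pending):
    """Pop already-finished futures from the front of the deque.

    Stops at the first unfinished future, so later finished entries
    stay queued until everything ahead of them has completed.
    """
    dropped = 0
    while pending and pending[0].done():
        pending.popleft()
        dropped += 1
    return dropped

def demo():
    gate = threading.Event()
    pending = deque()
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending.append(pool.submit(gate.wait))  # blocks until released
        second = pool.submit(lambda: 42)
        pending.append(second)
        second.result()                         # second task has finished
        blocked = drop_finished_front(pending)  # front still running
        gate.set()
        pending[0].result()
        drained = drop_finished_front(pending)  # both are finished now
    return blocked, drained, len(pending)
```

The front-only check keeps the cleanup cheap per submission, at the cost of holding finished entries behind a slow one.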
void matmul_shaders(bool fp16, MatMulIdType matmul_id_type, bool coopmat, bool coopmat2, bool f16acc, bool dot2 = false) {
+270 -27
@@ -6,6 +6,7 @@ import re
import argparse
import statistics
import logging
from typing import Any, Dict, List, Optional
from collections import defaultdict
@@ -25,12 +26,47 @@ COL_MAP = {
}
op_pattern = re.compile(
r"profile-op\s+(?P<op_name>[A-Z_0-9+]+):\s+.*?\s+:\s+(?P<dims>[\d:x\s\->!]+)\s+:\s+(?P<types>[a-z\d_\s\->x]+)\s+:\s+.*?\s+(?:op-)?usec\s+(?P<usec>\d+)\s+(?:op-)?cycles\s+(?P<cycles>\d+)(?:\s+pmu\s+\[(?P<pmu>[\d,\s]+)\])?"
r"profile-op\s+(?P<op_name>[A-Z_0-9+]+):\s+.*?\s+:\s+(?P<dims>[\d:x\s\->!]+)\s+:\s+(?P<types>[a-z\d_\s\->x]+)\s+:\s+.*?\s+(?:op-)?usec\s+(?P<usec>\d+)\s+(?:op-)?cycles\s+(?P<cycles>\d+)(?:\s+start\s+(?P<start>\d+))?(?:\s+mhz\s+(?P<mhz>[\d.]+))?(?:\s+pmu\s+\[(?P<pmu>[\d,\s]+)\])?(?:\s+evt\s+\[(?P<evt>[\d,\s]+)\])?"
)
trace_pattern = re.compile(
r"trace-op\s+(?P<op_name>[A-Z_0-9+]+):\s+thread\s+(?P<thread>\d+)\s+event\s+(?P<event>[A-Z_0-9\-]+)\s+info\s+(?P<info>\d+)\s+(?P<state>start|stop)\s+(?P<cycles>\d+)"
)
logger = logging.getLogger("ggml-hexagon-profile")
def normalize_event_name(evt_type):
if evt_type == "HVX_COMP":
return "V-COMP"
if evt_type == "HMX_COMP":
return "M-COMP"
# Strip HVX_ or HMX_ prefixes
name = evt_type
if name.startswith("HVX_") or name.startswith("HMX_"):
name = name[4:]
return name.replace("_", "-")
class CycleUnwrapper:
def __init__(self):
self.last_raw = None
self.high_part = 0
def unwrap(self, raw):
if self.last_raw is None:
self.last_raw = raw
return raw
diff = raw - self.last_raw
if diff < -0x80000000:
self.high_part += 0x100000000
elif diff > 0x80000000:
self.high_part -= 0x100000000
self.last_raw = raw
return raw + self.high_part
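Note on `CycleUnwrapper`: the class extends a wrapping 32-bit cycle counter into a monotonically meaningful value by tracking a high part that is adjusted whenever two consecutive raw readings differ by more than half the counter range. The example below restates the class from this diff so it runs on its own and feeds it a sequence that wraps forward and then steps back across the boundary. The `unwrap_all` helper is hypothetical.

```python
class CycleUnwrapper:
    """Restated from the diff so the example is self-contained."""

    def __init__(self):
        self.last_raw = None
        self.high_part = 0

    def unwrap(self, raw):
        if self.last_raw is None:
            self.last_raw = raw
            return raw
        diff = raw - self.last_raw
        if diff < -0x80000000:
            self.high_part += 0x100000000  # counter wrapped forward
        elif diff > 0x80000000:
            self.high_part -= 0x100000000  # reading from before the wrap
        self.last_raw = raw
        return raw + self.high_part

def unwrap_all(raw_values):
    """Unwrap a sequence of raw 32-bit readings with one shared state."""
    unwrapper = CycleUnwrapper()
    return [unwrapper.unwrap(r) for r in raw_values]
```

Because one unwrapper instance is shared across the whole log, readings must be fed in log order for the half-range heuristic to hold.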
def parse_log(file_path, pmu_index=None):
try:
if file_path != "-":
@@ -41,35 +77,211 @@ def parse_log(file_path, pmu_index=None):
logger.error(f"file '{file_path}' not found.")
sys.exit(1)
all_ops = []
all_ops: List[Dict[str, Any]] = []
current_op: Optional[Dict[str, Any]] = None
timestamp_pattern = re.compile(r"^(?P<min>\d+)\.(?P<sec>\d+)\.(?P<ms>\d+)\.(?P<us>\d+)\s+[A-Z]\s+")
unwrapper = CycleUnwrapper()
for line in f:
match = op_pattern.search(line)
if not match: continue
ts_match = timestamp_pattern.match(line)
abs_usec = 0
if ts_match:
abs_usec = (
(int(ts_match.group('min')) * 60 + int(ts_match.group('sec'))) * 1000000
+ int(ts_match.group('ms')) * 1000
+ int(ts_match.group('us'))
)
pmu_raw = match.group('pmu')
pmu_val = None
if pmu_raw and pmu_index is not None:
try:
pmu_list = [int(x.strip()) for x in pmu_raw.split(',')]
if len(pmu_list) > pmu_index:
pmu_val = pmu_list[pmu_index]
except (ValueError, IndexError):
pmu_val = None
op_match = op_pattern.search(line)
if op_match:
pmu_raw = op_match.group('pmu')
pmu_val = None
if pmu_raw and pmu_index is not None:
try:
pmu_list = [int(x.strip()) for x in pmu_raw.split(',')]
if len(pmu_list) > pmu_index:
pmu_val = pmu_list[pmu_index]
except (ValueError, IndexError):
pmu_val = None
all_ops.append({
'name': match.group('op_name'),
'dims': match.group('dims').strip(),
'types': match.group('types').strip(),
'usec': int(match.group('usec')),
'cycles': int(match.group('cycles')),
'pmu_val': pmu_val
})
evt_raw = op_match.group('evt')
evt_val = None
if evt_raw:
try:
evt_val = [int(x.strip()) for x in evt_raw.split(',')]
except ValueError:
evt_val = None
cycles_start_raw = op_match.group('start')
unwrapped_cycles_start = None
if cycles_start_raw:
unwrapped_cycles_start = unwrapper.unwrap(int(cycles_start_raw))
idx = line.find("profile-op ")
op_text = line[idx + 11:].strip() if idx != -1 else line.strip()
current_op = {
'name': op_match.group('op_name'),
'dims': op_match.group('dims').strip(),
'types': op_match.group('types').strip(),
'op_text': op_text,
'usec': int(op_match.group('usec')),
'cycles': int(op_match.group('cycles')),
'cycles_start': int(cycles_start_raw) if cycles_start_raw else None,
'unwrapped_cycles_start': unwrapped_cycles_start,
'pmu_val': pmu_val,
'evt_val': evt_val,
'abs_usec': abs_usec,
'trace_events': []
}
all_ops.append(current_op)
continue
trace_match = trace_pattern.search(line)
if trace_match and current_op:
if trace_match.group('op_name') == current_op['name']:
raw_cyc = int(trace_match.group('cycles'))
current_op['trace_events'].append({
'thread': int(trace_match.group('thread')),
'event': trace_match.group('event'),
'info': int(trace_match.group('info')),
'cycles': raw_cyc,
'unwrapped_cycles': unwrapper.unwrap(raw_cyc),
'state': trace_match.group('state')
})
f.close()
return all_ops
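Note on the line formats: `parse_log` now reads an optional leading timestamp and, for lines following a `profile-op` record, per-thread `trace-op` events. The sketch below applies the two regular expressions from this diff to one sample line. The sample line is hypothetical and was written only to satisfy the patterns; real log lines may carry additional fields. The `parse_trace_line` helper is also hypothetical.

```python
import re

# Both patterns are copied from the diff (the second is split for width).
timestamp_pattern = re.compile(
    r"^(?P<min>\d+)\.(?P<sec>\d+)\.(?P<ms>\d+)\.(?P<us>\d+)\s+[A-Z]\s+"
)
trace_pattern = re.compile(
    r"trace-op\s+(?P<op_name>[A-Z_0-9+]+):\s+thread\s+(?P<thread>\d+)"
    r"\s+event\s+(?P<event>[A-Z_0-9\-]+)\s+info\s+(?P<info>\d+)"
    r"\s+(?P<state>start|stop)\s+(?P<cycles>\d+)"
)

def parse_trace_line(line):
    """Return a dict for one trace-op line, or None if it does not match."""
    abs_usec = 0
    ts = timestamp_pattern.match(line)
    if ts:
        # minutes.seconds.milliseconds.microseconds -> microseconds
        abs_usec = (
            (int(ts.group("min")) * 60 + int(ts.group("sec"))) * 1000000
            + int(ts.group("ms")) * 1000
            + int(ts.group("us"))
        )
    m = trace_pattern.search(line)
    if not m:
        return None
    return {
        "abs_usec": abs_usec,
        "op_name": m.group("op_name"),
        "thread": int(m.group("thread")),
        "event": m.group("event"),
        "info": int(m.group("info")),
        "state": m.group("state"),
        "cycles": int(m.group("cycles")),
    }

# Hypothetical sample line, not taken from a real log.
SAMPLE = "01.02.003.004 I trace-op MUL_MAT: thread 3 event HVX_COMP info 7 start 123456"
```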
def print_ascii_timeline(op_name, dims, types, usec, cycles, events, evt_val=None):
evt_str = ""
if evt_val:
evt_str = " - evt [" + ",".join(str(x) for x in evt_val) + "]"
logger.info("=" * 100)
logger.info(f"{op_name} ({dims} : {types}) - {usec} usec {cycles} cycles{evt_str}")
logger.info("=" * 100)
events = sorted(events, key=lambda e: e['cycles'])
if not events:
logger.info(" No trace events recorded.")
return
min_cycles = events[0]['cycles']
logger.info("Cycles %-30s" % "EventDetails" + " ".join(f"T{i:<2}" for i in range(10)) + " HMX")
logger.info("-" * 100)
thread_stacks = [[] for _ in range(11)]
for e in events:
t = e['thread']
if t < 0 or t > 10:
continue
if e['cycles'] >= min_cycles:
rel_cycles = e['cycles'] - min_cycles
else:
rel_cycles = (e['cycles'] + 0x100000000) - min_cycles
state = e['state']
evt_type = e['event']
# Determine char representing the event
norm_evt = normalize_event_name(evt_type)
char = '?'
if norm_evt == 'V-COMP':
char = 'V'
elif norm_evt == 'M-COMP':
char = 'H'
elif norm_evt == 'A-QUANT':
char = 'Q'
elif norm_evt == 'A-PREP':
char = 'A'
elif norm_evt == 'W-DEQUANT':
char = 'D'
elif norm_evt == 'O-PROC':
char = 'O'
elif norm_evt == 'W-PREP':
char = 'P'
elif norm_evt == 'DMA':
char = 'M'
if state == 'start':
thread_stacks[t].append(char)
elif state == 'stop':
if thread_stacks[t]:
if thread_stacks[t][-1] == char:
thread_stacks[t].pop()
elif char in thread_stacks[t]:
thread_stacks[t].remove(char)
else:
thread_stacks[t].pop()
cols = []
for i in range(11):
if thread_stacks[i]:
cols.append(f"[{thread_stacks[i][-1]}]")
else:
cols.append(" | ")
evt_desc = f"T{t}: {evt_type} {state} ({e['info']})"
logger.info(f"{rel_cycles:10d} %-30s" % evt_desc + " ".join(cols[:10]) + " " + cols[10])
logger.info("-" * 100)
def print_ascii_summary(op_name, dims, types, usec, cycles, events, evt_val=None):
evt_str = ""
if evt_val:
evt_str = " - evt [" + ",".join(str(x) for x in evt_val) + "]"
logger.info("=" * 100)
logger.info(f"{op_name} ({dims} : {types}) - {usec} usec {cycles} cycles{evt_str}")
logger.info("=" * 100)
events = sorted(events, key=lambda e: e['cycles'])
if not events:
logger.info(" No trace events recorded.")
return
active_starts = {}
thread_totals = defaultdict(lambda: defaultdict(int))
for e in events:
t = e['thread']
evt = e['event']
info = e['info']
cyc = e['cycles']
state = e['state']
key = (t, evt, info)
if state == 'start':
active_starts[key] = cyc
elif state == 'stop':
if key in active_starts:
start_cyc = active_starts[key]
del active_starts[key]
if cyc >= start_cyc:
dur = cyc - start_cyc
else:
dur = (cyc + 0x100000000) - start_cyc
norm_evt = normalize_event_name(evt)
thread_totals[t][norm_evt] += dur
for t in sorted(thread_totals.keys()):
thread_name = f"Thread {t} (HVX)" if t != 10 else "Thread 10 (HMX)"
sorted_evts = sorted(thread_totals[t].items(), key=lambda item: item[0])
evt_strs = []
for evt, dur in sorted_evts:
pct = (dur / cycles * 100) if cycles > 0 else 0
evt_strs.append(f"{evt} {dur} ({pct:.1f}%)")
logger.info(f" {thread_name:<16}: " + " | ".join(evt_strs))
def generate_report(ops, top_n, width_overrides, sort_col, pmu_name=None):
if not ops:
logger.info("No valid records found.")
@@ -115,7 +327,6 @@ def generate_report(ops, top_n, width_overrides, sort_col, pmu_name=None):
# Sorting logic
actual_sort_key = COL_MAP[sort_col][2]
# We sort numeric fields descending, strings (op/dims) ascending
is_numeric = actual_sort_key.startswith("_") or actual_sort_key == "count"
sorted_groups = sorted(group_stats, key=lambda x: x[actual_sort_key], reverse=is_numeric)[:top_n]
@@ -132,7 +343,7 @@ def generate_report(ops, top_n, width_overrides, sort_col, pmu_name=None):
if "pmu" in col_name and pmu_name:
header_text = header_text.replace("PMU", pmu_name)
natural_width = max([len(row[data_key]) for row in sorted_groups] + [len(header_text)])
natural_width = max([len(str(row[data_key])) for row in sorted_groups] + [len(header_text)])
target_width = width_overrides.get(col_name, natural_width)
if target_width == 0:
@@ -152,7 +363,7 @@ def generate_report(ops, top_n, width_overrides, sort_col, pmu_name=None):
for group in sorted_groups:
row_vals = []
for i, key in enumerate(final_keys):
val = group[key]
val = str(group[key])
if len(val) > final_widths[i]:
val = val[:final_widths[i] - 3] + "..."
row_vals.append(f"{val:<{final_widths[i]}}")
@@ -167,12 +378,18 @@ def main():
parser.add_argument("--pmu-index", type=int)
parser.add_argument("--pmu-name", type=str)
parser.add_argument("--width", action='append', default=['dims:40'], help="Override column width, e.g. --width dims:50")
parser.add_argument("--timeline", type=str, nargs='?', const='summary', choices=["summary", "diagram"],
help="Output ASCII art event summary or timing diagram (default: summary)")
parser.add_argument("--filter", type=str, help="Regex filter matching against the original profile-op line")
group = parser.add_mutually_exclusive_group()
group.add_argument("--head", type=int, help="Limit to first N ops")
group.add_argument("--tail", type=int, help="Limit to last N ops")
args = parser.parse_args()
logging.basicConfig(level=logging.INFO, format='%(message)s')
# Sort validation: can't sort by PMU if index isn't provided
if "pmu" in args.sort and args.pmu_index is None:
logger.error(f"Cannot sort by '{args.sort}' without --pmu-index.")
sys.exit(1)
@@ -188,7 +405,33 @@ def main():
final_pmu_name = (args.pmu_name or f"#{args.pmu_index}") if args.pmu_index is not None else None
ops = parse_log(args.logfile, pmu_index=args.pmu_index)
generate_report(ops, args.top, overrides, args.sort, pmu_name=final_pmu_name)
if args.filter:
try:
filter_re = re.compile(args.filter)
except re.error as e:
logger.error(f"Invalid regex filter: {e}")
sys.exit(1)
ops = [op for op in ops if filter_re.search(op['op_text'])]
if args.head is not None:
ops = ops[:args.head]
elif args.tail is not None:
ops = ops[-args.tail:]
if args.timeline:
logger.info(f"\n# ASCII Timing {args.timeline.capitalize()}\n")
printed_cnt = 0
for op in ops:
if args.timeline == "summary":
print_ascii_summary(op['name'], op['dims'], op['types'], op['usec'], op['cycles'], op['trace_events'], op.get('evt_val'))
elif args.timeline == "diagram":
print_ascii_timeline(op['name'], op['dims'], op['types'], op['usec'], op['cycles'], op['trace_events'], op.get('evt_val'))
printed_cnt += 1
if printed_cnt >= args.top:
break
else:
generate_report(ops, args.top, overrides, args.sort, pmu_name=final_pmu_name)
if __name__ == "__main__":
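The summary path above pairs each `start` with its matching `stop` by `(thread, event, info)` and adds `0x100000000` when the stop value is numerically smaller than the start. A minimal sketch of that pairing follows, assuming the logged cycle counter is 32 bits wide and wraps at most once between a start and its stop. The names `wrapped_delta` and `accumulate` are illustrative and not part of the script. The sketch walks events in log order; sorting by the raw cycle value first would place a wrapped `stop` ahead of its `start`, and the pair would then be dropped.

```python
from collections import defaultdict

WRAP = 1 << 32  # assumed width of the logged cycle counter


def wrapped_delta(start, stop):
    """Cycles elapsed from start to stop, tolerating a single 32-bit wrap."""
    return (stop - start) % WRAP


def accumulate(events):
    """Sum per-thread, per-event durations from start/stop pairs (log order)."""
    active = {}
    totals = defaultdict(lambda: defaultdict(int))
    for e in events:
        key = (e["thread"], e["event"], e["info"])
        if e["state"] == "start":
            active[key] = e["cycles"]
        elif e["state"] == "stop" and key in active:
            totals[e["thread"]][e["event"]] += wrapped_delta(active.pop(key), e["cycles"])
    return totals


events = [
    # This pair straddles the 32-bit wrap: it starts near the top and stops after zero.
    {"thread": 0, "event": "HVX_COMP", "info": 0, "state": "start", "cycles": 0xFFFFFFF0},
    {"thread": 0, "event": "HVX_COMP", "info": 0, "state": "stop", "cycles": 0x10},
    {"thread": 10, "event": "HMX_COMP", "info": 0, "state": "start", "cycles": 100},
    {"thread": 10, "event": "HMX_COMP", "info": 0, "state": "stop", "cycles": 350},
]
totals = accumulate(events)
print(totals[0]["HVX_COMP"], totals[10]["HMX_COMP"])  # prints "32 250"
```

The modulo form gives the same result as the explicit `if cyc >= start_cyc` branch for any two values in the 32-bit range.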
+463
@@ -0,0 +1,463 @@
#!/usr/bin/env python3
import sys
import os
import re
import argparse
import statistics
import logging
from typing import Any, Dict, List, Optional
from collections import defaultdict
logger = logging.getLogger("ggml-hexagon-trace")
op_pattern = re.compile(
r"profile-op\s+(?P<op_name>[A-Z_0-9+]+):\s+.*?\s+:\s+(?P<dims>[\d:x\s\->!]+)\s+:\s+(?P<types>[a-z\d_\s\->x]+)\s+:\s+(?P<strides>[\d:x\s\->!]+)\s+:\s+(?:op-)?usec\s+(?P<usec>\d+)\s+(?:op-)?cycles\s+(?P<cycles>\d+)(?:\s+start\s+(?P<start>\d+))?(?:\s+mhz\s+(?P<mhz>[\d.]+))?(?:\s+pmu\s+\[(?P<pmu>[\d,\s]+)\])?(?:\s+evt\s+\[(?P<evt>[\d,\s]+)\])?"
)
trace_pattern = re.compile(
r"trace-op\s+(?P<op_name>[A-Z_0-9+]+):\s+thread\s+(?P<thread>\d+)\s+event\s+(?P<event>[A-Z_0-9\-]+)\s+info\s+(?P<info>\d+)\s+(?P<state>start|stop)\s+(?P<cycles>\d+)"
)
def normalize_event_name(evt_type):
if evt_type == "HVX_COMP":
return "V-COMP"
if evt_type == "HMX_COMP":
return "M-COMP"
name = evt_type
if name.startswith("HVX_") or name.startswith("HMX_"):
name = name[4:]
return name.replace("_", "-")
class CycleUnwrapper:
def __init__(self):
self.last_raw = None
self.high_part = 0
def unwrap(self, raw):
if self.last_raw is None:
self.last_raw = raw
return raw
diff = raw - self.last_raw
if diff < -0x80000000:
self.high_part += 0x100000000
elif diff > 0x80000000:
self.high_part -= 0x100000000
self.last_raw = raw
return raw + self.high_part
def parse_log(file_path):
try:
if file_path != "-":
f = open(file_path, 'r', encoding='utf-8', errors='ignore')
else:
f = os.fdopen(0, 'r', encoding='utf-8', errors='ignore')
except FileNotFoundError:
logger.error(f"file '{file_path}' not found.")
sys.exit(1)
all_ops: List[Dict[str, Any]] = []
current_op: Optional[Dict[str, Any]] = None
unwrapper = CycleUnwrapper()
line_idx = 0
for line in f:
line_idx += 1
op_match = op_pattern.search(line)
if op_match:
cycles_start_raw = op_match.group('start')
unwrapped_cycles_start = None
if cycles_start_raw:
unwrapped_cycles_start = unwrapper.unwrap(int(cycles_start_raw))
idx = line.find("profile-op ")
op_text = line[idx + 11:].strip() if idx != -1 else line.strip()
current_op = {
'name': op_match.group('op_name'),
'dims': op_match.group('dims').strip() if op_match.group('dims') else '',
'types': op_match.group('types').strip() if op_match.group('types') else '',
'strides': op_match.group('strides').strip() if op_match.group('strides') else '',
'op_text': op_text,
'usec': int(op_match.group('usec')),
'cycles': int(op_match.group('cycles')),
'cycles_start': int(cycles_start_raw) if cycles_start_raw else None,
'unwrapped_cycles_start': unwrapped_cycles_start,
'trace_events': [],
'line_num': line_idx
}
all_ops.append(current_op)
continue
trace_match = trace_pattern.search(line)
if trace_match and current_op:
if trace_match.group('op_name') == current_op['name']:
raw_cyc = int(trace_match.group('cycles'))
current_op['trace_events'].append({
'thread': int(trace_match.group('thread')),
'event': trace_match.group('event'),
'info': int(trace_match.group('info')),
'cycles': raw_cyc,
'unwrapped_cycles': unwrapper.unwrap(raw_cyc),
'state': trace_match.group('state')
})
f.close()
return all_ops
# --- Simple protobuf encoder ---
def write_varint(val):
if val < 0:
val = (1 << 64) + val
res = bytearray()
while True:
towrite = val & 0x7f
val >>= 7
if val > 0:
res.append(towrite | 0x80)
else:
res.append(towrite)
break
return bytes(res)
def pb_field(num, wire, data):
return write_varint((num << 3) | wire) + data
def pb_varint(num, val):
return pb_field(num, 0, write_varint(val))
def pb_length_delimited(num, data):
return pb_field(num, 2, write_varint(len(data)) + data)
def pb_string(num, text):
return pb_length_delimited(num, text.encode('utf-8'))
# Message Encoders
def make_process_descriptor(pid, name):
return pb_varint(1, pid) + pb_string(6, name)
def make_thread_descriptor(pid, tid, name, sort_index=None):
payload = pb_varint(1, pid) + pb_varint(2, tid) + pb_string(5, name)
if sort_index is not None:
payload += pb_varint(3, sort_index)
return payload
def make_track_descriptor(uuid, name=None, parent_uuid=None, thread=None, process=None, sibling_merge_behavior=None, child_ordering=None, sibling_order_rank=None):
payload = pb_varint(1, uuid)
if name is not None:
payload += pb_string(2, name)
if parent_uuid is not None:
payload += pb_varint(5, parent_uuid)
if process is not None:
payload += pb_length_delimited(3, process)
if thread is not None:
payload += pb_length_delimited(4, thread)
if sibling_merge_behavior is not None:
payload += pb_varint(15, sibling_merge_behavior)
if child_ordering is not None:
payload += pb_varint(11, child_ordering)
if sibling_order_rank is not None:
payload += pb_varint(12, sibling_order_rank)
return payload
def make_debug_annotation(name, string_val=None, int_val=None):
payload = pb_string(10, name)
if string_val is not None:
payload += pb_string(6, string_val)
elif int_val is not None:
payload += pb_varint(4, int_val)
return payload
def make_track_event(event_type, track_uuid, name=None, category=None, debug_annotations=None):
payload = pb_varint(9, event_type)
payload += pb_varint(11, track_uuid)
if name is not None:
payload += pb_string(23, name)
if category is not None:
payload += pb_string(22, category)
if debug_annotations is not None:
for da in debug_annotations:
payload += pb_length_delimited(4, da)
return payload
def make_trace_packet(timestamp, track_event=None, track_descriptor=None, seq_id=1):
payload = pb_varint(8, timestamp)
payload += pb_varint(10, seq_id)
if track_event is not None:
payload += pb_length_delimited(11, track_event)
if track_descriptor is not None:
payload += pb_length_delimited(60, track_descriptor)
return payload
def write_trace_packet_to_file(f, packet_bytes):
# Write as field 1 of top-level Trace message
f.write(pb_length_delimited(1, packet_bytes))
# --- End Protobuf Encoder ---
def generate_perfetto_trace(filtered_ops, output_path):
if not filtered_ops:
logger.warning("No operators found after filtering.")
return
# Compute average frequency
frequencies = []
for op in filtered_ops:
if op['usec'] > 0 and op['cycles'] > 0:
frequencies.append(op['cycles'] / op['usec'])
avg_freq_mhz = statistics.mean(frequencies) if frequencies else 1000.0
if avg_freq_mhz <= 0:
avg_freq_mhz = 1000.0
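# Assign start and end cycles to each operator.
# Ops logged without a start cycle cannot be placed on the timeline, so skip them.
filtered_ops = [op for op in filtered_ops if op['unwrapped_cycles_start'] is not None]
if not filtered_ops:
logger.warning("No operators with a start cycle found.")
return
for op in filtered_ops:
op['start_cycles'] = op['unwrapped_cycles_start']
op['end_cycles'] = op['start_cycles'] + op['cycles']
global_min_cyc = min(op['start_cycles'] for op in filtered_ops)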
# Process events
completed_events = []
for op in filtered_ops:
events = op['trace_events']
if not events:
continue
events = sorted(events, key=lambda e: e['unwrapped_cycles'])
active_starts = {}
for e in events:
t = e['thread']
evt = e['event']
info = e['info']
state = e['state']
cyc = e['unwrapped_cycles']
key = (t, evt, info)
if state == 'start':
active_starts[key] = cyc
elif state == 'stop':
if key in active_starts:
start_cyc = active_starts[key]
del active_starts[key]
completed_events.append({
'thread': t,
'event': evt,
'info': info,
'start_cyc': start_cyc,
'end_cyc': cyc,
'op_name': op['name']
})
completed_events.sort(key=lambda e: e['start_cyc'])
# Convert event times to integer nanoseconds; clamp each duration to at least 0.1 us (100 ns)
for e in completed_events:
start_us = (e['start_cyc'] - global_min_cyc) / avg_freq_mhz
dur_us = (e['end_cyc'] - e['start_cyc']) / avg_freq_mhz
e['ts_ns'] = int(round(start_us * 1000))
e['dur_ns'] = int(round(max(dur_us, 0.1) * 1000))
# Allocate slots (sub-tracks) to prevent overlaps on same virtual track
active_slots = defaultdict(list)
for e in completed_events:
t = e['thread']
evt = e['event']
ts = e['ts_ns']
dur = e['dur_ns']
norm_evt = normalize_event_name(evt)
if norm_evt == "DMA":
track_key = (t, "DMA")
elif t == 10:
track_key = (t, "HMX")
else:
track_key = (t, "HVX")
slots = active_slots[track_key]
allocated_slot = -1
for idx, slot_end_ns in enumerate(slots):
if ts >= slot_end_ns:
slots[idx] = ts + dur
allocated_slot = idx
break
if allocated_slot == -1:
slots.append(ts + dur)
allocated_slot = len(slots) - 1
e['slot'] = allocated_slot
# Generate Track IDs and track definitions
used_tracks = {}
for e in completed_events:
t = e['thread']
evt = e['event']
slot = e['slot']
norm_evt = normalize_event_name(evt)
if norm_evt == "DMA":
track_evt = "DMA"
evt_id = 1
elif t == 10:
track_evt = "HMX"
evt_id = 3
else:
track_evt = "HVX"
evt_id = 2
t_sort = 1 if t == 10 else t + 2
# Unique UUID for each sub-track
if t == 10:
uuid = 20 # HMX thread track UUID
else:
uuid = int(t_sort * 1000000 + evt_id * 1000 + slot)
e['uuid'] = uuid
used_tracks[uuid] = (t, track_evt, slot)
with open(output_path, "wb") as f:
# Define Process with EXPLICIT child sorting
proc_desc = make_process_descriptor(1, "HTP NPU")
proc_packet = make_trace_packet(0, track_descriptor=make_track_descriptor(1, process=proc_desc, child_ordering=3))
write_trace_packet_to_file(f, proc_packet)
# Define Operators Track (UUID = 2) as a thread track at rank 1, tid 8
op_thread_desc = make_thread_descriptor(1, 8, "Ops", sort_index=1)
op_packet = make_trace_packet(0, track_descriptor=make_track_descriptor(2, parent_uuid=1, thread=op_thread_desc))
write_trace_packet_to_file(f, op_packet)
# Define HMX Thread Track (UUID = 20) at rank 2, tid 9
hmx_thread_desc = make_thread_descriptor(1, 9, "HMX", sort_index=2)
hmx_packet = make_trace_packet(0, track_descriptor=make_track_descriptor(20, parent_uuid=1, thread=hmx_thread_desc))
write_trace_packet_to_file(f, hmx_packet)
# Define Thread Tracks (T0, T1, ..., T9)
unique_threads = sorted(list(set(t for (t, _, _) in used_tracks.values() if t != 10)))
for t in unique_threads:
thread_uuid = 10 + t
thread_name = f"T{t}"
# Sort order starts from index 3 (T0 -> 3, T1 -> 4, etc.)
sort_index = 3 + t
tid = 10 + t
thread_desc = make_thread_descriptor(1, tid, thread_name, sort_index=sort_index)
thread_packet = make_trace_packet(0, track_descriptor=make_track_descriptor(
thread_uuid,
parent_uuid=1,
thread=thread_desc,
sibling_order_rank=sort_index,
child_ordering=3 # Explicit child sorting for sub-tracks
))
write_trace_packet_to_file(f, thread_packet)
# Define Track descriptors for sub-tracks parented to thread tracks
for uuid in sorted(used_tracks.keys()):
if uuid == 20:
continue
t, evt, slot = used_tracks[uuid]
name = f"T{t} {evt}"
rank = 0 if evt == "HVX" else 1
parent_thread_uuid = 10 + t
# Sibling merge behavior: 1 (SIBLING_MERGE_BEHAVIOR_BY_TRACK_NAME)
track_desc = make_track_descriptor(
uuid=uuid,
name=name,
parent_uuid=parent_thread_uuid,
sibling_merge_behavior=1,
sibling_order_rank=rank
)
track_packet = make_trace_packet(0, track_descriptor=track_desc)
write_trace_packet_to_file(f, track_packet)
# Emit Operators
last_op_end_ns = 0
for op in filtered_ops:
op_start_ns = int(round(((op['start_cycles'] - global_min_cyc) / avg_freq_mhz) * 1000))
op_dur_ns = int(round((op['cycles'] / avg_freq_mhz) * 1000))
if op_start_ns < last_op_end_ns:
op_start_ns = last_op_end_ns
clamped_dur = max(op_dur_ns, 100) # Clamp to 100ns (0.1us)
# Debug annotations for Ops
debug_annots = []
if 'line_num' in op:
debug_annots.append(make_debug_annotation("line", int_val=op['line_num']))
if 'strides' in op and op['strides']:
debug_annots.append(make_debug_annotation("strides", string_val=op['strides']))
# Slice Begin
evt_begin = make_track_event(1, 2, name=f"{op['name']} ({op['dims']})", category="operator", debug_annotations=debug_annots)
packet_begin = make_trace_packet(op_start_ns, track_event=evt_begin)
write_trace_packet_to_file(f, packet_begin)
# Slice End
evt_end = make_track_event(2, 2)
packet_end = make_trace_packet(op_start_ns + clamped_dur, track_event=evt_end)
write_trace_packet_to_file(f, packet_end)
last_op_end_ns = op_start_ns + clamped_dur
# Emit Thread Trace Events
for e in completed_events:
norm_name = normalize_event_name(e['event'])
name = f"DMA {e['info']}" if norm_name == "DMA" else norm_name
# Slice Begin
evt_begin = make_track_event(1, e['uuid'], name=name, category="trace")
packet_begin = make_trace_packet(e['ts_ns'], track_event=evt_begin)
write_trace_packet_to_file(f, packet_begin)
# Slice End
evt_end = make_track_event(2, e['uuid'])
packet_end = make_trace_packet(e['ts_ns'] + e['dur_ns'], track_event=evt_end)
write_trace_packet_to_file(f, packet_end)
logger.info(f"Successfully generated Perfetto trace at {output_path}")
def main():
parser = argparse.ArgumentParser(description="Convert Hexagon Op profile logs to native Perfetto Protobuf traces.")
parser.add_argument("logfile", help="Path to hex-log profile file")
parser.add_argument("-o", "--output", default="optrace.perfetto-trace", help="Output trace file path (default: optrace.perfetto-trace)")
parser.add_argument("--filter", type=str, help="Regex filter matching against the original profile-op line")
group = parser.add_mutually_exclusive_group()
group.add_argument("--head", type=int, help="Limit to first N ops")
group.add_argument("--tail", type=int, help="Limit to last N ops")
args = parser.parse_args()
logging.basicConfig(level=logging.INFO, format='%(message)s')
ops = parse_log(args.logfile)
if args.filter:
try:
filter_re = re.compile(args.filter)
except re.error as e:
logger.error(f"Invalid regex filter: {e}")
sys.exit(1)
ops = [op for op in ops if filter_re.search(op['op_text'])]
if args.head is not None:
ops = ops[:args.head]
elif args.tail is not None:
ops = ops[-args.tail:]
generate_perfetto_trace(ops, args.output)
if __name__ == "__main__":
main()
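The sub-track allocation in `generate_perfetto_trace` is a greedy first-fit: each completed event, taken in start-time order, goes into the lowest-numbered slot whose previous event has already ended, and a new slot is opened only when none is free. A minimal standalone sketch of that idea follows, assuming intervals are given as `(start, duration)` pairs already sorted by start. `allocate_slots` is an illustrative name, not a function from the script.

```python
def allocate_slots(intervals):
    """Assign each (start, duration) interval the lowest free slot index.

    Intervals must be sorted by start time. Returns one slot index per interval.
    """
    slot_ends = []  # slot_ends[i] is the end time of the last interval placed in slot i
    assigned = []
    for start, dur in intervals:
        for idx, end in enumerate(slot_ends):
            if start >= end:
                # Slot idx is free again: reuse it.
                slot_ends[idx] = start + dur
                assigned.append(idx)
                break
        else:
            # Every existing slot is still busy: open a new one.
            slot_ends.append(start + dur)
            assigned.append(len(slot_ends) - 1)
    return assigned


intervals = [(0, 10), (5, 10), (10, 5), (12, 1), (15, 3)]
print(allocate_slots(intervals))  # prints [0, 1, 0, 2, 0]
```

Because an interval that starts exactly when a slot ends may reuse that slot (`start >= end`), back-to-back events stay on one sub-track and only genuinely overlapping events spill onto additional ones.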
+23 -4
@@ -20,6 +20,7 @@ set(LLAMA_UI_GZIP "" CACHE STRING "Apply gzip compress to assets to save ban
set(DIST_DIR "${UI_BINARY_DIR}/dist")
set(SRC_DIST_DIR "${UI_SOURCE_DIR}/dist")
set(WORK_DIR "${UI_BINARY_DIR}/ui-src")
set(STAMP_FILE "${UI_BINARY_DIR}/.ui-stamp")
set(UI_CPP "${UI_BINARY_DIR}/ui.cpp")
set(UI_H "${UI_BINARY_DIR}/ui.h")
@@ -64,6 +65,22 @@ function(npm_build_should_skip out_var)
set(${out_var} TRUE PARENT_SCOPE)
endfunction()
function(stage_sources)
if(EXISTS "${WORK_DIR}")
file(GLOB staged RELATIVE "${WORK_DIR}" "${WORK_DIR}/*")
list(REMOVE_ITEM staged "node_modules")
foreach(entry ${staged})
file(REMOVE_RECURSE "${WORK_DIR}/${entry}")
endforeach()
endif()
file(COPY "${UI_SOURCE_DIR}/"
DESTINATION "${WORK_DIR}"
NO_SOURCE_PERMISSIONS
PATTERN "node_modules" EXCLUDE
)
endfunction()
function(npm_build out_var)
set(${out_var} FALSE PARENT_SCOPE)
@@ -89,14 +106,16 @@ function(npm_build out_var)
return()
endif()
stage_sources()
# npm writes node_modules/.package-lock.json on every successful install,
# so a package-lock.json newer than this marker means node_modules is stale
set(NPM_MARKER "${UI_SOURCE_DIR}/node_modules/.package-lock.json")
set(NPM_MARKER "${WORK_DIR}/node_modules/.package-lock.json")
set(need_install FALSE)
if(NOT EXISTS "${NPM_MARKER}")
set(need_install TRUE)
else()
file(TIMESTAMP "${UI_SOURCE_DIR}/package-lock.json" lock_ts)
file(TIMESTAMP "${WORK_DIR}/package-lock.json" lock_ts)
file(TIMESTAMP "${NPM_MARKER}" marker_ts)
if(lock_ts STRGREATER marker_ts)
set(need_install TRUE)
@@ -107,7 +126,7 @@ function(npm_build out_var)
message(STATUS "UI: running npm install")
execute_process(
COMMAND ${NPM_EXECUTABLE} install
WORKING_DIRECTORY "${UI_SOURCE_DIR}"
WORKING_DIRECTORY "${WORK_DIR}"
RESULT_VARIABLE rc
ERROR_VARIABLE err
)
@@ -124,7 +143,7 @@ function(npm_build out_var)
execute_process(
COMMAND ${CMAKE_COMMAND} -E env "LLAMA_UI_OUT_DIR=${DIST_DIR}" "LLAMA_UI_VERSION=${HF_VERSION}" "LLAMA_BUILD_NUMBER=${LLAMA_BUILD_NUMBER}"
${NPM_EXECUTABLE} run build
WORKING_DIRECTORY "${UI_SOURCE_DIR}"
WORKING_DIRECTORY "${WORK_DIR}"
RESULT_VARIABLE rc
ERROR_VARIABLE err
)
+1 -1
@@ -1382,7 +1382,7 @@ int llama_context::encode(const llama_batch & batch_inp) {
const auto & hparams = model.hparams;
// eagle3/DFlash: features as encoder input, and non-draft paths fall back to model's input dim
const int64_t n_embd = hparams.n_embd_inp();
const int64_t n_embd = hparams.n_embd_inp_enc();
const int64_t n_vocab = model.vocab.n_tokens();
// note: during encode, we always pass the full sequence starting from pos = 0
+4
@@ -104,6 +104,10 @@ uint32_t llama_hparams::n_embd_inp() const {
return n_embd_inp;
}
uint32_t llama_hparams::n_embd_inp_enc() const {
return n_embd_inp_enc_impl > 0 ? n_embd_inp_enc_impl : n_embd_inp();
}
uint32_t llama_hparams::n_embd_out() const {
return n_embd_out_impl > 0 ? n_embd_out_impl : n_embd;
}
+7
View File
@@ -189,6 +189,10 @@ struct llama_hparams {
// input embedding dimension (0 = use n_embd)
uint32_t n_embd_inp_impl = 0;
// encoder input embedding dimension (0 = use n_embd_inp())
// e.g. the eagle3 encoder fuses target_layers * target_hidden features
uint32_t n_embd_inp_enc_impl = 0;
// output embedding dimension (0 = use n_embd)
uint32_t n_embd_out_impl = 0;
@@ -305,6 +309,9 @@ struct llama_hparams {
// dimension of main + auxiliary input embeddings
uint32_t n_embd_inp() const;
// dimension of the encoder input embeddings
uint32_t n_embd_inp_enc() const;
// dimension of output embeddings
uint32_t n_embd_out() const;
+1 -1
@@ -249,7 +249,7 @@ static bool llama_prepare_model_devices(const llama_model_params & params, llama
}
// if using single GPU mode, remove all except the main GPU
if (params.split_mode == LLAMA_SPLIT_MODE_NONE) {
if (params.split_mode == LLAMA_SPLIT_MODE_NONE && !model->devices.empty()) {
if (params.main_gpu < 0) {
model->devices.clear();
} else {
+4 -4
@@ -19,7 +19,7 @@ void llama_model_eagle3::load_arch_hparams(llama_model_loader & ml) {
ml.get_key(LLM_KV_TARGET_HIDDEN_SIZE, n_embd_tgt);
LLAMA_LOG_INFO("%s: EAGLE3 n_embd_tgt = %u (draft n_embd = %u)\n", __func__, n_embd_tgt, hparams.n_embd);
hparams.n_embd_inp_impl = (uint32_t) target_layer_ids.size() * n_embd_tgt;
hparams.n_embd_inp_enc_impl = (uint32_t) target_layer_ids.size() * n_embd_tgt;
// eagle3 norm_before_residual (optional, default false)
// compatible with Red Hat eagle3 speculator model
@@ -34,7 +34,7 @@ void llama_model_eagle3::load_arch_hparams(llama_model_loader & ml) {
void llama_model_eagle3::load_arch_tensors(llama_model_loader &) {
LLAMA_LOAD_LOCALS;
const int64_t n_embd_inp = hparams.n_embd_inp();
const int64_t n_embd_inp = hparams.n_embd_inp_enc();
const int64_t n_embd_attn_input = 2 * n_embd;
// Get vocab size from the d2t tensor in the GGUF file (optional - only needed if eagle3 has different vocab_size than target)
@@ -109,8 +109,8 @@ ggml_tensor * llama_model_eagle3::graph<true>::build_inp_embd_enc() const {
// Input: Target model features (3 layers concatenated: low, mid, high)
// Data will be provided via ubatch->embd in encode_eagle3_features()
auto inp_target = std::make_unique<llm_graph_input_embd>(hparams.n_embd_inp());
inp_target->embd = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32,hparams.n_embd_inp(), n_tokens);
auto inp_target = std::make_unique<llm_graph_input_embd>(hparams.n_embd_inp_enc());
inp_target->embd = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, hparams.n_embd_inp_enc(), n_tokens);
ggml_set_input(inp_target->embd);
cur = inp_target->embd;
+6 -8
@@ -6,11 +6,10 @@ Apply LORA adapters to base model and export the resulting model.
usage: llama-export-lora [options]
options:
-m, --model model path from which to load base model (default '')
--lora FNAME path to LoRA adapter (can be repeated to use multiple adapters)
--lora-scaled FNAME S path to LoRA adapter with user defined scaling S (can be repeated to use multiple adapters)
-t, --threads N number of threads to use during computation (default: 4)
-o, --output FNAME output file (default: 'ggml-lora-merged-f16.gguf')
-m, --model FNAME model path from which to load base model
--lora FNAME path to LoRA adapter (use comma-separated values to load multiple adapters)
--lora-scaled FNAME:SCALE,... path to LoRA adapter with user defined scaling (format: FNAME:SCALE,...)
-o, --output, --output-file FNAME output file (default: 'ggml-lora-merged-f16.gguf')
```
For example:
@@ -22,12 +21,11 @@ For example:
--lora lora-open-llama-3b-v2-english2tokipona-chat-LATEST.gguf
```
Multiple LORA adapters can be applied by passing multiple `--lora FNAME` or `--lora-scaled FNAME S` command line parameters:
Multiple LORA adapters can be applied by passing comma-separated values to `--lora FNAME` or `--lora-scaled FNAME:SCALE,...`:
```bash
./bin/llama-export-lora \
-m your_base_model.gguf \
-o your_merged_model.gguf \
--lora-scaled lora_task_A.gguf 0.5 \
--lora-scaled lora_task_B.gguf 0.5
--lora-scaled lora_task_A.gguf:0.5,lora_task_B.gguf:0.5
```
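To make the new `FNAME:SCALE,...` value format concrete, here is a small sketch of how such a string could be split into adapter/scale pairs. It is an illustration only and not the parser used by `llama-export-lora`. The function name `parse_lora_scaled` is hypothetical, and splitting each item on its last colon (so that a path containing a colon still parses) is an assumption.

```python
def parse_lora_scaled(value):
    """Split 'FNAME:SCALE,FNAME:SCALE,...' into a list of (path, scale) tuples."""
    adapters = []
    for item in value.split(","):
        # Split on the last colon so the scale is always the final component.
        path, sep, scale = item.rpartition(":")
        if not sep or not path:
            raise ValueError(f"expected FNAME:SCALE, got {item!r}")
        adapters.append((path, float(scale)))
    return adapters


pairs = parse_lora_scaled("lora_task_A.gguf:0.5,lora_task_B.gguf:0.5")
print(pairs)  # prints [('lora_task_A.gguf', 0.5), ('lora_task_B.gguf', 0.5)]
```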
+35
@@ -0,0 +1,35 @@
# libmtmd dev guide
## History
Please refer to [multimodal.md](../../docs/multimodal.md) for broader context.
In short:
- `libmtmd` started as a wrapper around `libllava` / `clip.cpp`
- Components that used to live in `clip.cpp` are being moved progressively into mtmd. For example, the preprocessor is now part of mtmd
## Terminology
- mtmd: **M**ul**T**i**M**o**D**al
- bitmap: a raw input, for example an RGB image or PCM audio
- tiles / slices: for llava-uhd-style models, the preprocessor breaks a large input into smaller square images called tiles or slices
- chunk: an `mtmd_input_chunk` represents a preprocessed input that can then be passed to `mtmd_encode()`
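As a rough picture of the tiles / slices term, the sketch below lays out a `grid_x` by `grid_y` grid of crop boxes over an image and puts one overview box (the whole image) first, so the result has `grid_x * grid_y + 1` entries. This is an illustration under simplifying assumptions and not the libmtmd preprocessor: the function names are hypothetical, the grid size is taken as given rather than chosen per model, and no resizing is done, so the boxes are not necessarily square.

```python
def tile_boxes(width, height, grid_x, grid_y):
    """Return (left, top, right, bottom) boxes covering the image, row-major."""
    boxes = []
    for gy in range(grid_y):
        for gx in range(grid_x):
            left = gx * width // grid_x
            top = gy * height // grid_y
            right = (gx + 1) * width // grid_x
            bottom = (gy + 1) * height // grid_y
            boxes.append((left, top, right, bottom))
    return boxes


def slice_with_overview(width, height, grid_x, grid_y):
    """Overview box first, followed by the grid tiles."""
    overview = (0, 0, width, height)  # a real preprocessor would downscale this
    return [overview] + tile_boxes(width, height, grid_x, grid_y)


entries = slice_with_overview(1000, 500, grid_x=2, grid_y=1)
print(entries)  # prints [(0, 0, 1000, 500), (0, 0, 500, 500), (500, 0, 1000, 500)]
```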
## Pipeline
A typical pipeline of the core libmtmd is as follows:
- A bitmap (RGB image or PCM audio) is created
- The bitmap and the text prompt are passed to `mtmd_tokenize()`, which breaks the input into chunks
- The tokenizer function first expands a "lazy" bitmap if it finds one. Typically, this is used for video, so that one media token corresponds to one input bitmap
- For models that support "fused" temporal frames like Qwen-VL, the tokenizer tries to merge pairs of consecutive frames into one batch
- The preprocessor is then called, which produces a list of chunks
- Depending on the model, special tokens are injected to separate image chunks (e.g. for llava-uhd-style models)
- Multiple bitmaps may be batched together to form a larger `mtmd_batch()`
- A single image or a batch is encoded via `mtmd_encode()` or `mtmd_batch_encode()`
- The output embeddings are retrieved
## Helper
We provide a set of helper functions via `mtmd_helper` to make using libmtmd easier. The helper provides:
- Image, audio and video file decoding (for example, decode raw JPEG into RGB bitmap)
- Management of `llama_batch` and calls to `llama_decode`
+52 -81
@@ -367,56 +367,56 @@ enum projector_type {
};
static std::map<projector_type, std::string> PROJECTOR_TYPE_NAMES = {
{ PROJECTOR_TYPE_MLP, "mlp" },
{ PROJECTOR_TYPE_LDP, "ldp" },
{ PROJECTOR_TYPE_LDPV2, "ldpv2"},
{ PROJECTOR_TYPE_MINICPMV, "resampler"},
{ PROJECTOR_TYPE_GLM_EDGE, "adapter"},
{ PROJECTOR_TYPE_QWEN2VL, "qwen2vl_merger"},
{ PROJECTOR_TYPE_QWEN25VL, "qwen2.5vl_merger"},
{ PROJECTOR_TYPE_QWEN3VL, "qwen3vl_merger"},
{ PROJECTOR_TYPE_STEP3VL, "step3vl"},
{ PROJECTOR_TYPE_GEMMA3, "gemma3"},
{ PROJECTOR_TYPE_GEMMA3NV, "gemma3nv"},
{ PROJECTOR_TYPE_GEMMA3NA, "gemma3na"},
{ PROJECTOR_TYPE_GEMMA4V, "gemma4v"},
{ PROJECTOR_TYPE_GEMMA4A, "gemma4a"},
{ PROJECTOR_TYPE_GEMMA4UV, "gemma4uv"},
{ PROJECTOR_TYPE_GEMMA4UA, "gemma4ua"},
{ PROJECTOR_TYPE_PHI4, "phi4"},
{ PROJECTOR_TYPE_IDEFICS3, "idefics3"},
{ PROJECTOR_TYPE_PIXTRAL, "pixtral"},
{ PROJECTOR_TYPE_ULTRAVOX, "ultravox"},
{ PROJECTOR_TYPE_INTERNVL, "internvl"},
{ PROJECTOR_TYPE_LLAMA4, "llama4"},
{ PROJECTOR_TYPE_QWEN2A, "qwen2a"},
{ PROJECTOR_TYPE_QWEN3A, "qwen3a"},
{ PROJECTOR_TYPE_GLMA, "glma"},
{ PROJECTOR_TYPE_QWEN25O, "qwen2.5o"},
{ PROJECTOR_TYPE_VOXTRAL, "voxtral"},
{ PROJECTOR_TYPE_MERALION, "meralion"},
{ PROJECTOR_TYPE_MUSIC_FLAMINGO, "musicflamingo"},
{ PROJECTOR_TYPE_LFM2, "lfm2"},
{ PROJECTOR_TYPE_KIMIVL, "kimivl"},
{ PROJECTOR_TYPE_PADDLEOCR, "paddleocr"},
{ PROJECTOR_TYPE_LIGHTONOCR,"lightonocr"},
{ PROJECTOR_TYPE_COGVLM, "cogvlm"},
{ PROJECTOR_TYPE_JANUS_PRO, "janus_pro"},
{ PROJECTOR_TYPE_DOTS_OCR, "dots_ocr"},
{ PROJECTOR_TYPE_DEEPSEEKOCR,"deepseekocr"},
{ PROJECTOR_TYPE_DEEPSEEKOCR2,"deepseekocr2"},
{ PROJECTOR_TYPE_LFM2A, "lfm2a"},
{ PROJECTOR_TYPE_GLM4V, "glm4v"},
{ PROJECTOR_TYPE_YOUTUVL, "youtuvl"},
{ PROJECTOR_TYPE_YASA2, "yasa2"},
{ PROJECTOR_TYPE_KIMIK25, "kimik25"},
{ PROJECTOR_TYPE_NEMOTRON_V2_VL, "nemotron_v2_vl"},
{ PROJECTOR_TYPE_EXAONE4_5, "exaone4_5"},
{ PROJECTOR_TYPE_HUNYUANVL, "hunyuanvl"},
{ PROJECTOR_TYPE_MINICPMV4_6, "minicpmv4_6"},
{ PROJECTOR_TYPE_GRANITE_SPEECH, "granite_speech"},
{ PROJECTOR_TYPE_MIMOVL, "mimovl"},
{ PROJECTOR_TYPE_GRANITE4_VISION, "granite4_vision"},
{ PROJECTOR_TYPE_MLP, "mlp" },
{ PROJECTOR_TYPE_LDP, "ldp" },
{ PROJECTOR_TYPE_LDPV2, "ldpv2"},
{ PROJECTOR_TYPE_MINICPMV, "resampler"},
{ PROJECTOR_TYPE_GLM_EDGE, "adapter"},
{ PROJECTOR_TYPE_QWEN2VL, "qwen2vl_merger"},
{ PROJECTOR_TYPE_QWEN25VL, "qwen2.5vl_merger"},
{ PROJECTOR_TYPE_QWEN3VL, "qwen3vl_merger"},
{ PROJECTOR_TYPE_STEP3VL, "step3vl"},
{ PROJECTOR_TYPE_GEMMA3, "gemma3"},
{ PROJECTOR_TYPE_GEMMA3NV, "gemma3nv"},
{ PROJECTOR_TYPE_GEMMA3NA, "gemma3na"},
{ PROJECTOR_TYPE_GEMMA4V, "gemma4v"},
{ PROJECTOR_TYPE_GEMMA4A, "gemma4a"},
{ PROJECTOR_TYPE_GEMMA4UV, "gemma4uv"},
{ PROJECTOR_TYPE_GEMMA4UA, "gemma4ua"},
{ PROJECTOR_TYPE_PHI4, "phi4"},
{ PROJECTOR_TYPE_IDEFICS3, "idefics3"},
{ PROJECTOR_TYPE_PIXTRAL, "pixtral"},
{ PROJECTOR_TYPE_ULTRAVOX, "ultravox"},
{ PROJECTOR_TYPE_INTERNVL, "internvl"},
{ PROJECTOR_TYPE_LLAMA4, "llama4"},
{ PROJECTOR_TYPE_QWEN2A, "qwen2a"},
{ PROJECTOR_TYPE_QWEN3A, "qwen3a"},
{ PROJECTOR_TYPE_GLMA, "glma"},
{ PROJECTOR_TYPE_QWEN25O, "qwen2.5o"},
{ PROJECTOR_TYPE_VOXTRAL, "voxtral"},
{ PROJECTOR_TYPE_MERALION, "meralion"},
{ PROJECTOR_TYPE_MUSIC_FLAMINGO, "musicflamingo"},
{ PROJECTOR_TYPE_LFM2, "lfm2"},
{ PROJECTOR_TYPE_KIMIVL, "kimivl"},
{ PROJECTOR_TYPE_PADDLEOCR, "paddleocr"},
{ PROJECTOR_TYPE_LIGHTONOCR, "lightonocr"},
{ PROJECTOR_TYPE_COGVLM, "cogvlm"},
{ PROJECTOR_TYPE_JANUS_PRO, "janus_pro"},
{ PROJECTOR_TYPE_DOTS_OCR, "dots_ocr"},
{ PROJECTOR_TYPE_DEEPSEEKOCR, "deepseekocr"},
{ PROJECTOR_TYPE_DEEPSEEKOCR2, "deepseekocr2"},
{ PROJECTOR_TYPE_LFM2A, "lfm2a"},
{ PROJECTOR_TYPE_GLM4V, "glm4v"},
{ PROJECTOR_TYPE_YOUTUVL, "youtuvl"},
{ PROJECTOR_TYPE_YASA2, "yasa2"},
{ PROJECTOR_TYPE_KIMIK25, "kimik25"},
{ PROJECTOR_TYPE_NEMOTRON_V2_VL, "nemotron_v2_vl"},
{ PROJECTOR_TYPE_EXAONE4_5, "exaone4_5"},
{ PROJECTOR_TYPE_HUNYUANVL, "hunyuanvl"},
{ PROJECTOR_TYPE_MINICPMV4_6, "minicpmv4_6"},
{ PROJECTOR_TYPE_GRANITE_SPEECH, "granite_speech"},
{ PROJECTOR_TYPE_MIMOVL, "mimovl"},
{ PROJECTOR_TYPE_GRANITE4_VISION, "granite4_vision"},
};
static projector_type clip_projector_type_from_string(const std::string & str) {
@@ -640,47 +640,18 @@ static void clip_log_internal(enum ggml_log_level level, const char * format, ..
// cpp wrappers
//
// wrapper for clip_image_size
struct clip_image_size_deleter {
void operator()(clip_image_size * val) { clip_image_size_free(val); }
};
typedef std::unique_ptr<clip_image_size, clip_image_size_deleter> clip_image_size_ptr;
// wrapper for clip_image_u8
struct clip_image_u8_deleter {
void operator()(clip_image_u8 * val) { clip_image_u8_free(val); }
};
typedef std::unique_ptr<clip_image_u8, clip_image_u8_deleter> clip_image_u8_ptr;
// wrapper for clip_image_f32
struct clip_image_f32_deleter {
void operator()(clip_image_f32 * val) { clip_image_f32_free(val); }
};
typedef std::unique_ptr<clip_image_f32, clip_image_f32_deleter> clip_image_f32_ptr;
struct clip_image_u8_batch {
std::vector<clip_image_u8_ptr> entries;
};
struct clip_image_f32_batch {
std::vector<clip_image_f32_ptr> entries;
std::vector<clip_image_f32> entries;
bool is_audio = false;
// for llava-uhd style models, we need to know the grid size
// note: entries.size() == grid_x * grid_y + 1 (one overview image)
int grid_x = 0;
int grid_y = 0;
clip_image_f32_batch clone() const {
clip_image_f32_batch new_batch{
/* entries */ {},
/* is_audio */ is_audio,
/* grid_x */ grid_x,
/* grid_y */ grid_y,
};
new_batch.entries.reserve(entries.size());
for (const auto & entry : entries) {
new_batch.entries.emplace_back(new clip_image_f32(*entry));
new_batch.entries.emplace_back(entry); // copy
}
return new_batch;
}
+24 -95
@@ -865,7 +865,7 @@ ggml_tensor * clip_graph::build_patch_merge_permute(ggml_tensor * cur, int scale
}
static std::unique_ptr<clip_graph> clip_get_graph_builder(clip_ctx * ctx, const clip_image_f32_batch & imgs) {
const clip_image_f32 & img = *imgs.entries[0];
const clip_image_f32 & img = imgs.entries[0];
std::unique_ptr<clip_graph> builder;
switch (ctx->proj_type()) {
@@ -2825,16 +2825,16 @@ struct clip_model_loader {
// create a fake batch
const auto & hparams = ctx_clip.model.hparams;
clip_image_f32_batch batch;
clip_image_f32_ptr img(clip_image_f32_init());
clip_image_f32 img;
if (ctx_clip.model.modality == CLIP_MODALITY_VISION) {
const int sz = hparams.warmup_image_size;
img->set_size({sz, sz}, false, false);
img.set_size({sz, sz}, false, false);
LOG_INF("%s: warmup with image size = %d x %d\n", __func__, sz, sz);
} else {
img->set_size({hparams.warmup_audio_size, hparams.n_mel_bins}, false, false);
img.set_size({hparams.warmup_audio_size, hparams.n_mel_bins}, false, false);
LOG_INF("%s: warmup with audio size = %d\n", __func__, hparams.warmup_audio_size);
}
batch.entries.push_back(std::move(img));
batch.entries.push_back(img);
return batch;
}
@@ -3124,64 +3124,6 @@ struct clip_cap clip_get_cap(const char * fname) {
return res;
}
struct clip_image_size * clip_image_size_init() {
struct clip_image_size * load_image_size = new struct clip_image_size();
load_image_size->width = 448;
load_image_size->height = 448;
return load_image_size;
}
struct clip_image_u8 * clip_image_u8_init() {
return new clip_image_u8();
}
struct clip_image_f32 * clip_image_f32_init() {
return new clip_image_f32();
}
struct clip_image_f32_batch * clip_image_f32_batch_init() {
return new clip_image_f32_batch();
}
void clip_image_size_free(struct clip_image_size * load_image_size) {
if (load_image_size == nullptr) {
return;
}
delete load_image_size;
}
void clip_image_u8_free(struct clip_image_u8 * img) { delete img; }
void clip_image_f32_free(struct clip_image_f32 * img) { delete img; }
void clip_image_u8_batch_free(struct clip_image_u8_batch * batch) { delete batch; }
void clip_image_f32_batch_free(struct clip_image_f32_batch * batch) { delete batch; }
size_t clip_image_f32_batch_n_images(const struct clip_image_f32_batch * batch) {
return batch->entries.size();
}
size_t clip_image_f32_batch_nx(const struct clip_image_f32_batch * batch, int idx) {
if (idx < 0 || idx >= (int)batch->entries.size()) {
LOG_ERR("%s: invalid index %d\n", __func__, idx);
return 0;
}
return batch->entries[idx]->nx();
}
size_t clip_image_f32_batch_ny(const struct clip_image_f32_batch * batch, int idx) {
if (idx < 0 || idx >= (int)batch->entries.size()) {
LOG_ERR("%s: invalid index %d\n", __func__, idx);
return 0;
}
return batch->entries[idx]->ny();
}
clip_image_f32 * clip_image_f32_get_img(const struct clip_image_f32_batch * batch, int idx) {
if (idx < 0 || idx >= (int)batch->entries.size()) {
LOG_ERR("%s: invalid index %d\n", __func__, idx);
return nullptr;
}
return batch->entries[idx].get();
}
void clip_free(clip_ctx * ctx) {
if (ctx == nullptr) {
return;
@@ -3189,23 +3131,11 @@ void clip_free(clip_ctx * ctx) {
delete ctx;
}
int32_t clip_get_image_size(const struct clip_ctx * ctx) {
return ctx->model.hparams.image_size;
}
int32_t clip_get_patch_size(const struct clip_ctx * ctx) {
return ctx->model.hparams.patch_size;
}
int32_t clip_get_hidden_size(const struct clip_ctx * ctx) {
return ctx->model.hparams.n_embd;
}
const char * clip_patch_merge_type(const struct clip_ctx * ctx) {
return ctx->model.hparams.mm_patch_merge_type == PATCH_MERGE_SPATIAL_UNPAD ? "spatial_unpad" : "flat";
}
int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 * img) {
int clip_n_output_tokens_x(const clip_ctx * ctx, const clip_image_f32 * img) {
const auto & params = ctx->model.hparams;
const int n_total = clip_n_output_tokens(ctx, img);
const auto & proj = ctx->proj_type();
@@ -3228,7 +3158,7 @@ int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 *
return n_total;
}
int clip_n_output_tokens_y(const struct clip_ctx * ctx, struct clip_image_f32 * img) {
int clip_n_output_tokens_y(const clip_ctx * ctx, const clip_image_f32 * img) {
const auto & params = ctx->model.hparams;
const auto & proj = ctx->proj_type();
switch (proj) {
@@ -3250,7 +3180,7 @@ int clip_n_output_tokens_y(const struct clip_ctx * ctx, struct clip_image_f32 *
return 1;
}
int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img) {
int clip_n_output_tokens(const clip_ctx * ctx, const clip_image_f32 * img) {
const auto & params = ctx->model.hparams;
// for models with fixed size image, the input image is already pre-processed and resized to square
@@ -3500,16 +3430,15 @@ int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * im
return n_patches;
}
bool clip_image_encode(struct clip_ctx * ctx, const int n_threads, clip_image_f32 * img, std::vector<float> & out_vec) {
bool clip_image_encode(struct clip_ctx * ctx, int n_threads, const clip_image_f32 * img, std::vector<float> & out_vec) {
clip_image_f32_batch imgs;
clip_image_f32_ptr img_copy(clip_image_f32_init());
*img_copy = *img;
clip_image_f32 img_copy = *img;
imgs.entries.push_back(std::move(img_copy));
return clip_image_batch_encode(ctx, n_threads, &imgs, out_vec);
}
bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_image_f32_batch * imgs_c_ptr, std::vector<float> & out_batch_embd) {
bool clip_image_batch_encode(clip_ctx * ctx, int n_threads, const clip_image_f32_batch * imgs_c_ptr, std::vector<float> & out_batch_embd) {
const clip_image_f32_batch & imgs = *imgs_c_ptr;
int n_batch_cur = imgs.entries.size();
@@ -3533,8 +3462,8 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
const auto & model = ctx->model;
const auto & hparams = model.hparams;
const int image_size_width = imgs.entries[0]->nx();
const int image_size_height = imgs.entries[0]->ny();
const int image_size_width = imgs.entries[0].nx();
const int image_size_height = imgs.entries[0].ny();
const int patch_size = hparams.patch_size;
const int num_patches = ((image_size_width / patch_size) * (image_size_height / patch_size));
@@ -3572,7 +3501,7 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
if (!imgs.is_audio) {
size_t nelem = 0;
for (const auto & img : imgs.entries) {
nelem += img->nx() * img->ny() * 3;
nelem += img.nx() * img.ny() * 3;
}
std::vector<float> inp_raw(nelem);
@@ -3590,13 +3519,13 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
// IMPORTANT: [QWEN_VIDEO] the batch dim is currently used for temporal dim in Qwen-VL models
// All entries must have the same spatial size (enforced by can_batch_with() during merging)
{
const int nx = imgs.entries[0]->nx();
const int ny = imgs.entries[0]->ny();
const int nx = imgs.entries[0].nx();
const int ny = imgs.entries[0].ny();
const int n = nx * ny;
for (int b = 0; b < n_batch_cur; b++) {
LOG_DBG("%s: copying image %d/%d to input buffer (nx=%d, ny=%d)\n", __func__, b+1, n_batch_cur, nx, ny);
const auto & buf = imgs.entries[b]->get_ro_buf();
const auto & buf = imgs.entries[b].get_ro_buf();
float * batch_entry = inp_raw.data() + b * (3*n);
for (int y = 0; y < ny; y++) {
for (int x = 0; x < nx; x++) {
@@ -3616,9 +3545,9 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
GGML_ASSERT(imgs.entries.size() == 1);
const auto & mel_inp = imgs.entries[0];
const auto & buf = mel_inp->get_ro_buf();
const int n_step = mel_inp->nx();
const int n_mel = mel_inp->ny();
const auto & buf = mel_inp.get_ro_buf();
const int n_step = mel_inp.nx();
const int n_mel = mel_inp.ny();
GGML_ASSERT((size_t)n_step * n_mel == buf.size());
set_input_f32("inp_raw", buf);
@@ -4232,7 +4161,7 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
GGML_ASSERT(imgs.entries.size() == 1);
const auto & img0 = imgs.entries.front();
// Compute n_pos matching SSCP output: two stride-2 convs
int n_pos = img0->nx();
int n_pos = img0.nx();
for (int i = 0; i < 2; i++) { n_pos = (n_pos - 1) / 2 + 1; }
// Chunked local attention: blocked causal mask and RPE
@@ -4280,7 +4209,7 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
case PROJECTOR_TYPE_LFM2A:
{
GGML_ASSERT(imgs.entries.size() == 1);
const auto n_frames = clip_n_output_tokens(ctx, imgs.entries.front().get());
const auto n_frames = clip_n_output_tokens(ctx, &imgs.entries.front());
auto d_model = 512;
auto seq_len = n_frames * 2 - 1;
@@ -4338,7 +4267,7 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
// reshapes as ggml_get_rows gathers. The names are set
// by g4v_gather() in models/granite4-vision.cpp.
const int patch_size = model.hparams.patch_size;
const int image_side = imgs.entries.front()->nx() / patch_size;
const int image_side = imgs.entries.front().nx() / patch_size;
const int window_side = hparams.downsample_window_side;
const int query_side = hparams.downsample_query_side;
const int n = image_side / window_side;
@@ -4432,7 +4361,7 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
// sanity check (assuming that all images in batch have the same number of tokens, so we only check the first one)
const int n_tokens_out = embeddings->ne[1];
const int expected_n_tokens_out = clip_n_output_tokens(ctx, imgs.entries[0].get());
const int expected_n_tokens_out = clip_n_output_tokens(ctx, &imgs.entries[0]);
if (n_tokens_out != expected_n_tokens_out) {
LOG_ERR("%s: expected output %d tokens, got %d\n", __func__, expected_n_tokens_out, n_tokens_out);
GGML_ABORT("Invalid number of output tokens");
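The hunks above switch `clip_image_f32_batch::entries` from `std::vector<clip_image_f32_ptr>` to `std::vector<clip_image_f32>`, which is why every `->nx()` becomes `.nx()` and the hand-written copy in `clone()` reduces to a plain element copy. The following is a minimal sketch of that ownership change, not the project's code: `image_f32`, `batch_ptr`, `batch_val` and `count_elems` are hypothetical stand-ins introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for clip_image_f32 (the real type has more members).
struct image_f32 {
    int nx = 0;
    int ny = 0;
    std::vector<float> buf;
};

// Before: the batch owns heap-allocated images, so element access needs -> or *.
struct batch_ptr {
    std::vector<std::unique_ptr<image_f32>> entries;

    // unique_ptr is move-only, so a deep copy has to be written by hand.
    batch_ptr clone() const {
        batch_ptr out;
        out.entries.reserve(entries.size());
        for (const auto & e : entries) {
            out.entries.push_back(std::make_unique<image_f32>(*e));
        }
        return out;
    }
};

// After: the batch stores images by value. The implicit copy constructor
// already performs a deep copy, and element access uses '.'.
struct batch_val {
    std::vector<image_f32> entries;
};

// Number of floats needed for a 3-channel input buffer covering all entries.
std::size_t count_elems(const batch_val & b) {
    std::size_t n = 0;
    for (const auto & img : b.entries) {
        n += static_cast<std::size_t>(img.nx) * img.ny * 3;
    }
    return n;
}
```

With the value-based layout, copying a batch is an ordinary assignment and no init/free helper pair is needed, which matches the removal of the `clip_image_*_init` and `clip_image_*_free` functions in this diff.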
+5 -26
@@ -29,7 +29,6 @@ struct clip_image_size {
};
struct clip_image_f32;
struct clip_image_u8_batch;
struct clip_image_f32_batch;
enum clip_modality {
@@ -63,41 +62,21 @@ struct clip_init_result clip_init(const char * fname, struct clip_context_params
void clip_free(struct clip_ctx * ctx);
int32_t clip_get_image_size (const struct clip_ctx * ctx);
int32_t clip_get_patch_size (const struct clip_ctx * ctx);
int32_t clip_get_hidden_size(const struct clip_ctx * ctx);
// TODO: should be enum, not string
const char * clip_patch_merge_type(const struct clip_ctx * ctx);
int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_output_tokens(const clip_ctx * ctx, const clip_image_f32 * img);
// for M-RoPE, this will be the number of token positions in X and Y directions
// for other models, X will be the total number of tokens and Y will be 1
int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_output_tokens_y(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_output_tokens_x(const clip_ctx * ctx, const clip_image_f32 * img);
int clip_n_output_tokens_y(const clip_ctx * ctx, const clip_image_f32 * img);
// this should be equal to the embedding dimension of the text model
int clip_n_mmproj_embd(const struct clip_ctx * ctx);
struct clip_image_size * clip_image_size_init(void);
struct clip_image_u8 * clip_image_u8_init (void);
struct clip_image_f32 * clip_image_f32_init(void);
struct clip_image_f32_batch * clip_image_f32_batch_init(void); // only used by libllava
void clip_image_size_free (struct clip_image_size * img_size);
void clip_image_u8_free (struct clip_image_u8 * img);
void clip_image_f32_free(struct clip_image_f32 * img);
void clip_image_u8_batch_free (struct clip_image_u8_batch * batch);
void clip_image_f32_batch_free(struct clip_image_f32_batch * batch);
// use for accessing underlay data of clip_image_f32_batch
size_t clip_image_f32_batch_n_images(const struct clip_image_f32_batch * batch); // equivalent to batch->size()
size_t clip_image_f32_batch_nx(const struct clip_image_f32_batch * batch, int idx); // equivalent to batch[idx]->nx
size_t clip_image_f32_batch_ny(const struct clip_image_f32_batch * batch, int idx); // equivalent to batch[idx]->ny
struct clip_image_f32 * clip_image_f32_get_img(const struct clip_image_f32_batch * batch, int idx); // equivalent to batch[idx]->data
bool clip_image_encode (struct clip_ctx * ctx, int n_threads, struct clip_image_f32 * img, std::vector<float> & out_vec);
// TODO: remove clip_image_encode() and always use batched version
bool clip_image_encode (struct clip_ctx * ctx, int n_threads, const clip_image_f32 * img, std::vector<float> & out_vec);
bool clip_image_batch_encode(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, std::vector<float> & out_batch_embd);
bool clip_is_llava(const struct clip_ctx * ctx);
+111 -111
@@ -4,17 +4,33 @@
#include <cmath>
#include <vector>
//
// base implementation
//
void mtmd_image_preprocessor::img_u8_to_f32(const clip_image_u8 & src, clip_image_f32 & dst, const float mean[3], const float std[3]) {
dst.from_u8(src);
dst.normalize(mean, std);
void mtmd_image_preproc_out::append(const clip_hparams & hparams, const clip_image_u8 & img, bool normalized) {
clip_image_f32 dst;
dst.from_u8(img);
if (normalized) {
dst.normalize(hparams.image_mean, hparams.image_std);
}
entries.push_back(std::move(dst));
}
void mtmd_image_preprocessor::img_u8_to_f32(const clip_image_u8 & src, clip_image_f32 & dst) {
dst.from_u8(src);
void mtmd_image_preproc_out::append(const clip_hparams & hparams, const std::vector<clip_image_u8> & imgs, bool normalized) {
for (const auto & img : imgs) {
append(hparams, img, normalized);
}
}
void mtmd_image_preproc_out::append(const clip_hparams & hparams, clip_image_f32 & img, bool normalized) {
if (normalized) {
img.normalize(hparams.image_mean, hparams.image_std);
}
entries.push_back(std::move(img));
}
void mtmd_image_preproc_out::append_overview(const clip_hparams & hparams, const clip_image_u8 & img, bool normalized) {
overview.from_u8(img);
if (normalized) {
overview.normalize(hparams.image_mean, hparams.image_std);
}
}
// set of tools to manipulate images
@@ -595,21 +611,18 @@ private:
// mtmd_image_preprocessor_llava_uhd
//
bool mtmd_image_preprocessor_llava_uhd::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
mtmd_image_preproc_out mtmd_image_preprocessor_llava_uhd::preprocess(const clip_image_u8 & img) {
const clip_image_size original_size = img.get_size();
auto const inst = get_slice_instructions(original_size);
std::vector<clip_image_u8_ptr> imgs = slice_image(img, inst);
for (size_t i = 0; i < imgs.size(); ++i) {
// clip_image_save_to_bmp(*imgs[i], "slice_" + std::to_string(i) + ".bmp");
clip_image_f32_ptr res(clip_image_f32_init());
img_u8_to_f32(*imgs[i], *res, hparams.image_mean, hparams.image_std);
output.entries.push_back(std::move(res));
}
auto sliced = slice_image(img, inst);
mtmd_image_preproc_out output;
output.append_overview(hparams, sliced.overview, true);
output.append(hparams, sliced.slices, true);
output.grid_x = inst.grid_size.width;
output.grid_y = inst.grid_size.height;
return true;
return output;
}
mtmd_image_preprocessor_llava_uhd::slice_instructions mtmd_image_preprocessor_llava_uhd::get_slice_instructions(const clip_image_size & original_size) {
@@ -717,28 +730,21 @@ mtmd_image_preprocessor_llava_uhd::slice_instructions mtmd_image_preprocessor_ll
return res;
}
std::vector<clip_image_u8_ptr> mtmd_image_preprocessor_llava_uhd::slice_image(const clip_image_u8 & img, const mtmd_image_preprocessor_llava_uhd::slice_instructions & inst, bool overview_first) {
std::vector<clip_image_u8_ptr> output;
mtmd_image_preprocessor_llava_uhd::slice_output mtmd_image_preprocessor_llava_uhd::slice_image(const clip_image_u8 & img, const mtmd_image_preprocessor_llava_uhd::slice_instructions & inst) {
slice_output output;
// resize to overview size
clip_image_u8_ptr resized_img(clip_image_u8_init());
img_tool::resize(img, *resized_img, inst.overview_size, hparams.image_resize_algo_ov,
img_tool::resize(img, output.overview, inst.overview_size, hparams.image_resize_algo_ov,
hparams.image_pad_ov, hparams.image_pad_color_ov);
if (overview_first) {
output.push_back(std::move(resized_img));
}
if (inst.slices.empty()) {
// no slices, just return the resized image
if (!overview_first) {
output.push_back(std::move(resized_img));
}
// no slices, just return the overview image
return output;
}
// resize to refined size
clip_image_u8_ptr refined_img(clip_image_u8_init());
img_tool::resize(img, *refined_img, inst.refined_size, hparams.image_resize_algo_rf,
clip_image_u8 refined_img;
img_tool::resize(img, refined_img, inst.refined_size, hparams.image_resize_algo_rf,
hparams.image_pad_rf, hparams.image_pad_color_rf);
// create slices
@@ -748,13 +754,9 @@ std::vector<clip_image_u8_ptr> mtmd_image_preprocessor_llava_uhd::slice_image(co
int w = slice.size.width;
int h = slice.size.height;
clip_image_u8_ptr img_slice(clip_image_u8_init());
img_tool::crop(*refined_img, *img_slice, x, y, w, h);
output.push_back(std::move(img_slice));
}
if (!overview_first) {
output.push_back(std::move(resized_img));
clip_image_u8 img_slice;
img_tool::crop(refined_img, img_slice, x, y, w, h);
output.slices.push_back(std::move(img_slice));
}
return output;
@@ -871,24 +873,23 @@ clip_image_size mtmd_image_preprocessor_llava_uhd::get_best_grid(const int max_s
// mtmd_image_preprocessor_fixed_size
//
bool mtmd_image_preprocessor_fixed_size::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
mtmd_image_preproc_out mtmd_image_preprocessor_fixed_size::preprocess(const clip_image_u8 & img) {
clip_image_u8 resized_image;
int sz = hparams.image_size;
img_tool::resize(img, resized_image, {sz, sz},
hparams.image_resize_algo,
hparams.image_resize_pad,
hparams.image_pad_color);
clip_image_f32_ptr img_f32(clip_image_f32_init());
img_u8_to_f32(resized_image, *img_f32, hparams.image_mean, hparams.image_std);
output.entries.push_back(std::move(img_f32));
return true;
mtmd_image_preproc_out output;
output.append(hparams, resized_image, true);
return output;
}
//
// mtmd_image_preprocessor_dyn_size
//
bool mtmd_image_preprocessor_dyn_size::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
mtmd_image_preproc_out mtmd_image_preprocessor_dyn_size::preprocess(const clip_image_u8 & img) {
GGML_ASSERT(hparams.image_min_pixels > 0 && hparams.image_max_pixels > 0);
clip_image_u8 resized_image;
const clip_image_size original_size = img.get_size();
@@ -903,17 +904,16 @@ bool mtmd_image_preprocessor_dyn_size::preprocess(const clip_image_u8 & img, cli
hparams.image_resize_algo,
hparams.image_resize_pad,
hparams.image_pad_color);
clip_image_f32_ptr img_f32(clip_image_f32_init());
img_u8_to_f32(resized_image, *img_f32, hparams.image_mean, hparams.image_std);
output.entries.push_back(std::move(img_f32));
return true;
mtmd_image_preproc_out output;
output.append(hparams, resized_image, true);
return output;
}
//
// mtmd_image_preprocessor_longest_edge
//
bool mtmd_image_preprocessor_longest_edge::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
mtmd_image_preproc_out mtmd_image_preprocessor_longest_edge::preprocess(const clip_image_u8 & img) {
GGML_ASSERT(hparams.image_longest_edge > 0);
clip_image_u8 resized_image;
const clip_image_size original_size = img.get_size();
@@ -927,10 +927,9 @@ bool mtmd_image_preprocessor_longest_edge::preprocess(const clip_image_u8 & img,
hparams.image_resize_algo,
hparams.image_resize_pad,
hparams.image_pad_color);
clip_image_f32_ptr img_f32(clip_image_f32_init());
img_u8_to_f32(resized_image, *img_f32, hparams.image_mean, hparams.image_std);
output.entries.push_back(std::move(img_f32));
return true;
mtmd_image_preproc_out output;
output.append(hparams, resized_image, true);
return output;
}
//
@@ -1040,7 +1039,7 @@ clip_image_size mtmd_image_preprocessor_lfm2::get_grid_layout(int height, int wi
// mtmd_image_preprocessor_idefics3
//
bool mtmd_image_preprocessor_idefics3::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
mtmd_image_preproc_out mtmd_image_preprocessor_idefics3::preprocess(const clip_image_u8 & img) {
// The refined size has two steps:
// 1. Resize w/ aspect-ratio preserving such that the longer side is
// the preprocessor longest size
@@ -1075,44 +1074,40 @@ bool mtmd_image_preprocessor_idefics3::preprocess(const clip_image_u8 & img, cli
});
}
}
auto imgs = slice_image(img, instructions);
// cast and normalize to f32
for (size_t i = 0; i < imgs.size(); ++i) {
// clip_image_save_to_bmp(*imgs[i], "slice_" + std::to_string(i) + ".bmp");
clip_image_f32_ptr res(clip_image_f32_init());
img_u8_to_f32(*imgs[i], *res, hparams.image_mean, hparams.image_std);
output.entries.push_back(std::move(res));
}
auto sliced = slice_image(img, instructions);
mtmd_image_preproc_out output;
output.append_overview(hparams, sliced.overview, true);
output.append(hparams, sliced.slices, true);
output.grid_x = instructions.grid_size.width;
output.grid_y = instructions.grid_size.height;
return true;
return output;
}
//
// mtmd_image_preprocessor_internvl
//
bool mtmd_image_preprocessor_internvl::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
mtmd_image_preproc_out mtmd_image_preprocessor_internvl::preprocess(const clip_image_u8 & img) {
GGML_ASSERT(!hparams.image_res_candidates.empty());
const clip_image_size original_size = img.get_size();
auto const inst = get_slice_instructions(original_size);
std::vector<clip_image_u8_ptr> imgs = slice_image(img, inst, false);
auto sliced = slice_image(img, inst);
for (size_t i = 0; i < imgs.size(); ++i) {
clip_image_f32_ptr res(clip_image_f32_init());
img_u8_to_f32(*imgs[i], *res, hparams.image_mean, hparams.image_std);
output.entries.push_back(std::move(res));
}
return true;
mtmd_image_preproc_out output;
// InternVL: slices first, then overview
output.append(hparams, sliced.slices, true);
output.append_overview(hparams, sliced.overview, true);
output.grid_x = inst.grid_size.width;
output.grid_y = inst.grid_size.height;
return output;
}
//
// mtmd_image_preprocessor_deepseekocr
//
bool mtmd_image_preprocessor_deepseekocr::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
mtmd_image_preproc_out mtmd_image_preprocessor_deepseekocr::preprocess(const clip_image_u8 & img) {
static constexpr int native_resolutions[] = { 1024 /* base */, 1280 /* large */ };
// TODO: support 512 (tiny) and 640 (small) once we have eval data for them
@@ -1135,14 +1130,12 @@ bool mtmd_image_preprocessor_deepseekocr::preprocess(const clip_image_u8 & img,
clip_image_u8 padded;
img_tool::resize(img, padded, {image_size, image_size}, RESIZE_ALGO_BICUBIC_PILLOW,
PAD_NEAREST, hparams.image_pad_color);
clip_image_f32_ptr res(clip_image_f32_init());
img_u8_to_f32(padded, *res, hparams.image_mean, hparams.image_std);
output.entries.push_back(std::move(res));
output.grid_x = 1;
output.grid_y = 1;
return true;
mtmd_image_preproc_out output;
output.append_overview(hparams, padded, true);
output.grid_x = 0;
output.grid_y = 0;
// TODO @ngxson : support slicing for DeepSeek-OCR, to do in another PR
return output;
}
//
@@ -1205,10 +1198,11 @@ clip_image_size mtmd_image_preprocessor_deepseekocr2::find_closest_aspect_ratio(
return best_ratio;
}
bool mtmd_image_preprocessor_deepseekocr2::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
mtmd_image_preproc_out mtmd_image_preprocessor_deepseekocr2::preprocess(const clip_image_u8 & img) {
// emit 768x768 local tiles when the image is larger than a tile in either
// dimension, then always a 1024x1024 global view. order: [tiles..., global].
mtmd_image_preproc_out output;
const auto img_size = img.get_size();
if (img_size.width > tile_size || img_size.height > tile_size) {
const float aspect_ratio = static_cast<float>(img_size.width) / img_size.height;
@@ -1224,9 +1218,7 @@ bool mtmd_image_preprocessor_deepseekocr2::preprocess(const clip_image_u8 & img,
for (int col = 0; col < grid.width; col++) {
clip_image_u8 tile;
img_tool::crop(refined, tile, col * tile_size, row * tile_size, tile_size, tile_size);
clip_image_f32_ptr res(clip_image_f32_init());
img_u8_to_f32(tile, *res, hparams.image_mean, hparams.image_std);
output.entries.push_back(std::move(res));
output.append(hparams, tile, true);
}
}
}
@@ -1235,14 +1227,9 @@ bool mtmd_image_preprocessor_deepseekocr2::preprocess(const clip_image_u8 & img,
clip_image_u8 padded;
img_tool::resize(img, padded, { base_size, base_size }, RESIZE_ALGO_BICUBIC_PILLOW,
PAD_NEAREST, hparams.image_pad_color);
clip_image_f32_ptr global(clip_image_f32_init());
img_u8_to_f32(padded, *global, hparams.image_mean, hparams.image_std);
global->add_viewsep = true;
output.entries.push_back(std::move(global));
output.grid_x = 1;
output.grid_y = 1;
return true;
output.append_overview(hparams, padded, true);
output.overview.add_viewsep = true;
return output;
}
//
@@ -1258,7 +1245,8 @@ void mtmd_image_preprocessor_step3vl::img_u8_resize_bilinear_to_f32(
const float std[3]) {
const auto src_size = src.get_size();
if (src_size.width == target_width && src_size.height == target_height) {
img_u8_to_f32(src, dst, mean, std);
dst.from_u8(src);
dst.normalize(mean, std);
return;
}
@@ -1453,24 +1441,24 @@ mtmd_image_preprocessor_step3vl::slice_instructions mtmd_image_preprocessor_step
return instructions;
}
bool mtmd_image_preprocessor_step3vl::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
mtmd_image_preproc_out mtmd_image_preprocessor_step3vl::preprocess(const clip_image_u8 & img) {
clip_image_u8 prepared = prepare_image(img, hparams);
const auto instructions = build_slice_instructions(hparams, prepared.get_size());
clip_image_f32_ptr overview_f32(clip_image_f32_init());
mtmd_image_preproc_out output;
// overview (normalized f32, already includes mean/std)
img_u8_resize_bilinear_to_f32(
prepared,
*overview_f32,
output.overview,
hparams.image_size,
hparams.image_size,
hparams.image_mean,
hparams.image_std);
output.entries.push_back(std::move(overview_f32));
if (instructions.slices.empty()) {
output.grid_x = 0;
output.grid_y = 0;
return true;
return output;
}
clip_image_u8 img_for_crop = prepared;
@@ -1486,28 +1474,28 @@ bool mtmd_image_preprocessor_step3vl::preprocess(const clip_image_u8 & img, clip
// If the requested patch extends past the source image, pad the out-of-bounds area with black.
clip_image_u8 patch = crop_with_black_padding(img_for_crop, slice.x, slice.y, slice.size.width, slice.size.height);
clip_image_f32_ptr patch_f32(clip_image_f32_init());
clip_image_f32 patch_f32;
img_u8_resize_bilinear_to_f32(
patch,
*patch_f32,
patch_f32,
crop_size,
crop_size,
hparams.image_mean,
hparams.image_std);
output.entries.push_back(std::move(patch_f32));
output.append(hparams, patch_f32, false);
}
output.grid_x = instructions.grid_size.width;
output.grid_y = instructions.grid_size.height;
return true;
return output;
}
//
// mtmd_image_preprocessor_youtuvl
//
bool mtmd_image_preprocessor_youtuvl::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
mtmd_image_preproc_out mtmd_image_preprocessor_youtuvl::preprocess(const clip_image_u8 & img) {
const int patch_size = hparams.patch_size; // typically 16
const int merge_size = hparams.n_merge; // typically 2
const int align_size = patch_size * merge_size; // 32
@@ -1551,10 +1539,22 @@ bool mtmd_image_preprocessor_youtuvl::preprocess(const clip_image_u8 & img, clip
clip_image_u8 resized;
img_tool::resize(img, resized, new_size, hparams.image_resize_algo, hparams.image_resize_pad);
// Normalize to float32
clip_image_f32_ptr img_f32(clip_image_f32_init());
img_u8_to_f32(resized, *img_f32, hparams.image_mean, hparams.image_std);
// Add to results
output.entries.push_back(std::move(img_f32));
return true;
mtmd_image_preproc_out output;
output.append(hparams, resized, true);
return output;
}
mtmd_image_preproc_out mtmd_image_preprocessor_granite::preprocess(const clip_image_u8 & img) {
auto output = mtmd_image_preprocessor_llava_uhd::preprocess(img);
if (output.entries.size() == 0) {
// Single-tile (overview only): append one newline row.
output.overview.add_newline = true;
} else {
// Multi-tile: overview gets no newline, grid tiles get one.
output.overview.add_newline = false;
for (size_t i = 0; i < output.entries.size(); ++i) {
output.entries[i].add_newline = true;
}
}
return output;
}
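In the changes above, `slice_image()` no longer takes an `overview_first` flag. The overview image is returned in its own field, and the header note says the ordering is chosen through `ov_img_first` on the `mtmd_context`. The sketch below shows one way a consumer could flatten such an output into encoder order. It is an assumption-laden illustration: `image`, `preproc_out` and `flatten` are hypothetical names, and the real consumer code is not part of this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for clip_image_f32; 'tag' exists only to make order visible.
struct image {
    int nx = 0;
    int ny = 0;
    std::string tag;
};

// Hypothetical stand-in mirroring the fields of mtmd_image_preproc_out.
struct preproc_out {
    std::vector<image> entries; // grid slices
    image overview;             // downscaled full image; left empty when unused
    int grid_x = 0;
    int grid_y = 0;

    bool has_overview() const {
        return overview.nx > 0 || overview.ny > 0;
    }
};

// Flatten into a single sequence. 'ov_first' is a caller-side choice here
// (assumption: it plays the role of ov_img_first on the context).
std::vector<image> flatten(const preproc_out & out, bool ov_first) {
    std::vector<image> seq;
    seq.reserve(out.entries.size() + 1);
    if (ov_first && out.has_overview()) {
        seq.push_back(out.overview);
    }
    seq.insert(seq.end(), out.entries.begin(), out.entries.end());
    if (!ov_first && out.has_overview()) {
        seq.push_back(out.overview);
    }
    return seq;
}
```

Keeping the overview out of `entries` means a preprocessor only has to say what the overview is, and the ordering decision lives in one place rather than in each `preprocess()` override.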
+42 -15
@@ -8,6 +8,24 @@
#define MTMD_INTERNAL_HEADER
struct mtmd_image_preproc_out {
std::vector<clip_image_f32> entries;
// grid size is required for llava-uhd style models
clip_image_f32 overview; // overview image (downscaled image)
int grid_x = 0;
int grid_y = 0;
void append(const clip_hparams & hparams, const clip_image_u8 & img, bool normalized = true);
void append(const clip_hparams & hparams, const std::vector<clip_image_u8> & imgs, bool normalized = true);
void append(const clip_hparams & hparams, clip_image_f32 & img, bool normalized = true);
void append_overview(const clip_hparams & hparams, const clip_image_u8 & img, bool normalized = true);
bool has_overview() const {
return overview.nx() > 0 || overview.ny() > 0;
}
};
// base class, models must inherit from this class
struct mtmd_image_preprocessor {
const clip_hparams & hparams;
@@ -15,10 +33,7 @@ struct mtmd_image_preprocessor {
mtmd_image_preprocessor(const clip_ctx * ctx): hparams(*clip_get_hparams(ctx)) {}
virtual ~mtmd_image_preprocessor() = default;
virtual bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) = 0;
void img_u8_to_f32(const clip_image_u8 & src, clip_image_f32 & dst, const float mean[3], const float std[3]);
void img_u8_to_f32(const clip_image_u8 & src, clip_image_f32 & dst);
virtual mtmd_image_preproc_out preprocess(const clip_image_u8 & img) = 0;
};
/**
@@ -39,10 +54,12 @@ struct mtmd_image_preprocessor {
* [overview] --> [slice 1] --> [slice 2]
* | |
* +--> [slice 3] --> [slice 4]
*
* NOTE: for the ordering of overview, set "ov_img_first" on the mtmd_context
*/
struct mtmd_image_preprocessor_llava_uhd : mtmd_image_preprocessor {
mtmd_image_preprocessor_llava_uhd(const clip_ctx * ctx) : mtmd_image_preprocessor(ctx) {}
bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) override;
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
struct slice_coordinates {
int x;
@@ -60,7 +77,11 @@ struct mtmd_image_preprocessor_llava_uhd : mtmd_image_preprocessor {
// LFM2 override this function to implement its custom slicing logic
virtual slice_instructions get_slice_instructions(const clip_image_size & original_size);
std::vector<clip_image_u8_ptr> slice_image(const clip_image_u8 & img, const slice_instructions & inst, bool overview_first = true);
struct slice_output {
clip_image_u8 overview;
std::vector<clip_image_u8> slices;
};
slice_output slice_image(const clip_image_u8 & img, const slice_instructions & inst);
private:
clip_image_size get_best_resize(const clip_image_size & original_size, int scale_resolution, int patch_size, bool allow_upscale = false);
@@ -91,7 +112,7 @@ private:
// downscale or upscale the input image to fixed size
struct mtmd_image_preprocessor_fixed_size : mtmd_image_preprocessor {
mtmd_image_preprocessor_fixed_size(const clip_ctx * ctx) : mtmd_image_preprocessor(ctx) {}
bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) override;
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
};
// resize image to multiple of patch_size*n_merge, while preserving aspect ratio
@@ -99,13 +120,13 @@ struct mtmd_image_preprocessor_fixed_size : mtmd_image_preprocessor {
// this is used by models with native support for dynamic image size, for example: Qwen-VL, Pixtral, Kimi-VL, etc
struct mtmd_image_preprocessor_dyn_size : mtmd_image_preprocessor {
mtmd_image_preprocessor_dyn_size(const clip_ctx * ctx) : mtmd_image_preprocessor(ctx) {}
bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) override;
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
};
// similar to mtmd_image_preprocessor_dyn_size, but resize the image to have longest edge equal to hparams.image_longest_edge, while preserving aspect ratio
struct mtmd_image_preprocessor_longest_edge : mtmd_image_preprocessor {
mtmd_image_preprocessor_longest_edge(const clip_ctx * ctx) : mtmd_image_preprocessor(ctx) {}
bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) override;
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
};
// custom llava-uhd slicing logic for LFM2
@@ -131,17 +152,17 @@ private:
struct mtmd_image_preprocessor_idefics3 : mtmd_image_preprocessor_llava_uhd {
mtmd_image_preprocessor_idefics3(const clip_ctx * ctx) : mtmd_image_preprocessor_llava_uhd(ctx) {}
bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) override;
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
};
struct mtmd_image_preprocessor_internvl : mtmd_image_preprocessor_llava_uhd {
mtmd_image_preprocessor_internvl(const clip_ctx * ctx) : mtmd_image_preprocessor_llava_uhd(ctx) {}
bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) override;
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
};
struct mtmd_image_preprocessor_deepseekocr : mtmd_image_preprocessor {
mtmd_image_preprocessor_deepseekocr(const clip_ctx * ctx) : mtmd_image_preprocessor(ctx) {}
bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) override;
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
};
// DeepSeek-OCR-2: a 1024x1024 global view, plus InternVL-style 768x768 local
@@ -153,7 +174,7 @@ struct mtmd_image_preprocessor_deepseekocr2 : mtmd_image_preprocessor {
static constexpr int max_tiles = 6;
mtmd_image_preprocessor_deepseekocr2(const clip_ctx * ctx) : mtmd_image_preprocessor(ctx) {}
bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) override;
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
private:
static std::vector<clip_image_size> get_target_ratios();
@@ -168,7 +189,7 @@ private:
// ref: https://huggingface.co/stepfun-ai/Step3-VL-10B/blob/main/processing_step3.py
struct mtmd_image_preprocessor_step3vl : mtmd_image_preprocessor_llava_uhd {
mtmd_image_preprocessor_step3vl(const clip_ctx * ctx) : mtmd_image_preprocessor_llava_uhd(ctx) {}
bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) override;
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
static slice_instructions build_slice_instructions(const clip_hparams & params, const clip_image_size & prepared_size);
private:
@@ -195,5 +216,11 @@ private:
struct mtmd_image_preprocessor_youtuvl : mtmd_image_preprocessor {
mtmd_image_preprocessor_youtuvl(const clip_ctx * ctx) : mtmd_image_preprocessor(ctx) {}
bool preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) override;
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
};
// similar to llava_uhd, but has add_newline
struct mtmd_image_preprocessor_granite : mtmd_image_preprocessor_llava_uhd {
mtmd_image_preprocessor_granite(const clip_ctx * ctx) : mtmd_image_preprocessor_llava_uhd(ctx) {}
mtmd_image_preproc_out preprocess(const clip_image_u8 & img) override;
};
+110 -110
@@ -114,7 +114,7 @@ struct mtmd_image_tokens {
// true if one of entries in batch_f32 is a placeholder
bool is_placeholder() const {
for (const auto & entry : batch_f32.entries) {
if (entry->is_placeholder()) {
if (entry.is_placeholder()) {
return true;
}
}
@@ -147,7 +147,7 @@ struct mtmd_audio_tokens {
// true if one of entries in batch_f32 is a placeholder
bool is_placeholder() const {
for (const auto & entry : batch_f32.entries) {
if (entry->is_placeholder()) {
if (entry.is_placeholder()) {
return true;
}
}
@@ -516,6 +516,7 @@ struct mtmd_context {
LOG_WRN("%s: llama 4 vision is known to have degraded quality:\n"
" https://github.com/ggml-org/llama.cpp/pull/13282\n", __func__);
image_preproc = std::make_unique<mtmd_image_preprocessor_llava_uhd>(ctx_v);
ov_img_first = false;
} break;
case PROJECTOR_TYPE_STEP3VL:
{
@@ -539,6 +540,7 @@ struct mtmd_context {
img_beg = "<img>";
img_end = "</img>";
image_preproc = std::make_unique<mtmd_image_preprocessor_internvl>(ctx_v);
ov_img_first = false;
} break;
case PROJECTOR_TYPE_KIMIVL:
{
@@ -615,11 +617,13 @@ struct mtmd_context {
{
img_end = "\n"; // prevent empty batch on llama-server
image_preproc = std::make_unique<mtmd_image_preprocessor_deepseekocr>(ctx_v);
ov_img_first = false;
} break;
case PROJECTOR_TYPE_DEEPSEEKOCR2:
{
img_end = "\n"; // prevent empty batch on llama-server
image_preproc = std::make_unique<mtmd_image_preprocessor_deepseekocr2>(ctx_v);
ov_img_first = false;
} break;
case PROJECTOR_TYPE_HUNYUANVL:
{
@@ -639,7 +643,8 @@ struct mtmd_context {
{
img_beg = "<image>";
img_end = "";
image_preproc = std::make_unique<mtmd_image_preprocessor_llava_uhd>(ctx_v);
image_preproc = std::make_unique<mtmd_image_preprocessor_granite>(ctx_v);
ov_img_first = true;
} break;
default:
throw std::runtime_error(string_format("%s: unexpected vision projector type %d\n", __func__, proj));
@@ -1033,7 +1038,10 @@ struct mtmd_tokenizer {
int32_t add_media(std::vector<const mtmd_bitmap *> & bitmaps) {
GGML_ASSERT(!bitmaps.empty());
if (!bitmaps[0]->is_audio) {
// note: only one type of media is supported per call, caller should enforce this
const bool is_vision = !bitmaps[0]->is_audio;
if (is_vision) {
// handle image
if (!ctx->ctx_v) {
@@ -1047,7 +1055,7 @@ struct mtmd_tokenizer {
// TODO @ngxson : this is quite hacky because preprocessor only support batch with one single element, that need to be fixed in the future (e.g. by changing the preprocessor interface always take single input)
clip_image_f32_batch batch_f32;
mtmd_image_preproc_out preproc_out;
for (const auto * bmp : bitmaps) {
// sanity check
@@ -1060,66 +1068,54 @@ struct mtmd_tokenizer {
}
// convert mtmd_bitmap to clip_image_u8
clip_image_u8_ptr img_u8(clip_image_u8_init());
img_u8->set_size(
clip_image_u8 img_u8;
img_u8.set_size(
{(int)bmp->nx, (int)bmp->ny},
bmp->is_placeholder());
img_u8->cpy_buf(bmp->get_ro_buf());
img_u8.cpy_buf(bmp->get_ro_buf());
// preprocess image
clip_image_f32_batch tmp_batch;
bool ok = ctx->image_preproc->preprocess(*img_u8, tmp_batch);
if (!ok) {
LOG_ERR("Unable to preprocess image\n");
return 2;
}
mtmd_image_preproc_out tmp_preproc_out = ctx->image_preproc->preprocess(img_u8);
// move entries and grid dimensions to the "global" batch_f32
for (auto & entry : tmp_batch.entries) {
batch_f32.entries.emplace_back(std::move(entry));
// move entries and grid dimensions to the "global" preproc_out
for (auto & entry : tmp_preproc_out.entries) {
preproc_out.entries.emplace_back(std::move(entry));
}
// for llava-uhd style, we need to handle grid too
// we don't care about overwriting these values for now because llama-uhd doesn't support batching anyway
batch_f32.grid_x = tmp_batch.grid_x;
batch_f32.grid_y = tmp_batch.grid_y;
}
// Annotate llava-next style tiles so clip_n_output_tokens accounts
// for per-tile newline injection.
if (ctx->proj_type_v() == PROJECTOR_TYPE_GRANITE4_VISION) {
if (batch_f32.entries.size() == 1) {
// Single-tile (overview only): append one newline row.
batch_f32.entries[0]->add_newline = true;
} else {
// Multi-tile: overview gets no newline, grid tiles get one.
batch_f32.entries[0]->add_newline = false;
for (size_t i = 1; i < batch_f32.entries.size(); ++i) {
batch_f32.entries[i]->add_newline = true;
}
// we don't care about overwriting these values for now because the case where bitmaps.size() > 1 is only for frame merging (qwen-vl), not supported by llava-uhd
if ((tmp_preproc_out.grid_x > 0 && tmp_preproc_out.grid_y > 0)
|| tmp_preproc_out.has_overview()) {
GGML_ASSERT(bitmaps.size() == 1);
preproc_out.grid_x = tmp_preproc_out.grid_x;
preproc_out.grid_y = tmp_preproc_out.grid_y;
preproc_out.overview = std::move(tmp_preproc_out.overview);
}
}
LOG_DBG("%s: preproc_out has %zu entries, grid_x = %d, grid_y = %d, has_overview = %d\n",
__func__, preproc_out.entries.size(), preproc_out.grid_x, preproc_out.grid_y,
preproc_out.has_overview() ? 1 : 0);
// handle llava-uhd style preprocessing
const bool has_tiling_grid = batch_f32.grid_x > 0 && batch_f32.grid_y > 0;
if (
ctx->slice_tmpl == MTMD_SLICE_TMPL_MINICPMV_2_5
|| ctx->slice_tmpl == MTMD_SLICE_TMPL_MINICPMV_2_6
|| ctx->slice_tmpl == MTMD_SLICE_TMPL_LLAMA4
|| ctx->slice_tmpl == MTMD_SLICE_TMPL_IDEFICS3
|| ctx->slice_tmpl == MTMD_SLICE_TMPL_STEP3VL
|| (ctx->slice_tmpl == MTMD_SLICE_TMPL_LFM2 && has_tiling_grid)
) {
// (output either a grid, or overview-only)
const bool has_tiling_grid = (preproc_out.grid_x > 0 && preproc_out.grid_y > 0)
|| preproc_out.has_overview();
if (has_tiling_grid) {
// [QWEN_VIDEO] we do not support "frame merging" for llava-uhd style, so no batching for now
GGML_ASSERT(bitmaps.size() == 1);
const int n_col = batch_f32.grid_x;
const int n_row = batch_f32.grid_y;
const int n_col = preproc_out.grid_x;
const int n_row = preproc_out.grid_y;
// split batch into chunks of single images
// NOTE: batch_f32 will be invalidated after this call
auto chunks = split_batch_to_chunk(std::move(batch_f32), bitmaps[0]->id);
auto chunks = split_batch_to_chunk(std::move(preproc_out), bitmaps[0]->id);
GGML_ASSERT(chunks.size() > 0);
// NOTE: preproc_out is invalidated after this point, do not use it anymore
// split_batch_to_chunk must always put the overview image first
auto ov_chunk = std::move(chunks.front());
chunks.erase(chunks.begin());
@@ -1146,7 +1142,16 @@ struct mtmd_tokenizer {
std::snprintf(buf.get(), sz, ctx->sli_img_start_tmpl.c_str(), y+1, x+1);
add_text(std::string(buf.get(), buf.get() + sz - 1), true);
}
cur.entries.emplace_back(std::move(chunks[y * n_col + x]));
auto & curr_chunk = chunks[y * n_col + x];
auto & curr_batch = curr_chunk.tokens_image->batch_f32;
if (curr_batch.entries.size() != 1) {
throw std::runtime_error(string_format("%s: expect 1 image in batch_f32", __func__));
}
LOG_DBG("%s: adding slice image at row %d col %d\n", __func__, y, x);
cur.entries.emplace_back(std::move(curr_chunk));
add_text(ctx->tok_sli_img_end);
if (!is_last_in_row) {
add_text(ctx->tok_sli_img_mid);
@@ -1168,9 +1173,14 @@ struct mtmd_tokenizer {
} else {
if (preproc_out.entries.size() == 0) {
LOG_ERR("%s: no image tokens produced by preprocessor (ref: https://github.com/ggml-org/llama.cpp/pull/24769)\n", __func__);
return 2;
}
size_t n_tokens = 0;
for (const auto & e : batch_f32.entries) {
n_tokens += clip_n_output_tokens(ctx->ctx_v, e.get());
for (auto & e : preproc_out.entries) {
n_tokens += clip_n_output_tokens(ctx->ctx_v, &e);
if (clip_model_n_temporal_merge(ctx->ctx_v) == 2) {
// [QWEN_VIDEO] pair input is merged to the same embd, so only count as one image
break;
@@ -1184,8 +1194,8 @@ struct mtmd_tokenizer {
if (mtmd_decode_use_mrope(ctx)) {
// for Qwen2VL, we need this information for M-RoPE decoding positions
image_tokens->nx = clip_n_output_tokens_x(ctx->ctx_v, batch_f32.entries[0].get());
image_tokens->ny = clip_n_output_tokens_y(ctx->ctx_v, batch_f32.entries[0].get());
image_tokens->nx = clip_n_output_tokens_x(ctx->ctx_v, &preproc_out.entries[0]);
image_tokens->ny = clip_n_output_tokens_y(ctx->ctx_v, &preproc_out.entries[0]);
} else {
// other models, we only need the total number of tokens
image_tokens->nx = n_tokens;
@@ -1200,6 +1210,12 @@ struct mtmd_tokenizer {
image_tokens->image_idx = n_images_added;
GGML_ASSERT(n_tokens == (size_t)image_tokens->n_tokens());
}
clip_image_f32_batch batch_f32;
batch_f32.is_audio = false;
batch_f32.entries = std::move(preproc_out.entries);
// do NOT use preproc_out from this point on, it's moved
image_tokens->batch_f32 = std::move(batch_f32);
image_tokens->id = bitmaps[0]->id; // optional
@@ -1279,13 +1295,13 @@ struct mtmd_tokenizer {
for (auto & mel_spec : mel_spec_chunks) {
const bool is_placeholder = mel_spec.data.empty();
clip_image_f32_ptr mel_f32(clip_image_f32_init());
mel_f32->set_size(
clip_image_f32 mel_f32;
mel_f32.set_size(
{mel_spec.n_len, mel_spec.n_mel},
is_placeholder, /* is_audio */ true);
mel_f32->cpy_buf(mel_spec.data);
mel_f32.cpy_buf(mel_spec.data);
size_t n_tokens = clip_n_output_tokens(ctx->ctx_a, mel_f32.get());
size_t n_tokens = clip_n_output_tokens(ctx->ctx_a, &mel_f32);
clip_image_f32_batch batch_f32;
batch_f32.is_audio = true;
@@ -1315,16 +1331,18 @@ struct mtmd_tokenizer {
return 0;
}
std::vector<mtmd_input_chunk> split_batch_to_chunk(clip_image_f32_batch && batch_f32, const std::string & id) {
std::vector<mtmd_input_chunk> split_batch_to_chunk(mtmd_image_preproc_out && preproc_out, const std::string & id) {
std::vector<mtmd_input_chunk> chunks;
for (auto & entry : batch_f32.entries) {
auto process_chunk = [&](clip_image_f32 && img) {
mtmd_image_tokens_ptr image_tokens(new mtmd_image_tokens);
image_tokens->nx = clip_n_output_tokens(ctx->ctx_v, entry.get());
image_tokens->nx = clip_n_output_tokens(ctx->ctx_v, &img);
image_tokens->ny = 1;
image_tokens->batch_f32.entries.push_back(std::move(entry));
image_tokens->batch_f32.entries.push_back(std::move(img));
image_tokens->id = id;
GGML_ASSERT(image_tokens->nx > 0);
mtmd_input_chunk chunk{
MTMD_INPUT_CHUNK_TYPE_IMAGE,
{}, // text tokens
@@ -1332,6 +1350,21 @@ struct mtmd_tokenizer {
nullptr, // audio tokens
};
chunks.emplace_back(std::move(chunk));
};
// overview image first
auto & overview = preproc_out.overview;
if (overview.nx() == 0 || overview.ny() == 0) {
throw std::runtime_error(string_format("%s: invalid overview image for llava-uhd style preprocessing\n", __func__));
}
process_chunk(std::move(preproc_out.overview));
// then, process slices
for (auto & entry : preproc_out.entries) {
if (entry.nx() == 0 || entry.ny() == 0) {
throw std::runtime_error(string_format("%s: invalid image slice for llava-uhd style preprocessing\n", __func__));
}
process_chunk(std::move(entry));
}
return chunks;
@@ -1405,57 +1438,22 @@ static int32_t mtmd_encode_impl(mtmd_context * ctx, const mtmd_image_tokens * im
LOG_ERR("%s: this API does not support non-vision input, please use mtmd_encode_chunk instead\n", __func__);
return 1;
}
auto proj_type = clip_get_projector_type(ctx_clip);
int n_embd_out = ctx->n_embd_out();
auto n_tokens_out = image_tokens->n_tokens();
out_embd.resize((size_t)n_embd_out * n_tokens_out);
bool ok = false;
if (clip_is_llava(ctx_clip)
|| proj_type == PROJECTOR_TYPE_MINICPMV
|| proj_type == PROJECTOR_TYPE_GLM_EDGE
|| proj_type == PROJECTOR_TYPE_INTERNVL
|| proj_type == PROJECTOR_TYPE_DEEPSEEKOCR2
|| proj_type == PROJECTOR_TYPE_GRANITE4_VISION) {
// TODO @ngxson : llava does not support batched encoding ; this should be fixed inside clip_image_batch_encode()
const auto & entries = image_tokens->batch_f32.entries;
// entries may have different token counts
// e.g., DeepSeek-OCR-2: 144 per tile views, 257 for the global view
size_t offset = 0;
for (size_t i = 0; i < entries.size(); i++) {
if (entries[i]->is_placeholder()) {
LOG_ERR("%s: image tokens batch entry %zu is placeholder\n", __func__, i);
return 1;
}
int n_tokens_per_image = clip_n_output_tokens(ctx_clip, entries[i].get());
std::vector<float> tmp_embd((size_t)n_tokens_per_image * n_embd_out);
bool ok_i = clip_image_encode(
ctx_clip,
ctx->n_threads,
entries[i].get(),
tmp_embd);
if (!ok_i) {
LOG_ERR("%s: failed to encode image %zu\n", __func__, i);
return 1;
}
ok = true;
std::copy(tmp_embd.begin(), tmp_embd.end(), out_embd.begin() + offset);
offset += static_cast<size_t>(n_embd_out) * n_tokens_per_image;
}
} else {
if (image_tokens->is_placeholder()) {
LOG_ERR("%s: image tokens batch is placeholder\n", __func__);
return 1;
}
ok = clip_image_batch_encode(
ctx_clip,
ctx->n_threads,
&image_tokens->batch_f32,
out_embd);
if (image_tokens->is_placeholder()) {
LOG_ERR("%s: image tokens batch is placeholder\n", __func__);
return 1;
}
bool ok = clip_image_batch_encode(
ctx_clip,
ctx->n_threads,
&image_tokens->batch_f32,
out_embd);
return ok ? 0 : 1;
}
@@ -2082,16 +2080,18 @@ void mtmd_debug_preprocess_image(mtmd_context * ctx, const std::vector<uint8_t>
clip_image_u8 img_u8;
img_u8.set_size({nx, ny}, false);
img_u8.cpy_buf(rgb_values);
clip_image_f32_batch batch_f32;
GGML_ASSERT(ctx->image_preproc != nullptr);
bool ok = ctx->image_preproc->preprocess(img_u8, batch_f32);
if (!ok) {
LOG_ERR("%s: failed to preprocess image\n", __func__);
return;
mtmd_image_preproc_out preproc_out = ctx->image_preproc->preprocess(img_u8);
clip_image_f32_batch batch_f32;
batch_f32.is_audio = false;
for (auto & entry : preproc_out.entries) {
batch_f32.entries.push_back(std::move(entry));
}
LOG_INF("%s: preprocessed image to batch_f32 with %d entries\n", __func__, (int)batch_f32.entries.size());
for (size_t i = 0; i < batch_f32.entries.size(); i++) {
LOG_INF("%s: entry %zu has nx=%d, ny=%d\n", __func__, i, batch_f32.entries[i]->nx(), batch_f32.entries[i]->ny());
LOG_INF("%s: entry %zu has nx=%d, ny=%d\n", __func__, i, batch_f32.entries[i].nx(), batch_f32.entries[i].ny());
// TODO: better way to dump entry content?
}
}
+2
@@ -17,6 +17,8 @@ add_library(${TARGET} STATIC
server-context.h
server-tools.cpp
server-tools.h
server-schema.cpp
server-schema.h
)
if (BUILD_SHARED_LIBS)
+18
@@ -180,6 +180,24 @@ That requires `JSON.stringify` when formatted to message content:
}
```
### Model management API (router mode)
The model management API was added in PR [#23976](https://github.com/ggml-org/llama.cpp/pull/23976).
The main goal of this API is to allow downloading models and/or removing models from the web UI. It relies on the model cache infrastructure under the hood to manage the list of models dynamically.
Instead of building everything from the ground up (like what most AI agents will do when you ask them to implement a similar feature), we built on top of existing, already well-engineered components inside the codebase:
- Model cache infrastructure as mentioned above (`common/download.h`)
- Server response queue (`server-queue.h`). We use this feature to broadcast events to SSE clients.
- Server router thread management (`server-models.h`). We re-use the same thread model that is used for managing subprocess life cycle, except that we don't create a new subprocess, but launch the download right inside the thread.
The flow for downloading a new model:
- POST request comes in --> `post_router_models` --> validation
- `server_models::download()` is called
- Sets up a new thread `inst.th` and runs the download inside
- If a stop request comes in, set `stop_download` to `true`
- Otherwise, upon completion, we call `load_models()` to refresh the list of models
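The cancellation part of the flow above can be sketched as a small, self-contained thread body. This is only an illustration under stated assumptions: `run_download`, `download_result`, `fetch_chunk` and `on_complete` are hypothetical names that do not come from the codebase, the chunk loop stands in for real network I/O, and `on_complete` plays the role of the `load_models()` refresh.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical result type, not part of the server code.
enum class download_result { completed, cancelled };

// Sketch of the body that would run inside the download thread.
// - stop_download: set to true by another thread to request cancellation
// - n_chunks:      number of pieces to fetch (stand-in for real HTTP I/O)
// - fetch_chunk:   called once per piece with its index
// - on_complete:   stands in for load_models(); runs only if not cancelled
inline download_result run_download(
        const std::atomic<bool> & stop_download,
        int n_chunks,
        const std::function<void(int)> & fetch_chunk,
        const std::function<void()> & on_complete) {
    assert(n_chunks >= 0);
    for (int i = 0; i < n_chunks; i++) {
        // check the flag between chunks so a stop request takes effect promptly
        if (stop_download.load()) {
            return download_result::cancelled;
        }
        fetch_chunk(i);
    }
    on_complete();
    return download_result::completed;
}
```

In a real setup this function would be launched on a `std::thread`, with the stop flag owned by the model instance so that the request handler can set it from another thread.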
### Notable Related PRs
- Initial server implementation: https://github.com/ggml-org/llama.cpp/pull/1443
+115
@@ -1778,6 +1778,20 @@ The `status` object can be:
}
```
Note: in the "downloading" state, multiple files may be downloading in parallel, so `progress` is keyed by file URL:
```json
"status": {
"value": "downloading",
"progress": {
"https://...model.gguf": {
"done": 195963406,
"total": 219307424
}
}
}
```
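Since `progress` can hold several files, a client that wants a single progress bar has to combine them. A minimal sketch, assuming the JSON has already been parsed into a map; `file_progress` and `overall_progress` are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Mirrors one entry of the "progress" object (hypothetical client-side type).
struct file_progress {
    int64_t done  = 0; // bytes received so far
    int64_t total = 0; // expected size in bytes
};

// Returns overall completion in [0, 1], weighted by file size.
// Returns 0 when no totals are known yet.
inline double overall_progress(const std::map<std::string, file_progress> & files) {
    int64_t done  = 0;
    int64_t total = 0;
    for (const auto & kv : files) {
        assert(kv.second.done >= 0 && kv.second.total >= 0);
        done  += kv.second.done;
        total += kv.second.total;
    }
    if (total == 0) {
        return 0.0;
    }
    return static_cast<double>(done) / static_cast<double>(total);
}
```

Summing bytes before dividing, rather than averaging per-file ratios, keeps a small file from dominating the result. For the single file shown above, this gives roughly 89%.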
### POST `/models/load`: Load a model
Load a model
@@ -1820,6 +1834,107 @@ Response:
}
```
### GET `/models/sse`: Real-time events
Example events:
```js
{
"model": "...",
"event": "model_status",
"data": {
"status": "loading"
}
}
{
"model": "...",
"event": "download_progress",
"data": {
// note: there can be multiple files being downloaded in parallel
"https://...model.gguf": {
"done": 195963406,
"total": 219307424
}
}
}
{
"model": "...",
"event": "download_finished",
"data": {
"status": "loading"
}
}
{
"model": "...",
"event": "model_remove"
}
// special event: reload of the list of all models
{
"model": "*",
"event": "models_reload"
}
```
### POST `/models`: Download new model
Trigger a new download (non-blocking). Progress can be tracked via the SSE endpoint `/models/sse`.
To cancel a download in progress, send a request to `/models/unload`.
Download procedure:
- Send POST request to `/models`
- Subscribe to `/models/sse` for updates
- When the download ends, you will receive either a `download_finished` or a `download_failed` event
- Call GET `/models` to trigger a model list update. If the download succeeded, you should see the new model in the list
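On the client side, the procedure above amounts to watching the event stream for one of the two terminal events. A minimal sketch, assuming events have already been parsed from the SSE stream into plain structs; `sse_event`, `download_state` and `track_download` are hypothetical names, and only the event names come from this document:

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class download_state { in_progress, finished, failed };

// Hypothetical parsed form of one SSE message ("model" and "event" fields).
struct sse_event {
    std::string model;
    std::string event;
};

// Scan the events seen so far and report the download state of `model`.
inline download_state track_download(const std::string & model, const std::vector<sse_event> & events) {
    assert(!model.empty());
    for (const auto & ev : events) {
        if (ev.model != model) {
            continue; // event belongs to another model
        }
        if (ev.event == "download_finished") {
            return download_state::finished;
        }
        if (ev.event == "download_failed") {
            return download_state::failed;
        }
        // download_progress, model_status, ...: keep waiting
    }
    return download_state::in_progress;
}
```

Once this returns `finished`, the client would call GET `/models` as described in the last step.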
Payload:
```json
{
"model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
}
```
Response (download is started in the background):
```json
{
"success": true
}
```
Response (error, cannot start the download):
```json
{
"error": {
"code": 400,
"message": "model validation failed, unable to download",
"type": "invalid_request_error"
}
}
```
### DELETE `/models`: Delete a model from cache
IMPORTANT: only models stored in the cache can be deleted. You cannot delete models in a preset.
The model name must be passed via a query parameter: `?model={name}`
If the deletion succeeds, the server sends an SSE event of type `model_remove`
Response:
```json
{
"success": true
}
```
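Model names such as `ggml-org/gemma-3-4b-it-GGUF:Q4_K_M` contain `/` and `:`, so a client may want to percent-encode the name before placing it in the query string. Whether the server requires this is not stated here; the sketch below simply takes the conservative route of encoding everything outside the RFC 3986 "unreserved" set. `url_encode_component` and `build_delete_models_path` are hypothetical helper names.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Percent-encode every byte except RFC 3986 "unreserved" characters
// (ALPHA / DIGIT / "-" / "." / "_" / "~").
inline std::string url_encode_component(const std::string & s) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Build the request path for DELETE /models?model={name}.
inline std::string build_delete_models_path(const std::string & model) {
    assert(!model.empty());
    return "/models?model=" + url_encode_component(model);
}
```

The resulting path can then be passed to whichever HTTP client is used to issue the DELETE request.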
## API errors
`llama-server` returns errors in the same format as OAI: https://github.com/openai/openai-openapi
+18 -12
@@ -4,6 +4,7 @@
#include "server-http.h"
#include "server-task.h"
#include "server-queue.h"
#include "server-schema.h"
#include "build-info.h"
#include "common.h"
@@ -189,9 +190,10 @@ struct server_slot {
// stats
size_t n_sent_text = 0; // number of sent text character
int64_t t_print_last = 0;
int64_t t_start_process_prompt;
int64_t t_start_generation;
int64_t t_print_last = 0;
int32_t n_decoded_last = 0;
double t_prompt_processing = 0.0; // ms
double t_token_generation = 0.0; // ms
@@ -470,11 +472,13 @@ struct server_slot {
return;
}
const double n_gen_second = 1e3 / (t_token_generation) * (n_decoded);
const double n_gen_second_win = 1e6 / (t_now - t_print_last) * (n_decoded - n_decoded_last);
t_print_last = t_now;
n_decoded_last = n_decoded;
const double n_gen_second = 1e3 / t_token_generation * n_decoded;
SLT_INF(*this, "n_decoded = %6d, tg = %6.2f t/s\n", n_decoded, n_gen_second);
SLT_INF(*this, "n_decoded = %6d, tg = %6.2f t/s, tg_3s = %6.2f t/s\n", n_decoded, n_gen_second, n_gen_second_win);
}
void print_timings_pp() const {
@@ -3038,8 +3042,8 @@ private:
}
}
const int64_t t_current = ggml_time_us();
slot.t_prompt_processing = (t_current - slot.t_start_process_prompt) / 1e3;
const int64_t t_now = ggml_time_us();
slot.t_prompt_processing = (t_now - slot.t_start_process_prompt) / 1e3;
slot.print_timings_pp();
// truncate any tokens that are beyond n_past for this slot
@@ -3447,17 +3451,19 @@ private:
common_sampler_accept(slot.smpl.get(), id, true);
// here we have synchronized the llama_context (due to the sampling above), so we can do time measurement
const int64_t t_current = ggml_time_us();
const int64_t t_now = ggml_time_us();
slot.n_decoded += 1;
if (slot.n_decoded == 1) {
slot.t_start_generation = t_current;
slot.t_start_generation = t_now;
slot.t_print_last = t_now;
slot.n_decoded_last = 0;
slot.t_prompt_processing = (slot.t_start_generation - slot.t_start_process_prompt) / 1e3;
metrics.on_prompt_eval(slot);
}
slot.t_token_generation = std::max<int64_t>(1, t_current - slot.t_start_generation) / 1e3;
slot.t_token_generation = std::max<int64_t>(1, t_now - slot.t_start_generation) / 1e3;
completion_token_output result;
result.tok = id;
@@ -3551,11 +3557,11 @@ private:
slot.spec_draft = std::move(accepted);
}
const int64_t t_current = ggml_time_us();
const int64_t t_now = ggml_time_us();
const auto ids = std::move(slot.spec_draft);
slot.t_token_generation = std::max<int64_t>(1, t_current - slot.t_start_generation) / 1e3;
slot.t_token_generation = std::max<int64_t>(1, t_now - slot.t_start_generation) / 1e3;
// update how many tokens out of those tested were accepted
slot.n_draft_accepted += ids.size() - 1;
@@ -3820,7 +3826,7 @@ std::unique_ptr<server_res_generator> server_routes::handle_completions_impl(
task.id = rd.get_new_id();
task.tokens = std::move(inputs[i]);
task.params = server_task::params_from_json_cmpl(
task.params = server_schema::eval_llama_cmpl_schema(
ctx_server.vocab,
params,
meta->slot_n_ctx,
+17
@@ -588,6 +588,23 @@ void server_http_context::post(const std::string & path, const server_http_conte
});
}
void server_http_context::del(const std::string & path, const server_http_context::handler_t & handler) const {
handlers.emplace(path, handler);
pimpl->srv->Delete(path_prefix + path, [handler](const httplib::Request & req, httplib::Response & res) {
server_http_req_ptr request = std::make_unique<server_http_req>(server_http_req{
get_params(req),
get_headers(req),
req.path,
build_query_string(req),
req.body,
{},
req.is_connection_closed
});
server_http_res_ptr response = handler(*request);
process_handler_response(std::move(request), response, res);
});
}
//
// Vertex AI Prediction protocol (AIP_PREDICT_ROUTE)
// https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements
+1
@@ -86,6 +86,7 @@ struct server_http_context {
void get(const std::string & path, const handler_t & handler) const;
void post(const std::string & path, const handler_t & handler) const;
void del(const std::string & path, const handler_t & handler) const;
// Register the Google Cloud Platform (Vertex AI) compat (AIP_PREDICT_ROUTE env var, or /predict)
// Must be called AFTER all other API routes are registered
+429 -45
@@ -9,6 +9,7 @@
#include <sheredom/subprocess.h>
#include <functional>
#include <optional>
#include <algorithm>
#include <thread>
#include <mutex>
@@ -51,6 +52,37 @@ extern char **environ;
// ref: https://github.com/ggml-org/llama.cpp/issues/17862
#define CHILD_ADDR "127.0.0.1"
struct server_subproc {
std::optional<subprocess_s> sproc; // empty while in DOWNLOADING state
std::atomic<bool> stopped{false}; // set to cancel a download or signal child process exit
subprocess_s & get() {
GGML_ASSERT(sproc.has_value() && "subprocess not initialized");
return sproc.value();
}
bool is_alive() {
return sproc.has_value() && subprocess_alive(&sproc.value());
}
void terminate() {
if (!sproc.has_value()) {
return;
}
#if defined(_WIN32)
if (sproc->hProcess == NULL) {
return;
}
#else
if (sproc->child <= 0) {
return;
}
#endif
subprocess_terminate(&sproc.value());
}
};
static std::filesystem::path get_server_exec_path() {
#if defined(_WIN32)
wchar_t buf[32768] = { 0 }; // Large buffer to handle long paths
@@ -272,12 +304,25 @@ void server_models::add_model(server_model_meta && meta) {
meta.update_caps();
std::string name = meta.name;
mapping[name] = instance_t{
/* subproc */ std::make_shared<subprocess_s>(),
/* subproc */ std::make_shared<server_subproc>(),
/* th */ std::thread(),
/* meta */ std::move(meta)
};
}
void server_models::notify_sse(const std::string & event, const std::string & model_id, const json & data) {
std::unique_ptr<server_task_result_router> result = std::make_unique<server_task_result_router>();
result->data = {
{"model", model_id},
{"event", event},
};
if (!data.is_null()) {
result->data["data"] = data;
}
SRV_DBG("notifying SSE clients about event '%s' for model '%s': %s\n", event.c_str(), model_id.c_str(), safe_json_to_str(result->data).c_str());
sse.broadcast(std::move(result));
}
void server_models::load_models() {
// Phase 1: load presets from all sources — pure I/O, no lock needed
// 1. cached models
@@ -304,20 +349,34 @@ void server_models::load_models() {
// note: if a model exists in both cached and local, local takes precedence
common_presets final_presets;
for (const auto & [name, preset] : cached_models) final_presets[name] = preset;
for (const auto & [name, preset] : local_models) final_presets[name] = preset;
std::unordered_map<std::string, server_model_source> source_map;
for (const auto & [name, preset] : cached_models) {
final_presets[name] = preset;
source_map[name] = SERVER_MODEL_SOURCE_CACHE;
}
for (const auto & [name, preset] : local_models) {
final_presets[name] = preset;
source_map[name] = SERVER_MODEL_SOURCE_MODELS_DIR;
}
for (const auto & [name, custom] : custom_presets) {
if (final_presets.find(name) != final_presets.end()) {
final_presets[name].merge(custom);
} else {
final_presets[name] = custom;
}
source_map[name] = SERVER_MODEL_SOURCE_PRESET;
}
// server base preset from CLI args takes highest precedence
// overlay router's own CLI args on top of every model preset so that
// e.g. `llama-server --temp 0` is honoured by all child processes
for (auto & [name, preset] : final_presets) {
preset.merge(base_preset);
}
auto get_source = [&](const std::string & name) {
return source_map.count(name) ? source_map.at(name) : SERVER_MODEL_SOURCE_PRESET;
};
// Helpers that read `mapping` — must be called while holding the lock.
std::unordered_set<std::string> custom_names;
for (const auto & [name, preset] : custom_presets) custom_names.insert(name);
@@ -366,12 +425,15 @@ void server_models::load_models() {
// (unload, load) or when joining threads (the monitoring thread calls update_status
// which locks the mutex, so joining while holding it would deadlock).
std::unique_lock<std::mutex> lk(mutex);
need_reload = false;
bool is_first_load = mapping.empty();
if (is_first_load) {
// FIRST LOAD: add all models, then unlock for autoloading
for (const auto & [name, preset] : final_presets) {
server_model_meta meta{
/* source */ get_source(name),
/* preset */ preset,
/* name */ name,
/* aliases */ {},
@@ -384,7 +446,7 @@ void server_models::load_models() {
/* exit_code */ 0,
/* stop_timeout */ DEFAULT_STOP_TIMEOUT,
/* multimodal */ mtmd_caps{false, false},
/* need_download */ false,
// /* need_download */ false,
};
add_model(std::move(meta));
}
@@ -453,6 +515,9 @@ void server_models::load_models() {
}
}
for (auto & [name, inst] : mapping) {
if (inst.meta.status == SERVER_MODEL_STATUS_DOWNLOADING) {
continue; // downloading models are not from config sources, leave them alone
}
if (final_presets.find(name) == final_presets.end() && !inst.meta.is_running() && inst.th.joinable()) {
threads_to_join.push_back(std::move(inst.th));
}
@@ -465,7 +530,15 @@ void server_models::load_models() {
// erase models no longer in any source
for (auto it = mapping.begin(); it != mapping.end(); ) {
if (final_presets.find(it->first) == final_presets.end()) {
if (it->second.meta.status == SERVER_MODEL_STATUS_DOWNLOADING) {
++it; // download thread is still busy, skip
} else if (it->second.meta.status == SERVER_MODEL_STATUS_DOWNLOADED) {
// download finished, safe to erase
if (it->second.th.joinable()) {
it->second.th.join();
}
it = mapping.erase(it);
} else if (final_presets.find(it->first) == final_presets.end()) {
SRV_INF("(reload) removing model name=%s (no longer in source)\n", it->first.c_str());
GGML_ASSERT(!it->second.th.joinable()); // must have been joined above
it = mapping.erase(it);
@@ -526,6 +599,7 @@ void server_models::load_models() {
for (const auto & [name, preset] : final_presets) {
if (mapping.find(name) == mapping.end()) {
server_model_meta meta{
/* source */ get_source(name),
/* preset */ preset,
/* name */ name,
/* aliases */ {},
@@ -538,7 +612,7 @@ void server_models::load_models() {
/* exit_code */ 0,
/* stop_timeout */ DEFAULT_STOP_TIMEOUT,
/* multimodal */ mtmd_caps{false, false},
/* need_download */ false,
// /* need_download */ false,
};
add_model(std::move(meta));
newly_added.push_back(name);
@@ -571,6 +645,8 @@ void server_models::load_models() {
SRV_INF("(reload) loading new model %s\n", name.c_str());
load(name);
}
notify_sse("models_reload", "*");
}
}
@@ -597,7 +673,13 @@ bool server_models::has_model(const std::string & name) {
}
std::optional<server_model_meta> server_models::get_meta(const std::string & name) {
std::lock_guard<std::mutex> lk(mutex);
std::unique_lock<std::mutex> lk(mutex);
if (need_reload) {
lk.unlock();
load_models();
lk.lock();
}
auto it = mapping.find(name);
if (it != mapping.end()) {
return it->second.meta;
@@ -683,7 +765,13 @@ static std::vector<char *> to_char_ptr_array(const std::vector<std::string> & ve
}
std::vector<server_model_meta> server_models::get_all_meta() {
std::lock_guard<std::mutex> lk(mutex);
std::unique_lock<std::mutex> lk(mutex);
if (need_reload) {
lk.unlock();
load_models();
lk.lock();
}
std::vector<server_model_meta> result;
result.reserve(mapping.size());
for (const auto & [name, inst] : mapping) {
@@ -770,7 +858,7 @@ void server_models::load(const std::string & name) {
throw std::runtime_error("failed to get a port number");
}
inst.subproc = std::make_shared<subprocess_s>();
inst.subproc = std::make_shared<server_subproc>();
{
SRV_INF("spawning server instance with name=%s on port %d\n", inst.meta.name.c_str(), inst.meta.port);
@@ -792,19 +880,20 @@ void server_models::load(const std::string & name) {
// TODO @ngxson : maybe separate stdout and stderr in the future
// so that we can use stdout for commands and stderr for logging
int options = subprocess_option_no_window | subprocess_option_combined_stdout_stderr;
int result = subprocess_create_ex(argv.data(), options, envp.data(), inst.subproc.get());
inst.subproc->sproc.emplace();
int result = subprocess_create_ex(argv.data(), options, envp.data(), &inst.subproc->get());
if (result != 0) {
throw std::runtime_error("failed to spawn server instance");
}
inst.stdin_file = subprocess_stdin(inst.subproc.get());
inst.stdin_file = subprocess_stdin(&inst.subproc->get());
}
// start a thread to manage the child process
// captured variables are guaranteed to be destroyed only after the thread is joined
inst.th = std::thread([this, name, child_proc = inst.subproc, port = inst.meta.port, stop_timeout = inst.meta.stop_timeout]() {
FILE * stdin_file = subprocess_stdin(child_proc.get());
FILE * stdout_file = subprocess_stdout(child_proc.get()); // combined stdout/stderr
FILE * stdin_file = subprocess_stdin(&child_proc->get());
FILE * stdout_file = subprocess_stdout(&child_proc->get()); // combined stdout/stderr
std::thread log_thread([&]() {
// read stdout/stderr and forward to main server log
@@ -829,50 +918,49 @@ void server_models::load(const std::string & name) {
});
std::thread stopping_thread([&]() {
// thread to monitor stopping signal OR child crash
// thread to monitor explicit stop requests; child crash is signalled via child_proc->stopped
auto is_stopping = [this, &name]() {
return this->stopping_models.find(name) != this->stopping_models.end();
};
auto should_wake = [&]() {
return is_stopping() || !subprocess_alive(child_proc.get());
};
{
std::unique_lock<std::mutex> lk(this->mutex);
this->cv_stop.wait(lk, should_wake);
this->cv_stop.wait(lk, [&]() {
return is_stopping() || child_proc->stopped.load(std::memory_order_acquire);
});
}
// child may have already exited (e.g. crashed) — skip shutdown sequence
if (!subprocess_alive(child_proc.get())) {
// child crashed or finished on its own — skip graceful shutdown sequence
if (child_proc->stopped.load(std::memory_order_acquire)) {
return;
}
SRV_INF("stopping model instance name=%s\n", name.c_str());
// send interrupt to child process
fprintf(stdin_file, "%s\n", CMD_ROUTER_TO_CHILD_EXIT);
fflush(stdin_file);
// wait to stop gracefully or timeout
int64_t start_time = ggml_time_ms();
while (true) {
std::unique_lock<std::mutex> lk(this->mutex);
if (!is_stopping()) {
return; // already stopped
if (!is_stopping() || child_proc->stopped.load(std::memory_order_acquire)) {
return;
}
int64_t elapsed = ggml_time_ms() - start_time;
if (elapsed >= stop_timeout * 1000) {
// timeout, force kill
lk.unlock();
SRV_WRN("force-killing model instance name=%s after %d seconds timeout\n", name.c_str(), stop_timeout);
subprocess_terminate(child_proc.get());
child_proc->terminate();
return;
}
this->cv_stop.wait_for(lk, std::chrono::seconds(1));
this->cv_stop.wait_for(lk, std::chrono::seconds(1), [&]() {
return !is_stopping() || child_proc->stopped.load(std::memory_order_acquire);
});
}
});
// we reach here when the child process exits
// we reach here when the child process exits (stdout EOF)
// note: we cannot join() prior to this point because it will close stdin_file
if (log_thread.joinable()) {
log_thread.join();
}
// stop the timeout monitoring thread
child_proc->stopped.store(true, std::memory_order_release);
{
std::lock_guard<std::mutex> lk(this->mutex);
stopping_models.erase(name);
@@ -884,8 +972,8 @@ void server_models::load(const std::string & name) {
// get the exit code
int exit_code = 0;
subprocess_join(child_proc.get(), &exit_code);
subprocess_destroy(child_proc.get());
subprocess_join(&child_proc->get(), &exit_code);
subprocess_destroy(&child_proc->get());
// update status and exit code
this->update_status(name, SERVER_MODEL_STATUS_UNLOADED, exit_code);
@@ -896,30 +984,118 @@ void server_models::load(const std::string & name) {
{
auto & old_instance = mapping[name];
// old process should have exited already, but just in case, we clean it up here
if (subprocess_alive(old_instance.subproc.get())) {
if (old_instance.subproc->is_alive()) {
SRV_WRN("old process for model name=%s is still alive, this is unexpected\n", name.c_str());
subprocess_terminate(old_instance.subproc.get()); // force kill
old_instance.subproc->terminate(); // force kill
}
if (old_instance.th.joinable()) {
old_instance.th.join();
}
}
notify_sse("model_status", name, {
{"status", server_model_status_to_string(inst.meta.status)},
});
mapping[name] = std::move(inst);
cv.notify_all();
}
// callback for model downloading functionality
struct server_models_download_res : public common_download_callback {
common_params_model model;
common_download_opts opts;
std::function<bool()> should_stop;
std::function<void(const common_download_progress & p)> on_progress;
bool is_ok = false;
bool run() {
try {
common_download_model(model, opts);
is_ok = true;
} catch (const std::exception & e) {
SRV_ERR("download failed for model name=%s: %s\n", model.name.c_str(), e.what());
is_ok = false;
}
return is_ok;
}
void on_start(const common_download_progress & p) override {
on_progress(p);
}
void on_update(const common_download_progress & p) override {
on_progress(p);
}
void on_done(const common_download_progress &, bool ok) override {
is_ok = ok;
}
bool is_cancelled() const override {
return should_stop();
}
};
void server_models::download(common_params_model && model, common_download_opts && opts) {
std::string name = model.name;
GGML_ASSERT(name == model.hf_repo);
std::unique_lock<std::mutex> lk(mutex);
if (mapping.find(name) != mapping.end()) {
throw std::runtime_error("model name=" + name + " already exists");
}
instance_t inst;
inst.meta.name = name;
inst.meta.status = SERVER_MODEL_STATUS_DOWNLOADING;
inst.subproc = std::make_shared<server_subproc>();
auto dl = std::make_unique<server_models_download_res>();
dl->model = model; // copy
dl->opts = opts; // copy
dl->should_stop = [sp = inst.subproc]() {
return sp->stopped.load(std::memory_order_relaxed);
};
dl->on_progress = [this, name](const common_download_progress & p) {
update_download_progress(name, p, false);
};
inst.th = std::thread([this, dl = std::move(dl)]() {
dl->opts.callback = dl.get();
bool ok = dl->run();
SRV_INF("download finished for model name=%s with status=%s\n",
dl->model.name.c_str(), ok ? "success" : "failure");
update_download_progress(dl->model.name, {}, true, ok);
// need_reload is set inside update_download_progress under the mutex;
// the next load_models() call will clean up this instance
});
mapping[name] = std::move(inst);
notify_sse("status_update", name, {
{"status", server_model_status_to_string(SERVER_MODEL_STATUS_DOWNLOADING)},
});
cv.notify_all();
}
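
The download path above cancels its worker cooperatively: unload() only flips an atomic flag, and the download callback polls it through should_stop. A minimal sketch of that pattern follows, assuming hypothetical names (cancel_state, run_cancellable, cancel_demo) that are not part of the patch; the real code keeps the flag in server_subproc::stopped.

```cpp
#include <atomic>
#include <cassert>
#include <climits>
#include <functional>
#include <memory>
#include <thread>

// Hypothetical stand-in for the shared state captured by the download thread.
struct cancel_state {
    std::atomic<bool> stopped{false};
};

// Performs up to max_steps units of placeholder work, polling should_stop
// before each unit. Returns the number of units completed.
static int run_cancellable(int max_steps, const std::function<bool()> & should_stop) {
    int done = 0;
    while (done < max_steps && !should_stop()) {
        ++done;
    }
    return done;
}

// Starts a worker, requests cancellation, and returns how far the worker got.
static int cancel_demo() {
    auto state = std::make_shared<cancel_state>();
    int steps = -1;
    std::thread worker([&steps, sp = state]() {
        // the shared_ptr copy keeps the flag alive for the worker's lifetime
        steps = run_cancellable(INT_MAX, [sp]() {
            return sp->stopped.load(std::memory_order_relaxed);
        });
    });
    state->stopped.store(true, std::memory_order_relaxed); // request cancellation
    worker.join();                                         // steps is safe to read after join
    return steps;
}
```

Calling cancel_demo() returns a non-negative step count once the worker has observed the flag. Because the flag only requests a stop, the caller still has to join the thread, which is what the reload and remove paths in the patch do.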
void server_models::unload(const std::string & name) {
std::lock_guard<std::mutex> lk(mutex);
std::unique_lock<std::mutex> lk(mutex);
auto it = mapping.find(name);
if (it != mapping.end()) {
if (it->second.meta.is_running()) {
if (it->second.meta.status == SERVER_MODEL_STATUS_DOWNLOADING) {
SRV_INF("cancelling download for model name=%s\n", name.c_str());
it->second.subproc->stopped.store(true, std::memory_order_relaxed);
// for convenience, we wait for the status change here
wait(lk, name, [](const server_model_meta & new_meta) {
return new_meta.status != SERVER_MODEL_STATUS_DOWNLOADING;
});
} else if (it->second.meta.is_running()) {
SRV_INF("stopping model instance name=%s\n", name.c_str());
stopping_models.insert(name);
if (it->second.meta.status == SERVER_MODEL_STATUS_LOADING) {
// special case: if model is in loading state, unloading means force-killing it
SRV_WRN("model name=%s is still loading, force-killing\n", name.c_str());
subprocess_terminate(it->second.subproc.get());
it->second.subproc->terminate();
}
cv_stop.notify_all();
// status change will be handled by the managing thread
@@ -934,7 +1110,10 @@ void server_models::unload_all() {
{
std::lock_guard<std::mutex> lk(mutex);
for (auto & [name, inst] : mapping) {
if (inst.meta.is_running()) {
if (inst.meta.status == SERVER_MODEL_STATUS_DOWNLOADING) {
SRV_INF("cancelling download for model name=%s\n", name.c_str());
inst.subproc->stopped.store(true, std::memory_order_relaxed);
} else if (inst.meta.is_running()) {
SRV_INF("stopping model instance name=%s\n", name.c_str());
stopping_models.insert(name);
cv_stop.notify_all();
@@ -959,6 +1138,17 @@ void server_models::update_status(const std::string & name, server_model_status
meta.status = status;
meta.exit_code = exit_code;
}
// broadcast status change to SSE
{
json data = {
{"status", server_model_status_to_string(status)},
};
if (status == SERVER_MODEL_STATUS_UNLOADED) {
data["exit_code"] = exit_code;
}
// note: notify_sse doesn't acquire the lock, so no deadlock here
notify_sse("status_change", name, data);
}
cv.notify_all();
}
@@ -985,12 +1175,82 @@ void server_models::update_loaded_info(const std::string & name, std::string & r
cv.notify_all();
}
void server_models::wait_until_loading_finished(const std::string & name) {
std::unique_lock<std::mutex> lk(mutex);
cv.wait(lk, [this, &name]() {
void server_models::update_download_progress(const std::string & name, const common_download_progress & progress, bool done, bool ok) {
json curr;
{
std::lock_guard<std::mutex> lk(mutex);
auto it = mapping.find(name);
if (it != mapping.end()) {
return it->second.meta.status != SERVER_MODEL_STATUS_LOADING;
if (done) {
// mark the instance to be erased on next load_models() call
it->second.meta.status = SERVER_MODEL_STATUS_DOWNLOADED;
need_reload = true;
} else {
json & info = it->second.meta.loaded_info;
if (!info.contains("progress")) {
info["progress"] = json{};
}
info["progress"][progress.url] = {
{"done", progress.downloaded},
{"total", progress.total},
};
curr = it->second.meta.loaded_info; // copy
}
}
}
if (done) {
cv.notify_all(); // notify in case unload() is waiting for download to be cancelled
notify_sse(ok ? "download_finished" : "download_failed", name, {});
} else {
notify_sse("download_progress", name, curr);
}
}
bool server_models::remove(const std::string & name) {
auto meta = get_meta(name);
if (!meta.has_value()) {
throw std::runtime_error("model name=" + name + " is not found");
}
if (meta->source != SERVER_MODEL_SOURCE_CACHE) {
throw std::runtime_error("model name=" + name + " is not removable (not from cache)");
}
unload(name); // cancel download or stop running instance
{
std::unique_lock<std::mutex> lk(mutex);
// a cancelled download lands on DOWNLOADED; a stopped instance lands on UNLOADED
wait(lk, name, [](const server_model_meta & new_meta) {
return new_meta.status == SERVER_MODEL_STATUS_UNLOADED
|| new_meta.status == SERVER_MODEL_STATUS_DOWNLOADED;
});
// join before erasing - after status reaches UNLOADED/DOWNLOADED the thread no
// longer acquires this mutex, so joining while holding it is safe
if (mapping[name].th.joinable()) {
mapping[name].th.join();
}
// remove the model from disk (hold lock to prevent concurrent load)
bool ok = common_download_remove(name);
if (ok) {
mapping.erase(name);
}
SRV_INF("removing model name=%s from cache (%s)\n", name.c_str(), ok ? "succeeded" : "failed");
notify_sse("model_remove", name, {});
return ok;
}
}
void server_models::wait(const std::string & name, std::function<bool(const server_model_meta &)> predicate) {
std::unique_lock<std::mutex> lk(mutex);
wait(lk, name, predicate);
}
void server_models::wait(std::unique_lock<std::mutex> & lk, const std::string & name, std::function<bool(const server_model_meta &)> predicate) {
cv.wait(lk, [this, &name, &predicate]() {
auto it = mapping.find(name);
if (it != mapping.end()) {
return predicate(it->second.meta);
}
return false;
});
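
The generic wait() above replaces wait_until_loading_finished() with a condition-variable wait whose predicate is evaluated against the model's entry while the mutex is held. A reduced sketch of the same shape, assuming a hypothetical registry type and status_t enum in place of server_models and server_model_meta:

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>

enum status_t { ST_LOADING, ST_LOADED };

// Hypothetical, reduced stand-in for the name -> instance mapping.
struct registry {
    std::mutex mutex;
    std::condition_variable cv;
    std::map<std::string, status_t> mapping;

    // Update an entry under the lock, then wake all waiters.
    void set(const std::string & name, status_t st) {
        {
            std::lock_guard<std::mutex> lk(mutex);
            mapping[name] = st;
        }
        cv.notify_all();
    }

    // Block until the entry exists and pred(status) is true.
    // pred runs while the mutex is held; returns the status that satisfied it.
    status_t wait(const std::string & name, const std::function<bool(status_t)> & pred) {
        std::unique_lock<std::mutex> lk(mutex);
        cv.wait(lk, [&]() {
            auto it = mapping.find(name);
            return it != mapping.end() && pred(it->second);
        });
        return mapping.at(name);
    }
};

// One thread finishes "loading" while the caller waits for the transition.
static status_t wait_demo() {
    registry reg;
    reg.set("m", ST_LOADING);
    std::thread t([&reg]() { reg.set("m", ST_LOADED); });
    status_t seen = reg.wait("m", [](status_t s) { return s != ST_LOADING; });
    t.join();
    return seen;
}
```

As in the patch, a name that never appears in the mapping keeps the predicate false, so a waiter on an unknown name blocks until some other thread adds it.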
@@ -1014,10 +1274,15 @@ bool server_models::ensure_model_ready(const std::string & name) {
// wait for loading to complete
SRV_INF("waiting until model name=%s is fully loaded...\n", name.c_str());
wait_until_loading_finished(name);
wait(name, [&meta](const server_model_meta & new_meta) {
if (new_meta.status != SERVER_MODEL_STATUS_LOADING) {
meta = new_meta; // update meta for final check after wait
return true;
}
return false;
});
// check final status
meta = get_meta(name);
if (!meta.has_value() || meta->is_failed()) {
throw std::runtime_error("model name=" + name + " failed to load");
}
@@ -1111,6 +1376,42 @@ void server_models::notify_router_sleeping_state(bool is_sleeping) {
// server_models_routes
//
// RAII wrapper similar to server_response_reader, but doesn't use server_queue
static std::atomic<int> sse_client_id_counter = 0;
struct server_models_sse_client {
server_response & queue_results;
int client_id;
server_models_sse_client(server_response & q)
: queue_results(q), client_id(sse_client_id_counter.fetch_add(1, std::memory_order_relaxed)) {
SRV_DBG("new SSE client connected, assigned client_id=%d\n", client_id);
queue_results.add_waiting_task_id(client_id);
}
~server_models_sse_client() {
SRV_DBG("SSE client disconnected, removing client_id=%d\n", client_id);
queue_results.remove_waiting_task_id(client_id);
}
// return nullptr if should_stop() is true before receiving a result
// note: if one error is received, it will stop further processing and return error result
server_task_result_ptr next(const std::function<bool()> & should_stop) {
while (true) {
static const int http_polling_seconds = 1; // check should_stop every 1 second
server_task_result_ptr result = queue_results.recv_with_timeout({client_id}, http_polling_seconds);
if (result == nullptr) {
// timeout, check stop condition
if (should_stop()) {
return nullptr;
}
// continue waiting otherwise
} else {
SRV_DBG("recv result for client_id=%d: %s\n", client_id, safe_json_to_str(result->to_json()).c_str());
return result;
}
}
// should not reach here
}
};
static void res_ok(std::unique_ptr<server_http_res> & res, const json & response_data) {
res->status = 200;
res->data = safe_json_to_str(response_data);
@@ -1274,7 +1575,9 @@ void server_models_routes::init_routes() {
{"created", t}, // for OAI-compat
{"status", status},
{"architecture", architecture},
{"need_download", meta.need_download},
{"source", server_model_source_to_string(meta.source)},
{"can_remove", meta.source == SERVER_MODEL_SOURCE_CACHE},
// {"need_download", meta.need_download},
// TODO: add other fields, may require reading GGUF metadata
};
@@ -1312,6 +1615,87 @@ void server_models_routes::init_routes() {
res_ok(res, {{"success", true}});
return res;
};
this->get_router_models_sse = [this](const server_http_req & req) {
auto res = std::make_unique<server_http_res>();
res->status = 200;
res->content_type = "text/event-stream";
auto sse_client = std::make_shared<server_models_sse_client>(models.sse);
res->next = [this, sse_client, &req](std::string & output) -> bool {
auto result = sse_client->next([&]() {
return stopping.load(std::memory_order_relaxed) || req.should_stop();
});
if (result == nullptr) {
return false; // client disconnected or should_stop
}
output = "data: " + safe_json_to_str(result->to_json()) + "\n\n";
return true; // listen for the next event
};
return res;
};
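
The handler above frames each event as a "data: " line followed by a blank line. A small sketch of that framing, assuming a hypothetical helper name sse_frame; the per-line prefixing for multi-line payloads follows the server-sent events format and is not something the patch does, since it emits single-line JSON:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Wrap a payload as one SSE event: every payload line gets a "data: " prefix,
// and an empty line terminates the event.
static std::string sse_frame(const std::string & payload) {
    std::string out;
    std::size_t start = 0;
    while (true) {
        std::size_t end = payload.find('\n', start);
        out += "data: ";
        out += payload.substr(start, end == std::string::npos ? std::string::npos : end - start);
        out += "\n";
        if (end == std::string::npos) {
            break;
        }
        start = end + 1;
    }
    out += "\n"; // blank line marks the end of the event
    return out;
}
```

For a compact JSON payload with no embedded newlines this produces the same bytes as the string concatenation in the handler.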
this->post_router_models = [this](const server_http_req & req) {
auto res = std::make_unique<server_http_res>();
json body = json::parse(req.body);
std::string name = json_value(body, "model", std::string());
if (name.empty()) {
throw std::invalid_argument("model must be a non-empty string");
}
common_params_model model;
common_download_opts opts;
model.name = name;
model.hf_repo = name;
opts.bearer_token = params.hf_token;
opts.download_mmproj = true;
opts.download_mtp = true;
// first, only check if the model is valid and can be downloaded
opts.skip_download = true;
bool ok = false;
try {
auto validation = common_download_model(model, opts);
ok = !validation.model_path.empty();
} catch (const common_skip_download_exception &) {
// model is valid and will be downloaded
ok = true;
} catch (...) {
SRV_ERR("unknown error while validating model '%s'\n", name.c_str());
// other exceptions will be handled by the outer ex_wrapper()
throw;
}
if (!ok) {
throw std::invalid_argument("model validation failed, unable to download");
}
// then, proceed with the actual download
opts.skip_download = false;
SRV_INF("starting download for model '%s'\n", name.c_str());
models.download(std::move(model), std::move(opts));
res_ok(res, {{"success", true}});
return res;
};
this->del_router_models = [this](const server_http_req & req) {
auto res = std::make_unique<server_http_res>();
std::string name = req.get_param("model");
if (name.empty()) {
throw std::invalid_argument("model must be a non-empty string");
}
bool ok = models.remove(name);
if (!ok) {
throw std::runtime_error("failed to remove model '" + name + "'");
}
res_ok(res, {{"success", true}});
return res;
};
}
+62 -27
@@ -1,9 +1,11 @@
#pragma once
#include "common.h"
#include "download.h"
#include "preset.h"
#include "server-common.h"
#include "server-http.h"
#include "server-queue.h"
#include <mutex>
#include <condition_variable>
@@ -14,6 +16,8 @@
/**
* state diagram:
*
* DOWNLOADING DOWNLOADED (replaced by new instance)
*
* UNLOADED LOADING LOADED SLEEPING
*
* failed
@@ -22,39 +26,43 @@
*/
enum server_model_status {
// TODO: also add downloading state when the logic is added
SERVER_MODEL_STATUS_DOWNLOADING,
SERVER_MODEL_STATUS_DOWNLOADED,
SERVER_MODEL_STATUS_UNLOADED,
SERVER_MODEL_STATUS_LOADING,
SERVER_MODEL_STATUS_LOADED,
SERVER_MODEL_STATUS_SLEEPING
};
static server_model_status server_model_status_from_string(const std::string & status_str) {
if (status_str == "unloaded") {
return SERVER_MODEL_STATUS_UNLOADED;
}
if (status_str == "loading") {
return SERVER_MODEL_STATUS_LOADING;
}
if (status_str == "loaded") {
return SERVER_MODEL_STATUS_LOADED;
}
if (status_str == "sleeping") {
return SERVER_MODEL_STATUS_SLEEPING;
}
throw std::runtime_error("invalid server model status");
}
enum server_model_source {
SERVER_MODEL_SOURCE_PRESET,
SERVER_MODEL_SOURCE_MODELS_DIR,
SERVER_MODEL_SOURCE_CACHE,
};
static std::string server_model_status_to_string(server_model_status status) {
switch (status) {
case SERVER_MODEL_STATUS_UNLOADED: return "unloaded";
case SERVER_MODEL_STATUS_LOADING: return "loading";
case SERVER_MODEL_STATUS_LOADED: return "loaded";
case SERVER_MODEL_STATUS_SLEEPING: return "sleeping";
default: return "unknown";
case SERVER_MODEL_STATUS_DOWNLOADING: return "downloading";
case SERVER_MODEL_STATUS_DOWNLOADED: return "downloaded";
case SERVER_MODEL_STATUS_UNLOADED: return "unloaded";
case SERVER_MODEL_STATUS_LOADING: return "loading";
case SERVER_MODEL_STATUS_LOADED: return "loaded";
case SERVER_MODEL_STATUS_SLEEPING: return "sleeping";
default: return "unknown";
}
}
static std::string server_model_source_to_string(server_model_source source) {
switch (source) {
case SERVER_MODEL_SOURCE_PRESET: return "preset";
case SERVER_MODEL_SOURCE_MODELS_DIR: return "models_dir";
case SERVER_MODEL_SOURCE_CACHE: return "cache";
default: return "unknown";
}
}
struct server_model_meta {
server_model_source source = SERVER_MODEL_SOURCE_CACHE;
common_preset preset;
std::string name;
std::set<std::string> aliases; // additional names that resolve to this model
@@ -63,11 +71,11 @@ struct server_model_meta {
server_model_status status = SERVER_MODEL_STATUS_UNLOADED;
int64_t last_used = 0; // for LRU unloading
std::vector<std::string> args; // args passed to the model instance, will be populated by render_args()
json loaded_info; // info to be reflected via /v1/models endpoint
json loaded_info; // info to be reflected via /v1/models endpoint ; if in DOWNLOADING state, it should contain download progress info
int exit_code = 0; // exit code of the model instance process (only valid if status == FAILED)
int stop_timeout = 0; // seconds to wait before force-killing the model instance during shutdown
mtmd_caps multimodal; // multimodal capabilities
bool need_download = false; // whether the model needs to be downloaded before loading
// bool need_download = false; // whether the model needs to be downloaded before loading // TODO @ngxson: implement this
bool is_ready() const {
return status == SERVER_MODEL_STATUS_LOADED;
@@ -85,12 +93,15 @@ struct server_model_meta {
void update_caps();
};
struct subprocess_s;
struct server_models_routes;
struct server_subproc; // defined in server-models.cpp
struct server_models {
friend struct server_models_routes;
private:
struct instance_t {
std::shared_ptr<subprocess_s> subproc; // shared between main thread and monitoring thread
std::shared_ptr<server_subproc> subproc; // shared between main thread and monitoring thread
std::thread th;
server_model_meta meta;
FILE * stdin_file = nullptr;
@@ -107,6 +118,9 @@ private:
// set to true while load_models() is executing a reload; load() will wait until clear
bool is_reloading = false;
// if true, the next get_meta() will trigger a reload of model list
bool need_reload = false;
common_preset_context ctx_preset;
common_params base_params;
@@ -122,9 +136,14 @@ private:
// not thread-safe, caller must hold mutex
void add_model(server_model_meta && meta);
// notify SSE clients
void notify_sse(const std::string & event, const std::string & model_id, const json & data = nullptr);
public:
server_models(const common_params & params, int argc, char ** argv);
server_response sse; // for real-time updates via SSE endpoint
// (re-)load the list of models from various sources and prepare the metadata mapping
// - if this is called the first time, simply populate the metadata
// - if this is called subsequently (e.g. when refreshing from disk):
@@ -147,13 +166,24 @@ public:
void unload(const std::string & name);
void unload_all();
// download a new model, progress is reported via SSE
// to stop the download, call unload()
void download(common_params_model && model, common_download_opts && opts);
// update the status of a model instance (thread-safe)
void update_status(const std::string & name, server_model_status status, int exit_code);
void update_loaded_info(const std::string & name, std::string & raw_info);
void update_download_progress(const std::string & name, const common_download_progress & progress, bool done, bool ok = true);
// remove a cache model from disk and update the list (thread-safe)
// note: only cache models can be removed; throws if the model doesn't exist or is not a cache model, returns false if removal from disk fails
bool remove(const std::string & name);
// wait until predicate returns true for the model's metadata (thread-safe)
// note: predicate is called while holding the lock
// return when the model no longer in "loading" state
void wait_until_loading_finished(const std::string & name);
void wait(const std::string & name, std::function<bool(const server_model_meta &)> predicate);
void wait(std::unique_lock<std::mutex> & lk, const std::string & name, std::function<bool(const server_model_meta &)> predicate);
// ensure the model is in ready state (thread-safe)
// return false if model is ready
@@ -176,8 +206,9 @@ public:
struct server_models_routes {
common_params params;
json ui_settings = json::object(); // Primary: new name
json webui_settings = json::object(); // Deprecated: use ui_settings (kept for compat)
json ui_settings = json::object(); // Primary: new name
json webui_settings = json::object(); // Deprecated: use ui_settings (kept for compat)
std::atomic<bool> stopping = false; // for graceful disconnecting SSE clients during shutdown
server_models models;
server_models_routes(const common_params & params, int argc, char ** argv)
: params(params), models(params, argc, argv) {
@@ -206,6 +237,10 @@ struct server_models_routes {
server_http_context::handler_t get_router_models;
server_http_context::handler_t post_router_models_load;
server_http_context::handler_t post_router_models_unload;
// management API
server_http_context::handler_t get_router_models_sse;
server_http_context::handler_t post_router_models;
server_http_context::handler_t del_router_models;
};
/**
+11
@@ -331,6 +331,17 @@ void server_response::send(server_task_result_ptr && result) {
}
}
void server_response::broadcast(server_task_result_ptr && result) {
std::unique_lock<std::mutex> lock(mutex_results);
for (const auto & id_task : waiting_task_ids) {
RES_DBG("task id = %d pushed to result queue\n", id_task);
server_task_result_ptr res_copy(result->clone());
res_copy->id = id_task; // override id with target task id
queue_results.emplace_back(std::move(res_copy));
}
condition_results.notify_all();
}
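
broadcast() above fans one result out to every registered waiter by cloning it and overriding the id. A single-threaded sketch of that fan-out, assuming hypothetical types (event, mini_broadcaster) and omitting the mutex and condition variable that the real server_response uses:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical event type; the real code clones a server_task_result.
struct event {
    int id = -1;
    std::string payload;
};

struct mini_broadcaster {
    std::unordered_set<int> waiting_ids;
    std::vector<std::unique_ptr<event>> queue;

    void add_waiting(int id) { waiting_ids.insert(id); }

    // Queue one private copy of ev per waiting id.
    void broadcast(const event & ev) {
        for (int id : waiting_ids) {
            auto copy = std::make_unique<event>(ev);
            copy->id = id; // override id with the target waiter
            queue.push_back(std::move(copy));
        }
    }

    // Remove and return the payloads addressed to a given id.
    std::vector<std::string> take(int id) {
        std::vector<std::string> out;
        for (auto it = queue.begin(); it != queue.end(); ) {
            if ((*it)->id == id) {
                out.push_back((*it)->payload);
                it = queue.erase(it);
            } else {
                ++it;
            }
        }
        return out;
    }
};
```

In this sketch a waiter registered after broadcast() receives nothing, because copies are made only for the ids present at the time of the call.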
void server_response::terminate() {
running = false;
condition_results.notify_all();
+5 -1
@@ -154,11 +154,15 @@ public:
// Send a new result to a waiting id_task
void send(server_task_result_ptr && result);
// broadcast a new result to all waiting tasks
// (used by router mode)
void broadcast(server_task_result_ptr && result);
// terminate the waiting loop
void terminate();
};
// utility class to make working with server_queue and server_response easier
// RAII wrapper to make working with server_queue and server_response easier
// it provides a generator-like API for server responses
// support pooling connection state and aggregating multiple results
struct server_response_reader {
+635
@@ -0,0 +1,635 @@
#include "server-schema.h"
#include "json-schema-to-grammar.h"
namespace server_schema {
//
// llama.cpp-specific completion schema
//
std::vector<std::unique_ptr<field>> make_llama_cmpl_schema(const common_params & params_base, task_params & params) {
std::vector<std::unique_ptr<field>> fields;
auto add = [&](field * f) {
fields.emplace_back(f);
};
add((new field_bool("timings_per_token", params.timings_per_token))
->set_desc("Include prompt processing and text generation speed information in each response"));
add((new field_bool("stream", params.stream))
->set_desc("Allows receiving each predicted token in real-time instead of waiting for the completion to finish"));
add((new field_nested("stream_options"))
->add_subfield((new field_bool("include_usage", params.include_usage))
->set_desc("Whether to include usage information in the stream"))
->set_desc("Additional options for streaming responses"));
add((new field_bool("cache_prompt", params.cache_prompt))
->set_desc("Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests"));
add((new field_bool("return_tokens", params.return_tokens))
->set_desc("Return the raw generated token ids in the `tokens` field"));
add((new field_bool("return_progress", params.return_progress))
->set_desc("Include prompt processing progress events in stream mode"));
add((new field_num("n_predict", params.n_predict))
->set_hard_limits(-1, INT32_MAX)
->add_alias("max_completion_tokens")
->add_alias("max_tokens")
->set_desc("Set the maximum number of tokens to predict. When 0, no tokens will be generated but the prompt is evaluated into the cache"));
add((new field_num("n_indent", params.n_indent))
->set_hard_limits(0, INT32_MAX)
->set_desc("Specify the minimum line indentation for the generated text in number of whitespace characters. Useful for code completion tasks"));
add((new field_num("n_keep", params.n_keep))
->set_hard_limits(-1, INT32_MAX)
->set_desc("Specify the number of tokens from the initial prompt to retain when context size is exceeded. Use -1 to retain all tokens from the prompt"));
add((new field_num("n_discard", params.n_discard))
->set_hard_limits(0, INT32_MAX)
->set_desc("Number of tokens after n_keep that may be discarded when shifting context (0 = half context)"));
add((new field_num("n_cmpl", params.n_cmpl))
->set_hard_limits(1, params_base.n_parallel)
->add_alias("n") // alias "n" as fallback (OpenAI completions API)
->set_desc("Number of completions to generate. If the input has multiple prompts, total outputs will be N prompts times n_cmpl"));
add((new field_num("n_cache_reuse", params.n_cache_reuse))
->set_hard_limits(0, INT32_MAX)
->set_desc("Min chunk size to attempt reusing from the cache via KV shifting. See --cache-reuse arg"));
// TODO: implement t_max_prompt_ms
// add((new field_num("t_max_prompt_ms", params.t_max_prompt_ms))
add((new field_num("t_max_predict_ms", params.t_max_predict_ms))
->set_hard_limits(-1, std::numeric_limits<int64_t>::max())
->set_desc("Set a time limit in milliseconds for the prediction phase. The timeout triggers if generation exceeds this time (measured since the first token) and a newline has been generated. Useful for FIM applications"));
add((new field_json("response_fields"))
->set_desc("A list of response fields to return. Missing fields are omitted without error. Fields with a slash are unnested (e.g. generation_settings/n_predict moves n_predict to the root)")
->set_handler([&](field_eval_context & ctx, const json & data) {
ctx.params.response_fields = json_value(data, "response_fields", std::vector<std::string>());
}));
//
// Sampling params
//
add((new field_num("top_k", params.sampling.top_k))
->set_limits(0, INT32_MAX)
->set_desc("Limit the next token selection to the K most probable tokens (0 = disabled)"));
add((new field_num("top_p", params.sampling.top_p))
->set_limits(0.0f, 1.0f)
->set_desc("Limit the next token selection to a subset of tokens with cumulative probability above threshold P (1.0 = disabled)"));
add((new field_num("min_p", params.sampling.min_p))
->set_limits(0.0f, 1.0f)
->set_desc("The minimum probability for a token to be considered, relative to the probability of the most likely token (0 = disabled)"));
add((new field_num("top_n_sigma", params.sampling.top_n_sigma))
->set_desc("Keep tokens within n standard deviations of the top token logit (< 0 = disabled)"));
add((new field_num("xtc_probability", params.sampling.xtc_probability))
->set_limits(0.0f, 1.0f)
->set_desc("Set the chance for token removal via XTC sampler (0 = disabled)"));
add((new field_num("xtc_threshold", params.sampling.xtc_threshold))
->set_limits(0.0f, 1.0f)
->set_desc("Set a minimum probability threshold for tokens to be removed via XTC sampler (> 0.5 disables XTC)"));
add((new field_num("typical_p", params.sampling.typ_p))
// ->set_limits(0.0f, 1.0f) // what's the valid range?
->set_desc("Enable locally typical sampling with parameter p (1.0 = disabled)"));
add((new field_num("temperature", params.sampling.temp))
->set_limits(0.0f, std::numeric_limits<float>::infinity())
->set_desc("Adjust the randomness of the generated text (0 = greedy)"));
add((new field_num("dynatemp_range", params.sampling.dynatemp_range))
->set_desc("Dynamic temperature range. The final temperature will be in [temperature - range, temperature + range] (0 = disabled)"));
add((new field_num("dynatemp_exponent", params.sampling.dynatemp_exponent))
->set_desc("Dynamic temperature exponent, controls how entropy maps to temperature"));
add((new field_num("repeat_last_n", params.sampling.penalty_last_n))
->set_hard_limits(-1, INT32_MAX)
->set_desc("Last n tokens to consider for penalizing repetition (0 = disabled, -1 = ctx-size)"));
add((new field_num("repeat_penalty", params.sampling.penalty_repeat))
->set_desc("Control the repetition of token sequences in the generated text (1.0 = disabled)"));
add((new field_num("frequency_penalty", params.sampling.penalty_freq))
->set_desc("Repeat alpha frequency penalty (0 = disabled)"));
add((new field_num("presence_penalty", params.sampling.penalty_present))
->set_desc("Repeat alpha presence penalty (0 = disabled)"));
add((new field_num("dry_multiplier", params.sampling.dry_multiplier))
->set_desc("Set the DRY (Don't Repeat Yourself) repetition penalty multiplier (0 = disabled)"));
add((new field_num("dry_base", params.sampling.dry_base))
->set_desc("Set the DRY repetition penalty base value (must be >= 1.0, any values < 1.0 will be replaced with the default value)")
->set_handler([&](field_eval_context & ctx, const json & data) {
float v = data.at("dry_base").get<float>();
ctx.params.sampling.dry_base = (v < 1.0f) ? params_base.sampling.dry_base : v;
}));
add((new field_num("dry_allowed_length", params.sampling.dry_allowed_length))
->set_hard_limits(0, INT32_MAX)
->set_desc("Tokens that extend repetition beyond this length receive exponentially increasing penalty: multiplier * base ^ (sequence_length - allowed_length)"));
add((new field_num("dry_penalty_last_n", params.sampling.dry_penalty_last_n))
->set_hard_limits(-1, INT32_MAX)
->set_desc("How many tokens to scan for repetitions (0 = disabled, -1 = context size)"));
add((new field_num("mirostat", params.sampling.mirostat))
->set_limits(0, 2)
->set_desc("Enable Mirostat sampling, controlling perplexity during text generation (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)"));
add((new field_num("mirostat_tau", params.sampling.mirostat_tau))
->set_desc("Set the Mirostat target entropy, parameter tau"));
add((new field_num("mirostat_eta", params.sampling.mirostat_eta))
->set_desc("Set the Mirostat learning rate, parameter eta"));
add((new field_num("adaptive_target", params.sampling.adaptive_target))
->set_limits(-std::numeric_limits<float>::max(), 1.0f)
->set_desc("Adaptive sampling target entropy (valid range 0.0 to 1.0; negative = disabled)"));
add((new field_num("adaptive_decay", params.sampling.adaptive_decay))
->set_hard_limits(0.0f, 0.99f)
->set_desc("EMA decay for adaptive sampling; history approximates 1/(1-decay) tokens"));
// seed is uint32_t; field_num deduces T from the bound member, so no custom handler is used here
add((new field_num("seed", params.sampling.seed))
->set_desc("Set the random number generator (RNG) seed (-1 = random)"));
add((new field_num("n_probs", params.sampling.n_probs))
->add_alias("logprobs") // use "logprobs" if "n_probs" wasn't provided
->set_desc("If greater than 0, output the probabilities of top N tokens for each generated token"));
add((new field_num("min_keep", params.sampling.min_keep))
->set_hard_limits(0, INT32_MAX)
->set_desc("If greater than 0, force samplers to return at least N possible tokens"));
add((new field_bool("backend_sampling", params.sampling.backend_sampling))
->set_desc("Use backend sampling instead of llama.cpp sampling"));
add((new field_bool("post_sampling_probs", params.post_sampling_probs))
->set_desc("Return probabilities of top n_probs tokens after applying the sampling chain"));
//
// Speculative decoding params
//
// TODO: to keep things simple, we disable speculative parameter adjustments for now
#if 0
// TODO: for now, be able to adjust only the draft-model based speculative parameters
add((new field_num("speculative.n_max", params.speculative.draft.n_max))
->set_hard_limits(0, INT32_MAX)
->set_desc("Maximum number of tokens to draft during speculative decoding"));
add((new field_num("speculative.n_min", params.speculative.draft.n_min))
->set_hard_limits(0, INT32_MAX)
->set_desc("Minimum number of draft tokens to use for speculative decoding"));
add((new field_num("speculative.p_min", params.speculative.draft.p_min))
->set_hard_limits(0.0f, 1.0f)
->set_desc("Minimum speculative decoding probability for draft tokens (0 = greedy)"));
add((new field_str("speculative.type"))
->set_desc("Speculative decoding method (for debugging and research purposes)")
->set_handler([&](field_eval_context & ctx, const json & data) {
ctx.params.speculative.types = { common_speculative_type_from_name(data.at("speculative.type").get<std::string>()) };
}));
add((new field_num("speculative.ngram_size_n", params.speculative.ngram_simple.size_n))
->set_desc("Ngram size for lookup in ngram-based speculative decoding"));
add((new field_num("speculative.ngram_size_m", params.speculative.ngram_simple.size_m))
->set_desc("Mgram size for speculative tokens in ngram-based speculative decoding"));
add((new field_num("speculative.ngram_min_hits", params.speculative.ngram_simple.min_hits))
->set_desc("Minimum hits at ngram lookup for mgram to be proposed"));
#endif
add((new field_json("lora"))
->set_desc("A list of LoRA adapters to apply to this request. Each entry must have `id` and `scale` fields. Adapters not listed default to scale 0.0")
->set_handler([&](field_eval_context & ctx, const json & data) {
const auto & lora = data.at("lora");
if (!lora.is_array()) {
throw std::runtime_error("Error: 'lora' must be an array of objects with 'id' and 'scale' fields");
}
ctx.params.lora = parse_lora_request(lora);
}));
// sequence breakers for DRY
// Currently, this is not compatible with TextGen WebUI, Koboldcpp and SillyTavern format
// Ref: https://github.com/oobabooga/text-generation-webui/blob/d1af7a41ade7bd3c3a463bfa640725edb818ebaf/extensions/openai/typing.py#L39
add((new field_json("dry_sequence_breakers"))
->set_desc("Specify an array of sequence breakers for DRY sampling. Only a JSON array of strings is accepted")
->set_handler([&](field_eval_context & ctx, const json & data) {
ctx.params.sampling.dry_sequence_breakers = json_value(data, "dry_sequence_breakers", std::vector<std::string>());
if (ctx.params.sampling.dry_sequence_breakers.empty()) {
throw std::runtime_error("Error: dry_sequence_breakers must be a non-empty array of strings");
}
}));
// handle both "json_schema" and "grammar"
add((new field_json("json_schema"))
->add_alias("grammar")
->set_desc("Set a JSON schema (json_schema) or GBNF grammar string (grammar) for constrained generation. grammar takes precedence if both are provided")
->set_handler([&](field_eval_context & ctx, const json & data) {
auto & params = ctx.params;
if (data.contains("json_schema") && !data.contains("grammar")) {
try {
auto schema = json_value(data, "json_schema", json::object());
SRV_DBG("JSON schema: %s\n", schema.dump(2).c_str());
std::string grammar_str = json_schema_to_grammar(schema);
SRV_DBG("Converted grammar: %s\n", grammar_str.c_str());
params.sampling.grammar = {COMMON_GRAMMAR_TYPE_OUTPUT_FORMAT, std::move(grammar_str)};
} catch (const std::exception & e) {
throw std::runtime_error(std::string("\"json_schema\": ") + e.what());
}
} else {
std::string grammar_str = json_value(data, "grammar", std::string());
if (!grammar_str.empty()) {
// grammar_type key is set by the server when converting chat template grammars
std::string grammar_type = json_value(data, "grammar_type", std::string());
if (grammar_type == "tool_calls") {
params.sampling.grammar = {COMMON_GRAMMAR_TYPE_TOOL_CALLS, std::move(grammar_str)};
} else {
// explicit grammar from the user (API field "grammar")
params.sampling.grammar = {COMMON_GRAMMAR_TYPE_USER, std::move(grammar_str)};
}
SRV_DBG("Grammar (%s): %s\n", grammar_type.c_str(), common_grammar_value(params.sampling.grammar).c_str());
}
}
}));
add((new field_bool("grammar_lazy", params.sampling.grammar_lazy))
->set_desc("Whether to apply grammar constraints lazily, only when triggered (instead of at every step)"));
//
// Chat parser params
//
// TODO: change this to string field instead
add((new field_json("chat_format"))
->set_desc("Chat format used internally by the server")
->set_handler([&](field_eval_context & ctx, const json & data) {
ctx.params.chat_parser_params.format = static_cast<common_chat_format>(data.at("chat_format").get<int>());
SRV_INF("Chat format: %s\n", common_chat_format_name(ctx.params.chat_parser_params.format));
}));
add((new field_str("reasoning_format"))
->set_desc("Reasoning format for chain-of-thought models")
->set_handler([&](field_eval_context & ctx, const json & data) {
auto reasoning_format = common_reasoning_format_from_name(data.at("reasoning_format").get<std::string>());
ctx.params.chat_parser_params.reasoning_format = reasoning_format;
ctx.params.chat_parser_params.reasoning_in_content = ctx.params.stream && (reasoning_format == COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY);
}));
add((new field_str("generation_prompt"))
->set_desc("Generation prompt appended to the chat template output")
->set_handler([&](field_eval_context & ctx, const json & data) {
std::string s = data.at("generation_prompt").get<std::string>();
ctx.params.chat_parser_params.generation_prompt = s;
ctx.params.sampling.generation_prompt = s;
}));
add((new field_bool("parse_tool_calls", params.chat_parser_params.parse_tool_calls))
->set_desc("Whether to parse tool calls from the generated output"));
add((new field_str("chat_parser"))
->set_desc("Chat parser configuration string")
->set_handler([&](field_eval_context & ctx, const json & data) {
ctx.params.chat_parser_params.parser.load(data.at("chat_parser").get<std::string>());
}));
add((new field_json("continue_final_message"))
->set_desc("Whether to continue the final message of the chat template")
->set_handler([&](field_eval_context & ctx, const json & data) {
auto continuation = common_chat_continuation_parse(data.at("continue_final_message"));
ctx.params.chat_parser_params.is_continuation = continuation != COMMON_CHAT_CONTINUATION_NONE;
}));
add((new field_bool("echo", params.chat_parser_params.echo))
->set_desc("Whether to echo the input tokens in the output"));
//
// Token-level fields (require vocab)
//
add((new field_json("preserved_tokens"))
->set_desc("List of token strings that must not be split during tokenization")
->set_handler([&](field_eval_context & ctx, const json & data) {
GGML_ASSERT(ctx.vocab != nullptr);
for (const auto & t : data.at("preserved_tokens")) {
auto ids = common_tokenize(ctx.vocab, t.get<std::string>(), false, true);
if (ids.size() == 1) {
ctx.params.sampling.preserved_tokens.insert(ids[0]);
}
}
}));
add((new field_json("grammar_triggers"))
->set_desc("List of strings or patterns that trigger grammar-constrained generation")
->set_handler([&](field_eval_context & ctx, const json & data) {
GGML_ASSERT(ctx.vocab != nullptr);
for (const auto & t : data.at("grammar_triggers")) {
server_grammar_trigger ct(t);
if (ct.value.type == COMMON_GRAMMAR_TRIGGER_TYPE_WORD) {
const auto & word = ct.value.value;
auto ids = common_tokenize(ctx.vocab, word, false, true);
if (ids.size() == 1) {
auto token = ids[0];
if (std::find(ctx.params.sampling.preserved_tokens.begin(), ctx.params.sampling.preserved_tokens.end(), (llama_token) token) == ctx.params.sampling.preserved_tokens.end()) {
throw std::runtime_error("Grammar trigger word should be marked as preserved token: " + word);
}
common_grammar_trigger trigger;
trigger.type = COMMON_GRAMMAR_TRIGGER_TYPE_TOKEN;
trigger.value = word;
trigger.token = token;
ctx.params.sampling.grammar_triggers.push_back(std::move(trigger));
} else {
ctx.params.sampling.grammar_triggers.push_back({COMMON_GRAMMAR_TRIGGER_TYPE_WORD, word});
}
} else {
ctx.params.sampling.grammar_triggers.emplace_back(std::move(ct.value));
}
}
if (ctx.params.sampling.grammar_lazy && ctx.params.sampling.grammar_triggers.empty()) {
throw std::runtime_error("Error: no triggers set for lazy grammar!");
}
}));
add((new field_bool("reasoning_control", params.sampling.reasoning_control))
->set_desc("Create the budget sampler on demand so reasoning can be ended at runtime"));
add((new field_num("reasoning_budget_tokens", params.sampling.reasoning_budget_tokens))
->set_hard_limits(-1, INT32_MAX)
->set_desc("Number of tokens in the reasoning budget (-1 = disabled)"));
add((new field_str("reasoning_budget_start_tag"))
->set_desc("Token string marking the start of the reasoning budget section")
->set_handler([&](field_eval_context & ctx, const json & data) {
GGML_ASSERT(ctx.vocab != nullptr);
ctx.params.sampling.reasoning_budget_start = common_tokenize(ctx.vocab, data.at("reasoning_budget_start_tag").get<std::string>(), false, true);
}));
add((new field_str("reasoning_budget_end_tag"))
->set_desc("Token string marking the end of the reasoning budget section")
->set_handler([&](field_eval_context & ctx, const json & data) {
GGML_ASSERT(ctx.vocab != nullptr);
std::string end_tag = data.at("reasoning_budget_end_tag").get<std::string>();
ctx.params.sampling.reasoning_budget_end = common_tokenize(ctx.vocab, end_tag, false, true);
}));
add((new field_str("reasoning_budget_message"))
->set_desc("Message to prepend to the reasoning budget end tag when forcing it")
->set_handler([&](field_eval_context & ctx, const json & data) {
GGML_ASSERT(ctx.vocab != nullptr);
std::string end_tag = json_value(data, "reasoning_budget_end_tag", std::string());
std::string message = data.at("reasoning_budget_message").get<std::string>();
ctx.params.sampling.reasoning_budget_forced = common_tokenize(ctx.vocab, message + end_tag, false, true);
}));
add((new field_json("logit_bias"))
->set_desc("Modify the likelihood of specific tokens. Accepts an array of [token, bias] pairs or an object mapping token to bias. Use false as bias to ban a token")
->set_handler([&](field_eval_context & ctx, const json & data) {
GGML_ASSERT(ctx.vocab != nullptr);
ctx.params.sampling.logit_bias.clear();
const auto & logit_bias = data.at("logit_bias");
const int n_vocab = llama_vocab_n_tokens(ctx.vocab);
auto parse_bias = [](const json & v, float & bias) -> bool {
if (v.is_number()) { bias = v.get<float>(); return true; }
if (v.is_boolean() && !v.get<bool>()) { bias = -INFINITY; return true; }
return false;
};
if (logit_bias.is_array()) {
for (const auto & el : logit_bias) {
if (!el.is_array() || el.size() != 2) continue;
float bias;
if (!parse_bias(el[1], bias)) continue;
if (el[0].is_number_integer()) {
llama_token tok = el[0].get<llama_token>();
if (tok >= 0 && tok < n_vocab) ctx.params.sampling.logit_bias.push_back({tok, bias});
} else if (el[0].is_string()) {
for (auto tok : common_tokenize(ctx.vocab, el[0].get<std::string>(), false))
ctx.params.sampling.logit_bias.push_back({tok, bias});
}
}
} else if (logit_bias.is_object()) {
for (const auto & el : logit_bias.items()) {
float bias;
if (!parse_bias(el.value(), bias)) continue;
char * end;
llama_token tok = strtol(el.key().c_str(), &end, 10);
if (*end == 0) {
if (tok >= 0 && tok < n_vocab) ctx.params.sampling.logit_bias.push_back({tok, bias});
} else {
for (auto t : common_tokenize(ctx.vocab, el.key(), false))
ctx.params.sampling.logit_bias.push_back({t, bias});
}
}
}
}));
add((new field_bool("ignore_eos", params.sampling.ignore_eos))
->set_desc("Ignore the end-of-sequence token and continue generating")
->set_handler([&](field_eval_context & ctx, const json & data) {
GGML_ASSERT(ctx.logit_bias_eog != nullptr);
ctx.params.sampling.ignore_eos = data.at("ignore_eos").get<bool>();
if (ctx.params.sampling.ignore_eos && ctx.logit_bias_eog) {
ctx.params.sampling.logit_bias.insert(
ctx.params.sampling.logit_bias.end(),
ctx.logit_bias_eog->begin(), ctx.logit_bias_eog->end());
}
}));
add((new field_json("stop"))
->set_desc("Specify stopping strings. Generation stops when one is produced, and the string is not included in the output")
->set_handler([&](field_eval_context & ctx, const json & data) {
ctx.params.antiprompt.clear();
const auto & stop = data.at("stop");
if (stop.is_array()) {
for (const auto & word : stop) {
if (!word.empty()) ctx.params.antiprompt.push_back(word);
}
} else if (stop.is_string()) {
ctx.params.antiprompt.push_back(stop.get<std::string>());
}
// fall back to CLI defaults if the request provided no effective stop strings
if (ctx.params.antiprompt.empty()) {
ctx.params.antiprompt = params_base.antiprompt;
}
}));
add((new field_json("samplers"))
->set_desc("The order in which samplers are applied. An array of sampler type names, or a single string of sampler chars")
->set_handler([&](field_eval_context & ctx, const json & data) {
const auto & samplers = data.at("samplers");
if (samplers.is_array()) {
ctx.params.sampling.samplers = common_sampler_types_from_names(samplers);
} else if (samplers.is_string()) {
ctx.params.sampling.samplers = common_sampler_types_from_chars(samplers.get<std::string>());
}
}));
return fields;
}
task_params eval_llama_cmpl_schema(
const llama_vocab * vocab,
const common_params & params_base,
const int n_ctx_slot,
const std::vector<llama_logit_bias> & logit_bias_eog,
const json & data) {
task_params params;
// Sampling parameter defaults are loaded from the global server context (but individual requests can still override them)
params.sampling = params_base.sampling;
params.speculative = params_base.speculative;
params.n_keep = params_base.n_keep;
params.n_predict = params_base.n_predict;
params.n_cache_reuse = params_base.n_cache_reuse;
params.cache_prompt = params_base.cache_prompt;
params.antiprompt = params_base.antiprompt;
// enabling this will output extra debug information in the HTTP responses from the server
params.verbose = params_base.verbosity > 9;
params.chat_parser_params.reasoning_format = params_base.reasoning_format;
// create context and schema
field_eval_context ctx(params);
ctx.vocab = vocab;
ctx.logit_bias_eog = &logit_bias_eog;
auto schema = make_llama_cmpl_schema(params_base, params);
// eval all fields in the schema
for (const auto & f : schema) {
f->eval(ctx, data);
}
// post-processing
{
if (params.sampling.penalty_last_n == -1) {
// note: should be the slot's context and not the full context, but it's ok
params.sampling.penalty_last_n = n_ctx_slot;
}
if (params.sampling.dry_penalty_last_n == -1) {
params.sampling.dry_penalty_last_n = n_ctx_slot;
}
// if "reasoning_format" is not provided, its handler is not called, so reasoning_in_content is derived here
auto reasoning_format = params.chat_parser_params.reasoning_format;
params.chat_parser_params.reasoning_in_content = params.stream && (reasoning_format == COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY);
}
// debugging
{
auto budget = params.sampling.reasoning_budget_tokens;
SRV_DBG("reasoning budget: tokens=%d, generation_prompt='%s', start=%zu toks, end=%zu toks, forced=%zu toks\n",
budget, params.sampling.generation_prompt.c_str(),
params.sampling.reasoning_budget_start.size(),
params.sampling.reasoning_budget_end.size(),
params.sampling.reasoning_budget_forced.size());
}
return params;
}
//
// eval() implementations
//
static void handle_with_catch(const char * name, std::function<void()> func) {
try {
func();
} catch (const std::exception & e) {
throw std::invalid_argument(string_format("Field '%s': %s", name, e.what()));
}
}
template <typename T>
void field_num<T>::eval(field_eval_context & ctx, const json & data) {
for (const auto & n : name) {
if (data.contains(n)) {
handle_with_catch(n, [&]() {
if (custom_handler) {
custom_handler(ctx, data);
} else if (!is_hard_limit) {
val = std::max(min, std::min(max, data.at(n).template get<T>()));
} else {
T tmp = data.at(n).template get<T>();
if (tmp < min || tmp > max) {
throw std::invalid_argument(std::string("Value must be between ") + std::to_string(min) + " <= value <= " + std::to_string(max) + ", but got " + std::to_string(tmp));
}
val = tmp;
}
});
return;
}
}
}
void field_str::eval(field_eval_context & ctx, const json & data) {
GGML_ASSERT(custom_handler);
for (const auto & n : name) {
if (data.contains(n)) {
handle_with_catch(n, [&]() {
custom_handler(ctx, data);
});
return;
}
}
}
void field_bool::eval(field_eval_context & ctx, const json & data) {
for (const auto & n : name) {
if (data.contains(n)) {
handle_with_catch(n, [&]() {
if (custom_handler) {
custom_handler(ctx, data);
} else {
val = data.at(n).get<bool>();
}
});
return;
}
}
}
void field_json::eval(field_eval_context & ctx, const json & data) {
GGML_ASSERT(custom_handler);
for (const auto & n : name) {
if (data.contains(n)) {
handle_with_catch(n, [&]() {
custom_handler(ctx, data);
});
return;
}
}
}
void field_nested::eval(field_eval_context & ctx, const json & data) {
for (const auto & n : name) {
if (data.contains(n) && data.at(n).is_object()) {
for (auto & f : subfields) {
f->eval(ctx, data.at(n));
}
return;
}
}
}
} // namespace server_schema
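The two range policies in field_num<T>::eval above (clamping for set_limits, rejection for set_hard_limits) can be tried in isolation. The following is a minimal sketch, not part of the patch: apply_limits and apply_limits_examples are hypothetical names introduced only for illustration, and the sketch skips the JSON lookup and custom-handler path.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-alone helper (not in the patch): applies the same two
// policies as field_num<T>::eval to a single, already-parsed value.
template <typename T>
T apply_limits(T value, T min, T max, bool is_hard_limit) {
    if (!is_hard_limit) {
        // soft limits: out-of-range values are clamped into [min, max]
        return std::max(min, std::min(max, value));
    }
    // hard limits: out-of-range values are rejected with an exception
    if (value < min || value > max) {
        throw std::invalid_argument(
            "Value must be between " + std::to_string(min) + " <= value <= " +
            std::to_string(max) + ", but got " + std::to_string(value));
    }
    return value;
}

// Hypothetical usage: a soft-limited value is clamped, a hard-limited one throws.
inline void apply_limits_examples() {
    assert(apply_limits(1.5f, 0.0f, 1.0f, false) == 1.0f); // clamped to max
    assert(apply_limits(5, -1, 10, true) == 5);            // in range, kept
    bool threw = false;
    try {
        apply_limits(-2, -1, 10, true);                    // below min, rejected
    } catch (const std::invalid_argument &) {
        threw = true;
    }
    assert(threw);
}
```

Under this reading, a request with min_p = 1.5 is silently clamped to 1.0, while repeat_last_n = -2 fails the request with an error that handle_with_catch prefixes with the field name.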
+105
@@ -0,0 +1,105 @@
#pragma once
#include "server-common.h"
#include "server-task.h"
#include "sampling.h"
#include "speculative.h"
#include <climits>
#include <functional>
#include <limits>
#include <memory>
#include <string>
#include <vector>
namespace server_schema {
struct field_eval_context {
task_params & params;
const llama_vocab * vocab = nullptr;
const std::vector<llama_logit_bias> * logit_bias_eog = nullptr;
field_eval_context(task_params & params) : params(params) {}
};
using field_handler = std::function<void(field_eval_context &, const json &)>;
struct field {
std::vector<const char *> name;
const char * desc = "";
field_handler custom_handler;
field() = default;
field(const char * n) : name({n}) {}
virtual ~field() = default;
field * set_desc(const char * s) {
desc = s;
return this;
}
// if 'name' is present, use it, otherwise look for aliases following the order they were added
field * add_alias(const char * n) {
name.push_back(n);
return this;
}
field * set_handler(field_handler h) { this->custom_handler = h; return this; }
virtual void eval(field_eval_context & ctx, const json & data) = 0;
};
template <typename T = int32_t>
struct field_num : public field {
T & val;
T min = std::numeric_limits<T>::lowest();
T max = std::numeric_limits<T>::max();
bool is_hard_limit = false; // if true, throw error if the value is invalid
field_num(const char * n, T & val) : field(n), val(val) {}
// limits are inclusive, min <= value <= max
field_num * set_limits(T min, T max) {
this->min = min;
this->max = max;
return this;
}
field_num * set_hard_limits(T min, T max) {
set_limits(min, max);
is_hard_limit = true;
return this;
}
virtual void eval(field_eval_context & ctx, const json & data) override;
};
struct field_str : public field {
field_str(const char * n) : field(n) {}
virtual void eval(field_eval_context & ctx, const json & data) override;
};
struct field_bool : public field {
bool & val;
field_bool(const char * n, bool & val) : field(n), val(val) {}
virtual void eval(field_eval_context & ctx, const json & data) override;
};
struct field_json : public field {
field_json(const char * n) : field(n) {}
virtual void eval(field_eval_context & ctx, const json & data) override;
};
struct field_nested : public field {
std::vector<std::unique_ptr<field>> subfields;
field_nested(const char * n) : field(n) {}
field_nested * add_subfield(field * f) {
subfields.emplace_back(std::unique_ptr<field>(f));
return this;
}
virtual void eval(field_eval_context & ctx, const json & data) override;
};
std::vector<std::unique_ptr<field>> make_llama_cmpl_schema(
const common_params & params_base,
task_params & params);
task_params eval_llama_cmpl_schema(
const llama_vocab * vocab,
const common_params & params_base,
const int n_ctx_slot,
const std::vector<llama_logit_bias> & logit_bias_eog,
const json & data);
} // namespace server_schema
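The alias rule documented on add_alias in the header above (use the primary name if present, otherwise the first alias in insertion order) is what each eval() loop implements. A minimal sketch of that lookup follows, assuming a plain std::map in place of the request JSON; resolve_first_present is a hypothetical name, not part of the patch.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not in the patch): scans the names in order, primary
// name first and aliases after it, and stops at the first one present in the
// request. Returns false when none of the names is present, leaving `out`
// untouched so the caller's default value survives.
inline bool resolve_first_present(const std::vector<std::string> & names,
                                  const std::map<std::string, int> & data,
                                  int & out) {
    for (const auto & n : names) {
        auto it = data.find(n);
        if (it != data.end()) {
            out = it->second;
            return true;
        }
    }
    return false;
}
```

With names {"n_probs", "logprobs"}, a request carrying both keys resolves to the n_probs value, and a request carrying only logprobs falls back to the alias. Fields with a custom handler differ in one respect: the handler receives the whole request object and decides for itself which key to read.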
-388
@@ -232,396 +232,8 @@ common_chat_msg task_result_state::update_chat_msg(
return chat_msg;
}
//
// server_task
//
task_params server_task::params_from_json_cmpl(
const llama_vocab * vocab,
const common_params & params_base,
const int n_ctx_slot,
const std::vector<llama_logit_bias> & logit_bias_eog,
const json & data) {
task_params params;
// Sampling parameter defaults are loaded from the global server context (but individual requests can still override them)
task_params defaults;
defaults.sampling = params_base.sampling;
defaults.speculative = params_base.speculative;
defaults.n_keep = params_base.n_keep;
defaults.n_predict = params_base.n_predict;
defaults.n_cache_reuse = params_base.n_cache_reuse;
defaults.cache_prompt = params_base.cache_prompt;
defaults.antiprompt = params_base.antiprompt;
// enabling this will output extra debug information in the HTTP responses from the server
params.verbose = params_base.verbosity > 9;
params.timings_per_token = json_value(data, "timings_per_token", false);
params.stream = json_value(data, "stream", false);
auto stream_opt = json_value(data, "stream_options", json::object());
params.include_usage = json_value(stream_opt, "include_usage", false);
params.cache_prompt = json_value(data, "cache_prompt", defaults.cache_prompt);
params.return_tokens = json_value(data, "return_tokens", false);
params.return_progress = json_value(data, "return_progress", false);
auto max_tokens = json_value(data, "max_tokens", defaults.n_predict);
params.n_predict = json_value(data, "n_predict", json_value(data, "max_completion_tokens", max_tokens));
params.n_indent = json_value(data, "n_indent", defaults.n_indent);
params.n_keep = json_value(data, "n_keep", defaults.n_keep);
params.n_discard = json_value(data, "n_discard", defaults.n_discard);
params.n_discard = std::max(0, params.n_discard);
params.n_cmpl = json_value(data, "n_cmpl", json_value(data, "n", 1));
params.n_cache_reuse = json_value(data, "n_cache_reuse", defaults.n_cache_reuse);
//params.t_max_prompt_ms = json_value(data, "t_max_prompt_ms", defaults.t_max_prompt_ms); // TODO: implement
params.t_max_predict_ms = json_value(data, "t_max_predict_ms", defaults.t_max_predict_ms);
params.response_fields = json_value(data, "response_fields", std::vector<std::string>());
params.sampling.top_k = json_value(data, "top_k", defaults.sampling.top_k);
params.sampling.top_p = json_value(data, "top_p", defaults.sampling.top_p);
params.sampling.min_p = json_value(data, "min_p", defaults.sampling.min_p);
params.sampling.top_n_sigma = json_value(data, "top_n_sigma", defaults.sampling.top_n_sigma);
params.sampling.xtc_probability = json_value(data, "xtc_probability", defaults.sampling.xtc_probability);
params.sampling.xtc_threshold = json_value(data, "xtc_threshold", defaults.sampling.xtc_threshold);
params.sampling.typ_p = json_value(data, "typical_p", defaults.sampling.typ_p);
params.sampling.temp = json_value(data, "temperature", defaults.sampling.temp);
params.sampling.dynatemp_range = json_value(data, "dynatemp_range", defaults.sampling.dynatemp_range);
params.sampling.dynatemp_exponent = json_value(data, "dynatemp_exponent", defaults.sampling.dynatemp_exponent);
params.sampling.penalty_last_n = json_value(data, "repeat_last_n", defaults.sampling.penalty_last_n);
params.sampling.penalty_repeat = json_value(data, "repeat_penalty", defaults.sampling.penalty_repeat);
params.sampling.penalty_freq = json_value(data, "frequency_penalty", defaults.sampling.penalty_freq);
params.sampling.penalty_present = json_value(data, "presence_penalty", defaults.sampling.penalty_present);
params.sampling.dry_multiplier = json_value(data, "dry_multiplier", defaults.sampling.dry_multiplier);
params.sampling.dry_base = json_value(data, "dry_base", defaults.sampling.dry_base);
params.sampling.dry_allowed_length = json_value(data, "dry_allowed_length", defaults.sampling.dry_allowed_length);
params.sampling.dry_penalty_last_n = json_value(data, "dry_penalty_last_n", defaults.sampling.dry_penalty_last_n);
params.sampling.mirostat = json_value(data, "mirostat", defaults.sampling.mirostat);
params.sampling.mirostat_tau = json_value(data, "mirostat_tau", defaults.sampling.mirostat_tau);
params.sampling.mirostat_eta = json_value(data, "mirostat_eta", defaults.sampling.mirostat_eta);
params.sampling.adaptive_target = json_value(data, "adaptive_target", defaults.sampling.adaptive_target);
params.sampling.adaptive_decay = json_value(data, "adaptive_decay", defaults.sampling.adaptive_decay);
params.sampling.seed = json_value(data, "seed", defaults.sampling.seed);
params.sampling.n_probs = json_value(data, "n_probs", defaults.sampling.n_probs);
params.sampling.min_keep = json_value(data, "min_keep", defaults.sampling.min_keep);
params.sampling.backend_sampling = json_value(data, "backend_sampling", defaults.sampling.backend_sampling);
params.post_sampling_probs = json_value(data, "post_sampling_probs", defaults.post_sampling_probs);
params.speculative = defaults.speculative;
// TODO: to keep things simple, we disable speculative parameter adjustments for now
#if 0
// TODO: for now, be able to adjust only the draft-model based speculative parameters
params.speculative.draft.n_min = json_value(data, "speculative.n_min", defaults.speculative.draft.n_min);
params.speculative.draft.n_max = json_value(data, "speculative.n_max", defaults.speculative.draft.n_max);
params.speculative.draft.p_min = json_value(data, "speculative.p_min", defaults.speculative.draft.p_min);
params.speculative.draft.n_min = std::min(params.speculative.draft.n_max, params.speculative.draft.n_min);
params.speculative.draft.n_min = std::max(params.speculative.draft.n_min, 0);
params.speculative.draft.n_max = std::max(params.speculative.draft.n_max, 0);
// for debugging and research purposes
params.speculative.type = common_speculative_type_from_name(json_value(data, "speculative.type", common_speculative_type_to_str(defaults.speculative.type)));
params.speculative.ngram_size_n = json_value(data, "speculative.ngram_size_n", defaults.speculative.ngram_size_n);
params.speculative.ngram_size_m = json_value(data, "speculative.ngram_size_m", defaults.speculative.ngram_size_m);
params.speculative.ngram_min_hits = json_value(data, "speculative.ngram_m_hits", defaults.speculative.ngram_min_hits);
params.speculative.ngram_size_n = std::max(std::min(1, (int) params.speculative.ngram_size_n), 1024);
params.speculative.ngram_size_m = std::max(std::min(1, (int) params.speculative.ngram_size_m), 1024);
params.speculative.ngram_min_hits = std::max(std::min(1, (int) params.speculative.ngram_min_hits), 1024);
#endif
// Use OpenAI API logprobs only if n_probs wasn't provided
if (data.contains("logprobs") && params.sampling.n_probs == defaults.sampling.n_probs){
params.sampling.n_probs = json_value(data, "logprobs", defaults.sampling.n_probs);
}
if (data.contains("lora")) {
if (data.at("lora").is_array()) {
params.lora = parse_lora_request(data.at("lora"));
} else {
throw std::runtime_error("Error: 'lora' must be an array of objects with 'id' and 'scale' fields");
}
} else {
params.lora = {};
}
// TODO: add more sanity checks for the input parameters
if (params.sampling.penalty_last_n < -1) {
throw std::runtime_error("Error: repeat_last_n must be >= -1");
}
if (params.sampling.dry_penalty_last_n < -1) {
throw std::runtime_error("Error: dry_penalty_last_n must be >= -1");
}
if (params.sampling.penalty_last_n == -1) {
// note: should be the slot's context and not the full context, but it's ok
params.sampling.penalty_last_n = n_ctx_slot;
}
if (params.sampling.dry_penalty_last_n == -1) {
params.sampling.dry_penalty_last_n = n_ctx_slot;
}
if (params.sampling.dry_base < 1.0f) {
params.sampling.dry_base = defaults.sampling.dry_base;
}
// sequence breakers for DRY
{
// Currently, this is not compatible with TextGen WebUI, Koboldcpp and SillyTavern format
// Ref: https://github.com/oobabooga/text-generation-webui/blob/d1af7a41ade7bd3c3a463bfa640725edb818ebaf/extensions/openai/typing.py#L39
if (data.contains("dry_sequence_breakers")) {
params.sampling.dry_sequence_breakers = json_value(data, "dry_sequence_breakers", std::vector<std::string>());
if (params.sampling.dry_sequence_breakers.empty()) {
throw std::runtime_error("Error: dry_sequence_breakers must be a non-empty array of strings");
}
}
}
// process "json_schema" and "grammar"
if (data.contains("json_schema") && !data.contains("grammar")) {
try {
auto schema = json_value(data, "json_schema", json::object());
SRV_DBG("JSON schema: %s\n", schema.dump(2).c_str());
std::string grammar_str = json_schema_to_grammar(schema);
SRV_DBG("Converted grammar: %s\n", grammar_str.c_str());
params.sampling.grammar = {COMMON_GRAMMAR_TYPE_OUTPUT_FORMAT, std::move(grammar_str)};
} catch (const std::exception & e) {
throw std::runtime_error(std::string("\"json_schema\": ") + e.what());
}
} else {
params.sampling.grammar = defaults.sampling.grammar;
std::string grammar_str = json_value(data, "grammar", std::string());
if (!grammar_str.empty()) {
// grammar_type key is set by the server when converting chat template grammars
std::string grammar_type = json_value(data, "grammar_type", std::string());
if (grammar_type == "tool_calls") {
params.sampling.grammar = {COMMON_GRAMMAR_TYPE_TOOL_CALLS, std::move(grammar_str)};
} else {
// explicit grammar from the user (API field "grammar")
params.sampling.grammar = {COMMON_GRAMMAR_TYPE_USER, std::move(grammar_str)};
}
SRV_DBG("Grammar (%s): %s\n", grammar_type.c_str(), common_grammar_value(params.sampling.grammar).c_str());
}
params.sampling.grammar_lazy = json_value(data, "grammar_lazy", defaults.sampling.grammar_lazy);
SRV_DBG("Grammar lazy: %s\n", params.sampling.grammar_lazy ? "true" : "false");
}
{
auto it = data.find("chat_format");
if (it != data.end()) {
params.chat_parser_params.format = static_cast<common_chat_format>(it->get<int>());
SRV_INF("Chat format: %s\n", common_chat_format_name(params.chat_parser_params.format));
} else {
params.chat_parser_params.format = defaults.chat_parser_params.format;
}
common_reasoning_format reasoning_format = params_base.reasoning_format;
if (data.contains("reasoning_format")) {
reasoning_format = common_reasoning_format_from_name(data.at("reasoning_format").get<std::string>());
}
params.chat_parser_params.reasoning_format = reasoning_format;
params.chat_parser_params.reasoning_in_content = params.stream && (reasoning_format == COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY);
params.chat_parser_params.generation_prompt = json_value(data, "generation_prompt", std::string());
params.sampling.generation_prompt = params.chat_parser_params.generation_prompt;
SRV_DBG("Generation prompt: '%s'\n", params.chat_parser_params.generation_prompt.c_str());
params.chat_parser_params.parse_tool_calls = json_value(data, "parse_tool_calls", false);
if (data.contains("chat_parser")) {
params.chat_parser_params.parser.load(data.at("chat_parser").get<std::string>());
}
if (data.contains("continue_final_message")) {
auto continuation = common_chat_continuation_parse(data.at("continue_final_message"));
params.chat_parser_params.is_continuation = continuation != COMMON_CHAT_CONTINUATION_NONE;
}
params.chat_parser_params.echo = json_value(data, "echo", false);
}
{
const auto preserved_tokens = data.find("preserved_tokens");
if (preserved_tokens != data.end()) {
for (const auto & t : *preserved_tokens) {
auto ids = common_tokenize(vocab, t.get<std::string>(), /* add_special= */ false, /* parse_special= */ true);
if (ids.size() == 1) {
SRV_DBG("Preserved token: %d\n", ids[0]);
params.sampling.preserved_tokens.insert(ids[0]);
} else {
// This may happen when using a tool call style meant for a model with special tokens to preserve on a model without said tokens.
SRV_DBG("Not preserved because more than 1 token: %s\n", t.get<std::string>().c_str());
}
}
}
const auto grammar_triggers = data.find("grammar_triggers");
if (grammar_triggers != data.end()) {
for (const auto & t : *grammar_triggers) {
server_grammar_trigger ct(t);
if (ct.value.type == COMMON_GRAMMAR_TRIGGER_TYPE_WORD) {
const auto & word = ct.value.value;
auto ids = common_tokenize(vocab, word, /* add_special= */ false, /* parse_special= */ true);
if (ids.size() == 1) {
auto token = ids[0];
if (std::find(params.sampling.preserved_tokens.begin(), params.sampling.preserved_tokens.end(), (llama_token) token) == params.sampling.preserved_tokens.end()) {
throw std::runtime_error("Grammar trigger word should be marked as preserved token: " + word);
}
SRV_DBG("Grammar trigger token: %d (`%s`)\n", token, word.c_str());
common_grammar_trigger trigger;
trigger.type = COMMON_GRAMMAR_TRIGGER_TYPE_TOKEN;
trigger.value = word;
trigger.token = token;
params.sampling.grammar_triggers.push_back(std::move(trigger));
} else {
SRV_DBG("Grammar trigger word: `%s`\n", word.c_str());
params.sampling.grammar_triggers.push_back({COMMON_GRAMMAR_TRIGGER_TYPE_WORD, word});
}
} else {
if (ct.value.type == COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN) {
SRV_DBG("Grammar trigger pattern: `%s`\n", ct.value.value.c_str());
} else if (ct.value.type == COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL) {
SRV_DBG("Grammar trigger pattern full: `%s`\n", ct.value.value.c_str());
} else {
throw std::runtime_error("Unknown grammar trigger type");
}
params.sampling.grammar_triggers.emplace_back(std::move(ct.value));
}
}
}
if (params.sampling.grammar_lazy && params.sampling.grammar_triggers.empty()) {
throw std::runtime_error("Error: no triggers set for lazy grammar!");
}
}
// Parse reasoning budget sampler parameters
{
const int32_t budget = json_value(data, "reasoning_budget_tokens", (int32_t) -1);
const auto start_tag = json_value(data, "reasoning_budget_start_tag", std::string());
const auto end_tag = json_value(data, "reasoning_budget_end_tag", std::string());
const auto message = json_value(data, "reasoning_budget_message", std::string());
params.sampling.reasoning_budget_tokens = budget;
params.sampling.reasoning_control = json_value(data, "reasoning_control", false);
if (!start_tag.empty()) {
params.sampling.reasoning_budget_start = common_tokenize(vocab, start_tag, false, true);
}
if (!end_tag.empty()) {
params.sampling.reasoning_budget_end = common_tokenize(vocab, end_tag, false, true);
params.sampling.reasoning_budget_forced = common_tokenize(vocab, message + end_tag, false, true);
SRV_DBG("reasoning budget: tokens=%d, generation_prompt='%s', start=%zu toks, end=%zu toks, forced=%zu toks\n",
budget, params.sampling.generation_prompt.c_str(),
params.sampling.reasoning_budget_start.size(),
params.sampling.reasoning_budget_end.size(),
params.sampling.reasoning_budget_forced.size());
}
}
{
params.sampling.logit_bias.clear();
const auto & logit_bias = data.find("logit_bias");
if (logit_bias != data.end() && logit_bias->is_array()) {
const int n_vocab = llama_vocab_n_tokens(vocab);
for (const auto & el : *logit_bias) {
// TODO: we may want to throw errors here, in case "el" is incorrect
if (el.is_array() && el.size() == 2) {
float bias;
if (el[1].is_number()) {
bias = el[1].get<float>();
} else if (el[1].is_boolean() && !el[1].get<bool>()) {
bias = -INFINITY;
} else {
continue;
}
if (el[0].is_number_integer()) {
llama_token tok = el[0].get<llama_token>();
if (tok >= 0 && tok < n_vocab) {
params.sampling.logit_bias.push_back({tok, bias});
}
} else if (el[0].is_string()) {
auto toks = common_tokenize(vocab, el[0].get<std::string>(), false);
for (auto tok : toks) {
params.sampling.logit_bias.push_back({tok, bias});
}
}
}
}
} else if (logit_bias != data.end() && logit_bias->is_object()) {
const int n_vocab = llama_vocab_n_tokens(vocab);
for (const auto & el : logit_bias->items()) {
float bias;
const auto & key = el.key();
const auto & value = el.value();
if (value.is_number()) {
bias = value.get<float>();
} else if (value.is_boolean() && !value.get<bool>()) {
bias = -INFINITY;
} else {
continue;
}
char *end;
llama_token tok = strtol(key.c_str(), &end, 10);
if (*end == 0) {
if (tok >= 0 && tok < n_vocab) {
params.sampling.logit_bias.push_back({tok, bias});
}
} else {
auto toks = common_tokenize(vocab, key, false);
for (auto tok : toks) {
params.sampling.logit_bias.push_back({tok, bias});
}
}
}
}
params.sampling.ignore_eos = json_value(data, "ignore_eos", params_base.sampling.ignore_eos);
if (params.sampling.ignore_eos) {
params.sampling.logit_bias.insert(
params.sampling.logit_bias.end(),
logit_bias_eog.begin(), logit_bias_eog.end());
}
}
{
params.antiprompt.clear();
const auto & stop = data.find("stop");
if (stop != data.end() && stop->is_array()) {
for (const auto & word : *stop) {
if (!word.empty()) {
params.antiprompt.push_back(word);
}
}
}
// set reverse prompt from cli args if not set in the request
if (params.antiprompt.empty()) {
params.antiprompt = defaults.antiprompt;
}
}
{
const auto samplers = data.find("samplers");
if (samplers != data.end()) {
if (samplers->is_array()) {
params.sampling.samplers = common_sampler_types_from_names(*samplers);
} else if (samplers->is_string()){
params.sampling.samplers = common_sampler_types_from_chars(samplers->get<std::string>());
}
} else {
params.sampling.samplers = defaults.sampling.samplers;
}
}
if (params.n_cmpl > params_base.n_parallel) {
throw std::runtime_error("n_cmpl cannot be greater than the number of slots, please increase -np");
}
return params;
}
//
// result_timings
//
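Reviewer note: the `logit_bias` handling in the hunk above accepts either an array of `[token, bias]` pairs or an object keyed by token id or text. The sketch below mirrors that branching in Python so the accepted shapes are easier to see. It is an approximation under stated assumptions, not the server's code: `N_VOCAB` and `toy_tokenize` are hypothetical stand-ins for the real vocabulary and tokenizer, and the regular expression only approximates the `strtol` check (for example, it does not skip leading whitespace).

```python
import math
import re

N_VOCAB = 8  # hypothetical vocabulary size, for illustration only


def toy_tokenize(text):
    """Hypothetical tokenizer stand-in: one token id per character."""
    return [ord(ch) % N_VOCAB for ch in text]


def _bias_of(value):
    """Return the bias for a JSON value, or None if the entry is skipped."""
    # bool is checked first because bool is a subclass of int in Python
    if isinstance(value, bool):
        return -math.inf if value is False else None
    if isinstance(value, (int, float)):
        return float(value)
    return None


def normalize_logit_bias(logit_bias, n_vocab=N_VOCAB, tokenize=toy_tokenize):
    """Flatten either accepted shape into a list of (token, bias) pairs."""
    out = []
    if isinstance(logit_bias, list):
        for el in logit_bias:
            if not (isinstance(el, list) and len(el) == 2):
                continue
            bias = _bias_of(el[1])
            if bias is None:
                continue
            key = el[0]
            if isinstance(key, int) and not isinstance(key, bool):
                # integer ids are range-checked against the vocabulary
                if 0 <= key < n_vocab:
                    out.append((key, bias))
            elif isinstance(key, str):
                # strings are tokenized; every resulting token gets the bias
                out.extend((tok, bias) for tok in tokenize(key))
    elif isinstance(logit_bias, dict):
        for key, value in logit_bias.items():
            bias = _bias_of(value)
            if bias is None:
                continue
            if re.fullmatch(r"[+-]?\d+", key):
                tok = int(key)
                if 0 <= tok < n_vocab:
                    out.append((tok, bias))
            else:
                out.extend((tok, bias) for tok in tokenize(key))
    return out
```

In both shapes a bias of `false` maps to negative infinity, `true` is skipped, and out-of-range integer ids are dropped silently rather than rejected.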
+12 -7
@@ -210,13 +210,6 @@ struct server_task {
}
}
static task_params params_from_json_cmpl(
const llama_vocab * vocab,
const common_params & params_base,
const int n_ctx_slot,
const std::vector<llama_logit_bias> & logit_bias_eog,
const json & data);
// utility function
static std::unordered_set<int> get_list_id(const std::vector<server_task> & tasks) {
std::unordered_set<int> ids(tasks.size());
@@ -312,6 +305,9 @@ struct server_task_result {
}
virtual json to_json() = 0;
virtual ~server_task_result() = default;
virtual server_task_result * clone() const {
GGML_ABORT("not implemented for this task type");
}
};
// using shared_ptr for polymorphism of server_task_result
@@ -649,3 +645,12 @@ struct server_prompt_cache {
void update();
};
// used exclusively by router mode
struct server_task_result_router : server_task_result {
json data;
virtual json to_json() override { return data; }
virtual server_task_result * clone() const override {
return new server_task_result_router(*this);
}
};
+14
@@ -174,8 +174,11 @@ int llama_server(int argc, char ** argv) {
routes.get_props = models_routes->get_router_props;
routes.get_models = models_routes->get_router_models;
ctx_http.post("/models", ex_wrapper(models_routes->post_router_models));
ctx_http.post("/models/load", ex_wrapper(models_routes->post_router_models_load));
ctx_http.post("/models/unload", ex_wrapper(models_routes->post_router_models_unload));
ctx_http.get ("/models/sse", ex_wrapper(models_routes->get_router_models_sse));
ctx_http.del ("/models", ex_wrapper(models_routes->del_router_models));
}
ctx_http.get ("/health", ex_wrapper(routes.get_health)); // public endpoint (no API key check)
@@ -261,6 +264,7 @@ int llama_server(int argc, char ** argv) {
clean_up = [&models_routes]() {
SRV_INF("%s: cleaning up before exit...\n", __func__);
if (models_routes.has_value()) {
models_routes->stopping.store(true); // maybe redundant, but just to be safe
models_routes->models.unload_all();
}
llama_backend_free();
@@ -274,6 +278,10 @@ int llama_server(int argc, char ** argv) {
ctx_http.is_ready.store(true);
shutdown_handler = [&](int) {
if (models_routes.has_value()) {
// important to disconnect any SSE clients
models_routes->stopping.store(true);
}
ctx_http.stop();
};
@@ -341,6 +349,12 @@ int llama_server(int argc, char ** argv) {
SRV_INF("router server is listening on %s\n", ctx_http.listening_address.c_str());
SRV_WRN("%s", "NOTE: router mode is experimental\n");
SRV_WRN("%s", " it is not recommended to use this mode in untrusted environments\n");
if (!params.models_preset_hf.empty()) {
SRV_WRN( "NOTE: using preset.ini from HF repo '%s'\n", params.models_preset_hf.c_str());
SRV_WRN("%s", " please only use presets that you can trust! Unknown presets may be unsafe\n");
}
if (ctx_http.thread.joinable()) {
ctx_http.thread.join(); // keep the main thread alive
}
@@ -307,6 +307,20 @@ def test_completion_with_grammar(jinja: bool, grammar: str, n_predicted: int, re
assert match_regex(re_content, choice["message"]["content"]), choice["message"]["content"]
def test_completion_with_invalid_grammar():
global server
server.start()
res = server.make_request("POST", "/chat/completions", data={
"max_tokens": 8,
"messages": [
{"role": "user", "content": "Does not matter what I say, does it?"},
],
"grammar": "root ::= this is (not valid GBNF",
})
assert res.status_code == 400, res.body
assert "error" in res.body
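Reviewer note: for context on which request field ends up driving the grammar, here is a small Python sketch of the precedence applied in `params_from_json_cmpl` as shown in this diff. `json_schema` is only used when `grammar` is absent; otherwise a non-empty `grammar` is used, tagged by `grammar_type`. The function name and the returned labels are hypothetical and exist only for illustration; missing keys are treated as empty strings here, which simplifies the server's `json_value` fallback.

```python
def select_grammar_source(data):
    """Return a label for the grammar source a request body would select."""
    # json_schema only applies when no "grammar" key is present at all
    if "json_schema" in data and "grammar" not in data:
        return "json_schema"
    grammar = data.get("grammar", "")
    if not grammar:
        # an empty or missing grammar keeps the server defaults
        return "default"
    # grammar_type is set by the server for chat-template grammars
    if data.get("grammar_type", "") == "tool_calls":
        return "tool_calls"
    return "user"
```

Under this reading, the request in `test_completion_with_invalid_grammar` would take the `"user"` path, so the 400 response is expected to come from parsing the user-supplied grammar text.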
@pytest.mark.parametrize("messages", [
None,
"string",
+96
@@ -1,3 +1,4 @@
import threading
import pytest
from utils import *
@@ -253,3 +254,98 @@ def test_router_reload_models():
assert "model-reload-c" in ids, "newly added model should appear"
finally:
os.remove(preset_path)
MODEL_DOWNLOAD_ID = "ggml-org/test-model-router-download:F16"
MODEL_DOWNLOAD_TIMEOUT = 300
def _listen_sse(server: ServerProcess, collected: list, stop: threading.Event):
"""Collect /models/sse events into `collected` until `stop` is set."""
url = f"http://{server.server_host}:{server.server_port}/models/sse"
try:
with requests.get(url, stream=True, timeout=MODEL_DOWNLOAD_TIMEOUT) as resp:
for line_bytes in resp.iter_lines():
if stop.is_set():
break
line = line_bytes.decode("utf-8")
if line.startswith("data: "):
collected.append(json.loads(line[6:]))
except Exception:
pass
def _wait_for_sse_event(collected: list, event_type: str, model: str, timeout: int) -> bool:
deadline = time.time() + timeout
while time.time() < deadline:
if any(e.get("event") == event_type and e.get("model") == model for e in collected):
return True
time.sleep(0.5)
return False
def test_router_download_model():
"""Case 1: download a model, verify SSE events and GET /models."""
global server
server.start()
# Ensure the model is not present before we start
server.make_request("DELETE", f"/models?model={MODEL_DOWNLOAD_ID}")
sse_events: list = []
stop = threading.Event()
sse_thread = threading.Thread(
target=_listen_sse, args=(server, sse_events, stop), daemon=True
)
sse_thread.start()
# Trigger the download
res = server.make_request("POST", "/models", data={"model": MODEL_DOWNLOAD_ID})
assert res.status_code == 200
assert res.body.get("success") is True
# Wait for download_finished SSE event
finished = _wait_for_sse_event(
sse_events, "download_finished", MODEL_DOWNLOAD_ID, MODEL_DOWNLOAD_TIMEOUT
)
stop.set()
assert finished, "Never received download_finished SSE event"
assert any(
e.get("event") == "download_progress" and e.get("model") == MODEL_DOWNLOAD_ID
for e in sse_events
), "No download_progress events received"
# Model should now appear in GET /models
ids = _get_model_ids(is_reload=False)
assert MODEL_DOWNLOAD_ID in ids, f"{MODEL_DOWNLOAD_ID} not found in /models after download"
def test_router_delete_model():
"""Case 2: delete the downloaded model, verify it disappears from GET /models."""
global server
server.start()
# Ensure the model exists (download it if needed)
if MODEL_DOWNLOAD_ID not in _get_model_ids(is_reload=False):
res = server.make_request("POST", "/models", data={"model": MODEL_DOWNLOAD_ID})
assert res.status_code == 200
sse_events: list = []
stop = threading.Event()
threading.Thread(
target=_listen_sse, args=(server, sse_events, stop), daemon=True
).start()
finished = _wait_for_sse_event(
sse_events, "download_finished", MODEL_DOWNLOAD_ID, MODEL_DOWNLOAD_TIMEOUT
)
stop.set()
assert finished, "Model did not finish downloading before delete test"
# Delete the model
del_res = server.make_request("DELETE", f"/models?model={MODEL_DOWNLOAD_ID}")
assert del_res.status_code == 200
assert del_res.body.get("success") is True
# Model should no longer appear in GET /models
ids = _get_model_ids(is_reload=False)
assert MODEL_DOWNLOAD_ID not in ids, f"{MODEL_DOWNLOAD_ID} still present after deletion"
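Reviewer note: the listener and wait helpers in the tests above can be exercised without a running server. The sketch below is a minimal offline version under stated assumptions: `fake_stream` is a hypothetical in-memory stand-in for the `/models/sse` response body, the model name `"m"` is made up, and `time.monotonic()` is swapped in for `time.time()` so the deadline is unaffected by wall-clock adjustments.

```python
import json
import threading
import time


def parse_sse_lines(lines):
    """Yield decoded JSON payloads from an iterable of raw SSE lines (bytes)."""
    for raw in lines:
        line = raw.decode("utf-8")
        # only "data: " lines carry a payload; blank and comment lines are skipped
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])


def wait_for_event(collected, event_type, model, timeout, poll=0.01):
    """Poll `collected` until a matching event appears or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if any(e.get("event") == event_type and e.get("model") == model
               for e in collected):
            return True
        time.sleep(poll)
    return False


# Hypothetical in-memory stand-in for a streamed response body
fake_stream = [
    b'data: {"event": "download_progress", "model": "m"}',
    b"",
    b": keep-alive comment",
    b'data: {"event": "download_finished", "model": "m"}',
]

events = []
worker = threading.Thread(
    target=lambda: events.extend(parse_sse_lines(fake_stream))
)
worker.start()
found = wait_for_event(events, "download_finished", "m", timeout=2.0)
worker.join()
```

Separating the parsing from the network call keeps the `data: ` prefix handling testable on its own, which may help if the event format changes later.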
+3
@@ -340,6 +340,9 @@ class ServerProcess:
elif method == "POST":
response = requests.post(url, headers=headers, json=data, timeout=timeout)
parse_body = True
elif method == "DELETE":
response = requests.delete(url, headers=headers, timeout=timeout)
parse_body = True
elif method == "OPTIONS":
response = requests.options(url, headers=headers, timeout=timeout)
else:

Some files were not shown because too many files have changed in this diff