Compare commits

...

19 Commits

Author SHA1 Message Date
shaofeiqi 5c7c22c3e1 opencl: flush profiling batch at shutdown for incomplete batches (#25016) 2026-06-25 18:48:24 -07:00
Sigbjørn Skjæret beac5309f1 xcframework : disable mtmd video on i/tv/visionos (#25018) 2026-06-26 00:13:59 +02:00
Tarek Dakhran 9d5d882d8c model : Add label for LFM2.5-230M (#25008) 2026-06-25 18:58:52 +02:00
Oliver Simons 1ec44d178d CUDA: Various fixes to cpy.cu (#25000)
* Add failing test-case to test-backend-ops

Extracted from https://github.com/ggml-org/llama.cpp/issues/24072

* Minimize repro with help of AI

N = 8 * (65535 - 1) + 1 = 524273

* Port and adjust workaround from https://github.com/LostRuins/koboldcpp/commit/0ba798341e0c70517cb226cb63c966b086a3b5b3

Fall-back should share code, also relax y-z constraint to be inclusive

* Add test-case + fallback also for y dim

* Fix x-guards which is 2^{31}-1, so inlusive of INT_MAX

* Fix overflow problems for transposed copy kernel
2026-06-25 17:29:23 +02:00
Xuan-Son Nguyen c7cddefcbd misc: fix labeler (#25012) 2026-06-25 17:23:37 +02:00
Xuan-Son Nguyen e9d1b76d0a server: use status code 403 for disabled features (#24970)
* server: use status code 403 for disabled features

* cont

* fix test case
2026-06-25 16:36:40 +02:00
Xuan-Son Nguyen 099bf06952 misc: update lables (#24920)
* misc: update lables

* bring back examples, add mtmd
2026-06-25 16:26:56 +02:00
Xuan-Son Nguyen 60bc8866b1 common: refactor model handling (#24980)
* common: refactor models handling

* remote preset

* cont

* rm skip_download option

* missing header

* fix plan.model_files

* fix --offline case

* move hf_plan to download

* refactor

* rm redundant curr_ex, add comments

* adapt
2026-06-25 15:17:51 +02:00
Kashif Rasul e8ecce53b8 docs : Eagle3 qwen3 draft model support (#24977)
* eagle3: accept Eagle3LlamaForCausalLM draft checkpoints

* docs: add eagle3 speculative decoding section

* docs: address eagle3 review comments

* docs: add more angelslim eagle3 models

* docs: add gpt-oss eagle3 models and link to pr 18039
2026-06-25 15:58:00 +03:00
Adrien Gallouët 683b04cc4a app : add the llama download subcommand (#24982)
* app : add the download command (with llama-download)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove llama-download tool for now

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-06-25 13:36:36 +02:00
fairydreaming f728adab68 ggml : address integer overflows in binary ops CUDA implementation (#24706)
* ggml : address integer overflows in binary ops CUDA implementation

* ggml : add size_t casts to avoid integer overflows

* ggml : add more asserts checking integer overflows in binary ops CUDA implementation

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-06-25 10:06:44 +02:00
Pascal 3e61ea0e2f ui: fix always-show-sidebar-on-desktop setting after navigation refactor (#24979) 2026-06-25 09:45:55 +02:00
Christopher Albert fdbd6abee2 tests : synchronize contexts at end of test-thread-safety (#24935)
Assisted-by: Claude
2026-06-25 09:22:51 +03:00
Abraham Gonzalez e12a0128ab build: include libmtmd in Apple XCFramework (#21935)
Adds an opt-in LLAMA_BUILD_MTMD CMake option so build-xcframework.sh
can link libmtmd.a into the framework binary without pulling in the
rest of tools/ (which doesn't cross-build cleanly to iOS/tvOS/visionOS).

- CMakeLists.txt: new option, default OFF. When on with
  LLAMA_BUILD_TOOLS=OFF, only the tools/mtmd subdir is added. Useful
  for any binding that wants just libmtmd (Apple XCFramework, WASM).
- tools/mtmd/CMakeLists.txt: gate the CLI exe targets on
  LLAMA_BUILD_TOOLS. Gating on LLAMA_BUILD_COMMON is not enough — it
  defaults ON in standalone builds and visionOS xcodebuild then fails
  with "install TARGETS given no BUNDLE DESTINATION for MACOSX_BUNDLE
  executable target 'llama-mtmd-cli'".
- build-xcframework.sh: turn the option on, pass -DLLAMA_BUILD_MTMD,
  add libmtmd.a to combine_static_libraries, and copy mtmd.h and
  mtmd-helper.h into the framework Headers dir. The umbrella module
  map then exposes them, so Swift / Obj-C consumers can import the
  mtmd C API directly.

After this, nm on ios-arm64/llama.framework/llama shows 52 _mtmd_
symbols. Verified end-to-end: a Swift target links the produced
framework and calls mtmd_default_marker, mtmd_bitmap_init, etc.
without a shim on macos / iphoneos / iphonesimulator / xros slices.

Co-authored-by: Abraham Gonzalez <abraham@theabecaster.com>
2026-06-25 08:37:30 +03:00
Sigbjørn Skjæret b3ce5cedf4 quant : fix quantizing moe with mtp (#24986) 2026-06-25 08:36:49 +03:00
David Spruill e9fb3b3fc0 sycl : support --split-mode tensor (#24152)
* Sycl tp stage1 (#1)

* SYCL: tensor parallelism (--split-mode tensor) for dual-GPU

Adds the comm_init/comm_free/comm_allreduce_tensor trio that the
meta-backend queries via get_proc_address to enable backend-specific
all-reduce, mirroring the pattern used by ggml-cuda.cu.

For N=2 (the common dual-GPU case) implements a degenerate ring
all-reduce with two size-branched paths:

  * Small (nelem < 32768): FP32 direct memcpy + per-device ADD kernel
    chained via depends_on(memcpy_event). 4 SYCL submissions/call.

  * Large (nelem >= 32768): BF16-compressed. Each device compresses
    FP32 -> BF16 in a local outbox, cross-device memcpys to the peer's
    inbox (HALF the PCIe bytes), then decompresses + adds into the
    local FP32 partial. 6 SYCL submissions/call but PCIe bytes halved
    -- wins for any tensor where PCIe dominates kernel time.

Threshold and BF16 path pattern mirror the CUDA NCCL allreduce.

Storage: ONE persistent uint8_t buffer per device, 4 * nelem bytes
(matches both path layouts: FP32 nelem floats; BF16 outbox+inbox =
2 * nelem uint16_t each). Single alloc+free per device keeps the
SYCL pool's strict-LIFO invariant trivial.

Initial impl handles N=2 FP32 contiguous tensors. Other cases return
false, causing the meta-backend to use its generic butterfly fallback.

Per-call sync is intentionally omitted. SYCL in-order queue semantics
ensure that the meta-backend's next compute on the same per-device
queue waits for our final ADD, and the next allreduce's first op on
the same persistent buffer waits via the same queue. Only comm_free
does an explicit final wait.

OneCCL is NOT used: OneCCL 2021.17 hardcodes single-device-per-process
in communicator_impl.hpp:47 (condition devices.size() == 1), which is
incompatible with llama.cpp's single-process multi-GPU model.

Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 +
DPC++ nightly):

  Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16:
    pp512 = 377.08 t/s  (vs 313.65 layer mode = +20.2%)
    tg128 = 17.40 t/s   (vs   9.74 layer mode = +78.6%)

  Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE):
    pp512 = 216.56 t/s  (vs 156.58 meta-backend butterfly = +38.3%)
    tg128 = 17.60 t/s   (vs  14.31 meta-backend butterfly = +23.0%)

  Qwen3-4B Q4_K_M:
    pp64  = 984.51 t/s, tg16 = 49.29 t/s

Llama-3.3-70B in SYCL TP now comfortably beats production layer mode
on both prefill and decode. Coder-Next-80B-A3B (MoE) also wins on
both — the BF16 path is what unlocks the many-medium-allreduces
prefill pattern.

Build/CMake: no changes. No new dependencies. ~210 lines added across
ggml-sycl.h and ggml-sycl.cpp.

* Fix comments

* documentation update to address PR feedback

* Bring over my device-to-device memcpy chagnes

* move the dev2dev_memcpy calls to the upstream 7-parameter variety

* Fix a typo and remove a trailing whitespace
2026-06-25 08:35:21 +03:00
Neo Zhang 9c10954865 sycl : fix the failed UT cases of conv_3d (#24900) 2026-06-25 08:27:58 +03:00
lhez fdb2c11c70 opencl: support non-contig rows in norm (#24965) 2026-06-24 19:21:25 -07:00
Piotr Wilkin (ilintar) 09cedfd699 chat: harden caps check (#24973) 2026-06-25 02:49:22 +02:00
38 changed files with 993 additions and 492 deletions
+26 -19
View File
@@ -35,8 +35,20 @@ AMD ZenDNN:
documentation:
- changed-files:
- any-glob-to-any-file:
- "**/*.md"
- docs/**
- media/**
examples:
- all:
- changed-files:
- any-glob-to-any-file:
- app/**
- examples/**
- tools/**
- all-globs-to-all-files:
- '!tools/server/**'
- '!tools/mtmd/**'
- '!tools/ui/**'
testing:
- changed-files:
- any-glob-to-any-file:
@@ -47,28 +59,12 @@ build:
- cmake/**
- CMakeLists.txt
- CMakePresets.json
examples:
- changed-files:
- any-glob-to-any-file:
- examples/**
- tools/**
devops:
- changed-files:
- any-glob-to-any-file:
- .devops/**
- .github/**
- ci/**
python:
- changed-files:
- any-glob-to-any-file:
- "**/*.py"
- requirements/**
- gguf-py/**
- .flake8
script:
- changed-files:
- any-glob-to-any-file:
- scripts/**
android:
- changed-files:
- any-glob-to-any-file:
@@ -81,9 +77,20 @@ server:
- changed-files:
- any-glob-to-any-file:
- tools/server/**
mtmd:
- changed-files:
- any-glob-to-any-file:
- tools/mtmd/**
conversion:
- changed-files:
- any-glob-to-any-file:
- conversion/**
- convert_*.py
- gguf-py/**
vendor:
- changed-files:
- any-glob-to-any-file:
- vendor/**
ggml:
- changed-files:
- any-glob-to-any-file:
+10
View File
@@ -222,6 +222,16 @@ if (LLAMA_BUILD_APP)
add_subdirectory(app)
endif()
# Standalone libmtmd build without pulling in the rest of the tools/ tree.
# Useful when packaging just the mtmd library for language bindings (e.g. an
# Apple XCFramework, or a WASM build). When the full tools build is enabled,
# mtmd is already built by the tools/ subdirectory above; this hook only fires
# when LLAMA_BUILD_TOOLS is OFF to avoid double-adding the target.
option(LLAMA_BUILD_MTMD "llama: build tools/mtmd library standalone" OFF)
if (LLAMA_BUILD_MTMD AND NOT (LLAMA_BUILD_COMMON AND LLAMA_BUILD_TOOLS))
add_subdirectory(tools/mtmd)
endif()
#
# install
#
+1 -1
View File
@@ -1,6 +1,6 @@
set(TARGET llama-app)
add_executable(${TARGET} llama.cpp)
add_executable(${TARGET} llama.cpp download.cpp)
set_target_properties(${TARGET} PROPERTIES OUTPUT_NAME llama)
target_link_libraries(${TARGET} PRIVATE
+71
View File
@@ -0,0 +1,71 @@
#include "arg.h"
#include "common.h"
#include "download.h"
#include "log.h"
#include <cstdio>
#include <filesystem>
static void print_usage(int /*argc*/, char ** argv) {
printf(
"\nexamples:\n"
" %s -hf ggml-org/gemma-3-4b-it-qat-GGUF\n"
" %s -hf ggml-org/gemma-3-4b-it-qat-GGUF:Q4_K_M\n"
" %s -hf ggml-org/models -hff model.gguf\n"
" %s -mu https://example.com/model.gguf -m model.gguf\n"
"\n",
argv[0], argv[0], argv[0], argv[0]
);
}
int llama_download(int argc, char ** argv);
int llama_download(int argc, char ** argv) {
common_init();
common_params params;
params.verbosity = LOG_LEVEL_ERROR;
if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_DOWNLOAD, print_usage)) {
return 1;
}
const bool has_source = !params.model.hf_repo.empty() || !params.model.url.empty() ||
!params.model.path.empty() || !params.model.docker_repo.empty();
if (!has_source) {
fprintf(stderr, "error: no model source specified (use --hf-repo, --model-url, --model or --docker-repo)\n");
return 1;
}
try {
common_models_handler handler = common_models_handler_init(params, LLAMA_EXAMPLE_DOWNLOAD);
common_models_handler_apply(handler, params);
} catch (const std::exception & e) {
fprintf(stderr, "error: %s\n", e.what());
return 1;
}
if (!params.models_preset.empty()) {
// -hf pointed at a preset repo: print the preset path and stop
printf("%s\n", params.models_preset.c_str());
return 0;
}
if (params.model.path.empty()) {
fprintf(stderr, "error: model download failed\n");
return 1;
}
if (!std::filesystem::exists(params.model.path)) {
fprintf(stderr, "error: model file does not exist: %s\n", params.model.path.c_str());
return 1;
}
printf("%s\n", params.model.path.c_str());
if (!params.mmproj.path.empty()) {
printf("%s\n", params.mmproj.path.c_str());
}
if (!params.speculative.draft.mparams.path.empty()) {
printf("%s\n", params.speculative.draft.mparams.path.c_str());
}
return 0;
}
+2
View File
@@ -19,6 +19,7 @@ int llama_batched_bench(int argc, char ** argv);
int llama_fit_params(int argc, char ** argv);
int llama_quantize(int argc, char ** argv);
int llama_perplexity(int argc, char ** argv);
int llama_download(int argc, char ** argv);
// Self-update is only supported for binaries built with llama-install.sh
static int llama_update(int argc, char ** argv) {
@@ -61,6 +62,7 @@ static const command cmds[] = {
{"serve", "HTTP API server", {"server"}, false, llama_server },
{"cli", "Command-line interactive interface", {"client"}, false, llama_cli },
{"update", "Update llama to the latest release", {}, UPDATE_HIDDEN, llama_update },
{"download", "Download a model", {"get"}, false, llama_download },
{"completion", "Text completion", {"complete"}, true, llama_completion },
{"bench", "Benchmark prompt processing and text generation", {}, true, llama_bench },
{"batched-bench", "Benchmark batched decoding performance", {}, true, llama_batched_bench},
+11
View File
@@ -13,6 +13,7 @@ LLAMA_BUILD_EXAMPLES=OFF
LLAMA_BUILD_TOOLS=OFF
LLAMA_BUILD_TESTS=OFF
LLAMA_BUILD_SERVER=OFF
LLAMA_BUILD_MTMD=ON
GGML_METAL=ON
GGML_METAL_EMBED_LIBRARY=ON
GGML_BLAS_DEFAULT=ON
@@ -39,6 +40,7 @@ COMMON_CMAKE_ARGS=(
-DLLAMA_BUILD_TOOLS=${LLAMA_BUILD_TOOLS}
-DLLAMA_BUILD_TESTS=${LLAMA_BUILD_TESTS}
-DLLAMA_BUILD_SERVER=${LLAMA_BUILD_SERVER}
-DLLAMA_BUILD_MTMD=${LLAMA_BUILD_MTMD}
-DGGML_METAL_EMBED_LIBRARY=${GGML_METAL_EMBED_LIBRARY}
-DGGML_BLAS_DEFAULT=${GGML_BLAS_DEFAULT}
-DGGML_METAL=${GGML_METAL}
@@ -126,6 +128,8 @@ setup_framework_structure() {
cp ggml/include/ggml-cpu.h ${header_path}
cp ggml/include/ggml-blas.h ${header_path}
cp ggml/include/gguf.h ${header_path}
cp tools/mtmd/mtmd.h ${header_path}
cp tools/mtmd/mtmd-helper.h ${header_path}
# Create module map (common for all platforms)
cat > ${module_path}module.modulemap << EOF
@@ -247,6 +251,7 @@ combine_static_libraries() {
"${base_dir}/${build_dir}/ggml/src/${release_dir}/libggml-cpu.a"
"${base_dir}/${build_dir}/ggml/src/ggml-metal/${release_dir}/libggml-metal.a"
"${base_dir}/${build_dir}/ggml/src/ggml-blas/${release_dir}/libggml-blas.a"
"${base_dir}/${build_dir}/tools/mtmd/${release_dir}/libmtmd.a"
)
# Create temporary directory for processing
@@ -410,6 +415,7 @@ cmake -B build-ios-sim -G Xcode \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-DMTMD_VIDEO=OFF \
-S .
cmake --build build-ios-sim --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
@@ -424,6 +430,7 @@ cmake -B build-ios-device -G Xcode \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-DMTMD_VIDEO=OFF \
-S .
cmake --build build-ios-device --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
@@ -450,6 +457,7 @@ cmake -B build-visionos -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-DMTMD_VIDEO=OFF \
-S .
cmake --build build-visionos --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
@@ -465,6 +473,7 @@ cmake -B build-visionos-sim -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-DMTMD_VIDEO=OFF \
-S .
cmake --build build-visionos-sim --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
@@ -481,6 +490,7 @@ cmake -B build-tvos-sim -G Xcode \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-DMTMD_VIDEO=OFF \
-S .
cmake --build build-tvos-sim --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
@@ -496,6 +506,7 @@ cmake -B build-tvos-device -G Xcode \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-DMTMD_VIDEO=OFF \
-S .
cmake --build build-tvos-device --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
+212 -123
View File
@@ -297,60 +297,6 @@ struct handle_model_result {
std::string preset_path;
};
static handle_model_result common_params_handle_model(struct common_params_model & model,
const common_download_opts & opts) {
handle_model_result result;
// TODO @ngxson : refactor this into a new common_model_download_context
if (!model.docker_repo.empty()) {
model.path = common_docker_resolve_model(model.docker_repo);
} else if (!model.hf_repo.empty()) {
// If -m was used with -hf, treat the model "path" as the hf_file to download
if (model.hf_file.empty() && !model.path.empty()) {
model.hf_file = model.path;
model.path = "";
}
common_download_opts hf_opts = opts;
auto download_result = common_download_model(model, hf_opts);
if (!download_result.preset_path.empty()) {
result.found_preset = true;
result.preset_path = download_result.preset_path;
return result; // skip everything else if preset.ini is used
}
if (download_result.model_path.empty()) {
throw std::runtime_error("failed to download model from Hugging Face");
}
model.path = download_result.model_path;
if (!download_result.mmproj_path.empty()) {
result.found_mmproj = true;
result.mmproj.path = download_result.mmproj_path;
}
if (!download_result.mtp_path.empty()) {
result.found_mtp = true;
result.mtp.path = download_result.mtp_path;
}
} else if (!model.url.empty()) {
if (model.path.empty()) {
auto f = string_split<std::string>(model.url, '#').front();
f = string_split<std::string>(f, '?').front();
model.path = fs_get_cache_file(string_split<std::string>(f, '/').back());
}
auto download_result = common_download_model(model, opts);
if (download_result.model_path.empty()) {
throw std::runtime_error("failed to download model from " + model.url);
}
}
return result;
}
const std::vector<ggml_type> kv_cache_types = {
GGML_TYPE_F32,
GGML_TYPE_F16,
@@ -395,77 +341,204 @@ static bool parse_bool_value(const std::string & value) {
}
//
// CLI argument parsing functions
// common_models_handler
//
bool common_params_handle_models(common_params & params, llama_example curr_ex, const common_params_handle_models_params & handle_params) {
const bool spec_type_draft_mtp = std::find(params.speculative.types.begin(),
params.speculative.types.end(),
COMMON_SPECULATIVE_TYPE_DRAFT_MTP) != params.speculative.types.end();
static std::string get_default_local_path(const std::string & url) {
auto f = string_split<std::string>(url, '#').front();
f = string_split<std::string>(f, '?').front();
return fs_get_cache_file(string_split<std::string>(f, '/').back());
}
common_models_handler common_models_handler_init(const common_params & params, llama_example curr_ex) {
common_download_hf_plan plan;
common_download_opts opts;
const bool spec_type_draft_mtp = std::find(params.speculative.types.begin(),
params.speculative.types.end(),
COMMON_SPECULATIVE_TYPE_DRAFT_MTP) != params.speculative.types.end();
// only download mmproj if the current example is using it
bool use_mmproj = false;
for (const auto & ex : mmproj_examples) {
if (curr_ex == ex) {
use_mmproj = true;
break;
}
}
opts.bearer_token = params.hf_token;
opts.offline = params.offline;
opts.skip_download = params.skip_download;
opts.download_mtp = spec_type_draft_mtp;
opts.download_mmproj = !params.no_mmproj && params.mmproj.path.empty() && params.mmproj.url.empty();
opts.preset_only = handle_params.preset_only;
opts.download_mmproj = use_mmproj && !params.no_mmproj
&& params.mmproj.path.empty() && params.mmproj.url.empty();
if (handle_params.callback) {
opts.callback = handle_params.callback;
if (!params.model.hf_repo.empty()) {
plan = common_download_get_hf_plan(params.model, opts);
}
// sub-models (draft, mmproj, vocoder) are explicitly specified by the user,
// so we should not auto-discover mtp/mmproj siblings for them
common_download_opts sub_opts = opts;
sub_opts.download_mtp = false;
sub_opts.download_mmproj = false;
return common_models_handler{plan, opts};
}
try {
auto res = common_params_handle_model(params.model, opts);
if (res.found_preset) {
if (!params.models_preset.empty()) {
throw std::invalid_argument("cannot use both --models-preset and -hf with a preset.ini file");
bool common_models_handler_is_preset_repo(const common_models_handler & handler) {
return !handler.plan.preset.url.empty();
}
static std::vector<common_download_task> build_url_tasks(const common_params_model & model, common_download_opts opts) {
auto parts = common_download_get_all_parts(model.url);
std::vector<common_download_task> tasks;
// single-part: download straight to model.path if the user gave one (-m), else the cache default
if (parts.size() == 1) {
common_download_task task;
task.url = parts[0];
task.local_path = model.path.empty() ? get_default_local_path(parts[0]) : model.path;
task.opts = opts;
tasks.push_back(std::move(task));
return tasks;
}
// multi-part: place each part under the user's -m directory (if given), else the cache default
std::string base_dir;
if (!model.path.empty()) {
auto pos = model.path.rfind('/');
base_dir = pos == std::string::npos ? std::string(".") : model.path.substr(0, pos);
}
for (const auto & part : parts) {
common_download_task task;
task.url = part;
task.opts = opts;
std::string local = get_default_local_path(part);
if (!base_dir.empty()) {
auto pos = local.rfind('/');
std::string name = pos == std::string::npos ? local : local.substr(pos + 1);
local = base_dir + "/" + name;
}
task.local_path = local;
tasks.push_back(std::move(task));
}
return tasks;
}
void common_models_handler_apply(common_models_handler & handler, common_params & params, common_download_callback * callback) {
std::vector<common_download_task> tasks;
auto & plan = handler.plan;
auto opts = handler.opts; // copy
opts.callback = callback;
// handle plain "url" if needed
auto handle_url = [&](common_params_model & model) {
if (!model.url.empty()) {
if (model.path.empty()) {
model.path = get_default_local_path(model.url);
}
}
};
handle_url(params.model);
handle_url(params.mmproj);
handle_url(params.vocoder.model);
handle_url(params.speculative.draft.mparams);
// optionally, if docker repo is set, resolve it
if (!params.model.docker_repo.empty()) {
params.model.url = common_docker_resolve_model(params.model.docker_repo);
params.model.path = get_default_local_path(params.model.url);
}
// handle plain "url" tasks (non-hf)
if (!params.model.url.empty()) {
auto url_tasks = build_url_tasks(params.model, opts);
// the first part is what gets loaded, so point params.model.path at it
if (!url_tasks.empty()) {
std::string first_path = url_tasks.front().local_path;
url_tasks.front().on_done = [&]() { params.model.path = first_path; };
}
for (auto & task : url_tasks) {
tasks.push_back(std::move(task));
}
}
if (!params.mmproj.url.empty()) {
common_download_task task;
task.url = params.mmproj.url;
task.local_path = params.mmproj.path;
task.opts = opts;
tasks.push_back(task);
}
if (!params.vocoder.model.url.empty()) {
common_download_task task;
task.url = params.vocoder.model.url;
task.local_path = params.vocoder.model.path;
task.opts = opts;
tasks.push_back(task);
}
if (!params.speculative.draft.mparams.url.empty()) {
common_download_task task;
task.url = params.speculative.draft.mparams.url;
task.local_path = params.speculative.draft.mparams.path;
task.opts = opts;
tasks.push_back(task);
}
// handle hf_plan tasks
if (!plan.model_files.empty()) {
for (size_t i = 0; i < plan.model_files.size(); ++i) {
auto & model_file = plan.model_files[i];
bool is_first = (i == 0);
tasks.emplace_back(model_file, opts, [&, is_first]() {
if (is_first) {
// only use first part as model path
params.model.path = hf_cache::finalize_file(model_file);
} else {
hf_cache::finalize_file(model_file);
}
});
}
}
if (!plan.mmproj.local_path.empty()) {
tasks.emplace_back(plan.mmproj, opts, [&]() {
params.mmproj.path = hf_cache::finalize_file(plan.mmproj);
});
}
if (!plan.mtp.local_path.empty()) {
tasks.emplace_back(plan.mtp, opts, [&]() {
// only fall back to the discovered MTP head when no draft was explicitly provided
if (params.speculative.draft.mparams.empty()) {
params.speculative.draft.mparams.path = hf_cache::finalize_file(plan.mtp);
} else {
hf_cache::finalize_file(plan.mtp);
}
});
}
if (!plan.preset.local_path.empty()) {
tasks.emplace_back(plan.preset, opts, [&]() {
// if HF repo is a preset repo, we simply run server in router mode with the preset.ini file
params.models_preset_hf = params.model.hf_repo; // only for showing a warning
params.models_preset = res.preset_path;
params.models_preset = hf_cache::finalize_file(plan.preset);
params.model = common_params_model{}; // make sure to clear model, so server starts in router mode
return true;
}
});
}
if (params.no_mmproj) {
params.mmproj = {};
} else if (res.found_mmproj && params.mmproj.path.empty() && params.mmproj.url.empty()) {
// optionally, handle mmproj model when -hf is specified
params.mmproj = res.mmproj;
}
// only download mmproj if the current example is using it
for (const auto & ex : mmproj_examples) {
if (curr_ex == ex) {
common_params_handle_model(params.mmproj, sub_opts);
break;
}
}
// run all tasks in parallel
if (!params.offline) {
common_download_run_tasks(tasks);
}
// when --spec-type mtp is set and no draft model was provided explicitly,
// fall back to the MTP head discovered alongside the -hf model
if (spec_type_draft_mtp && res.found_mtp &&
params.speculative.draft.mparams.path.empty() &&
params.speculative.draft.mparams.hf_repo.empty() &&
params.speculative.draft.mparams.url.empty()) {
params.speculative.draft.mparams.path = res.mtp.path;
// download successful, update params with the downloaded paths
for (const auto & task : tasks) {
if (task.on_done) {
task.on_done();
}
common_params_handle_model(params.speculative.draft.mparams, sub_opts);
common_params_handle_model(params.vocoder.model, sub_opts);
return true;
} catch (const common_skip_download_exception &) {
return false;
} catch (const std::exception &) {
throw;
}
}
//
// CLI argument parsing functions
//
static bool common_params_parse_ex(int argc, char ** argv, common_params_context & ctx_arg) {
common_params & params = ctx_arg.params;
@@ -594,12 +667,15 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
const bool skip_model_download =
// server will call common_params_handle_models() later, so we skip it here
ctx_arg.ex == LLAMA_EXAMPLE_SERVER ||
// download calls common_params_handle_models() itself and prints the paths
ctx_arg.ex == LLAMA_EXAMPLE_DOWNLOAD ||
// export_graph_ops loads only metadata
ctx_arg.ex == LLAMA_EXAMPLE_EXPORT_GRAPH_OPS;
if (!skip_model_download) {
// handle model and download
common_params_handle_models(params, ctx_arg.ex, {});
common_models_handler handler = common_models_handler_init(params, ctx_arg.ex);
common_models_handler_apply(handler, params);
// model is required (except for server)
// TODO @ngxson : maybe show a list of available models in CLI in this case
@@ -671,15 +747,19 @@ static void common_params_print_usage(common_params_context & ctx_arg) {
common_options.push_back(&opt);
}
}
printf("----- common params -----\n\n");
print_options(common_options);
printf("\n\n----- sampling params -----\n\n");
print_options(sampling_options);
printf("\n\n----- speculative params -----\n\n");
print_options(spec_options);
// TODO: maybe convert enum llama_example to string
printf("\n\n----- example-specific params -----\n\n");
print_options(specific_options);
bool first = true;
auto print_section = [&](const char * header, std::vector<common_arg *> & options) {
if (options.empty()) {
return;
}
printf("%s----- %s -----\n\n", first ? "" : "\n\n", header);
first = false;
print_options(options);
};
print_section("common params", common_options);
print_section("sampling params", sampling_options);
print_section("speculative params", spec_options);
print_section("example-specific params", specific_options);
}
static void common_params_print_completion(common_params_context & ctx_arg) {
@@ -1079,7 +1159,9 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
* - if both {LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_*,} are set, we will prioritize the LLAMA_EXAMPLE_* matching current example
*/
auto add_opt = [&](common_arg arg) {
if ((arg.in_example(ex) || arg.in_example(LLAMA_EXAMPLE_COMMON)) && !arg.is_exclude(ex)) {
// download only exposes the handful of args explicitly tagged for it
const bool inherit_common = ex != LLAMA_EXAMPLE_DOWNLOAD;
if ((arg.in_example(ex) || (inherit_common && arg.in_example(LLAMA_EXAMPLE_COMMON))) && !arg.is_exclude(ex)) {
ctx_arg.options.push_back(std::move(arg));
}
};
@@ -1090,7 +1172,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params) {
params.usage = true;
}
));
).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_DOWNLOAD}));
add_opt(common_arg(
{"--version"},
"show version and build info",
@@ -2212,7 +2294,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, bool value) {
params.no_mmproj = !value;
}
).set_examples(mmproj_examples).set_env("LLAMA_ARG_MMPROJ_AUTO"));
).set_examples({LLAMA_EXAMPLE_MTMD, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI, LLAMA_EXAMPLE_DOWNLOAD}).set_env("LLAMA_ARG_MMPROJ_AUTO"));
add_opt(common_arg(
{"--mmproj-offload"},
{"--no-mmproj-offload"},
@@ -2611,14 +2693,14 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, const std::string & value) {
params.model.path = value;
}
).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_EXPORT_LORA}).set_env("LLAMA_ARG_MODEL"));
).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_EXPORT_LORA, LLAMA_EXAMPLE_DOWNLOAD}).set_env("LLAMA_ARG_MODEL"));
add_opt(common_arg(
{"-mu", "--model-url"}, "MODEL_URL",
"model download url (default: unused)",
[](common_params & params, const std::string & value) {
params.model.url = value;
}
).set_env("LLAMA_ARG_MODEL_URL"));
).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_DOWNLOAD}).set_env("LLAMA_ARG_MODEL_URL"));
add_opt(common_arg(
{ "-dr", "--docker-repo" }, "[<repo>/]<model>[:quant]",
"Docker Hub model repository. repo is optional, default to ai/. quant is optional, default to :latest.\n"
@@ -2627,7 +2709,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, const std::string & value) {
params.model.docker_repo = value;
}
).set_env("LLAMA_ARG_DOCKER_REPO"));
).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_DOWNLOAD}).set_env("LLAMA_ARG_DOCKER_REPO"));
add_opt(common_arg(
{"-hf", "-hfr", "--hf-repo"}, "<user>/<model>[:quant]",
"Hugging Face model repository; quant is optional, case-insensitive, default to Q4_K_M, or falls back to the first file in the repo if Q4_K_M doesn't exist.\n"
@@ -2637,14 +2719,14 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, const std::string & value) {
params.model.hf_repo = value;
}
).set_env("LLAMA_ARG_HF_REPO"));
).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_DOWNLOAD}).set_env("LLAMA_ARG_HF_REPO"));
add_opt(common_arg(
{"-hff", "--hf-file"}, "FILE",
"Hugging Face model file. If specified, it will override the quant in --hf-repo (default: unused)",
[](common_params & params, const std::string & value) {
params.model.hf_file = value;
}
).set_env("LLAMA_ARG_HF_FILE"));
).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_DOWNLOAD}).set_env("LLAMA_ARG_HF_FILE"));
add_opt(common_arg(
{"-hfv", "-hfrv", "--hf-repo-v"}, "<user>/<model>[:quant]",
"Hugging Face model repository for the vocoder model (default: unused)",
@@ -2665,7 +2747,14 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, const std::string & value) {
params.hf_token = value;
}
).set_env("HF_TOKEN"));
).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_DOWNLOAD}).set_env("HF_TOKEN"));
add_opt(common_arg(
{"--mtp"},
"also download the multi-token prediction (MTP) head, if available (default: unused)",
[](common_params & params) {
params.speculative.types.push_back(COMMON_SPECULATIVE_TYPE_DRAFT_MTP);
}
).set_examples({LLAMA_EXAMPLE_DOWNLOAD}));
add_opt(common_arg(
{"--context-file"}, "FNAME",
"file to load context from (use comma-separated values to specify multiple files)",
+12 -11
View File
@@ -8,6 +8,7 @@
#include <string>
#include <vector>
#include <cstring>
#include <memory>
// pseudo-env variable to identify preset-only arguments
#define COMMON_ARG_PRESET_LOAD_ON_STARTUP "__PRESET_LOAD_ON_STARTUP"
@@ -130,19 +131,19 @@ bool common_params_to_map(int argc, char ** argv, llama_example ex, std::map<com
// see: https://github.com/ggml-org/llama.cpp/issues/18163
void common_params_add_preset_options(std::vector<common_arg> & args);
struct common_params_handle_models_params {
common_download_callback * callback = nullptr;
bool preset_only = false; // if true, only check & download remote preset (for router mode)
struct common_models_handler {
common_download_hf_plan plan;
common_download_opts opts;
};
// populate model paths (main model, mmproj, etc) from -hf if necessary
// return true if the model is ready to use
// throw an exception if there is an error that prevents the model from being used (e.g. network error, model not found, etc)
// if params.skip_download is true, no downloads will be attempted. return false if the model is invalid or missing (e.g. ETag check failed)
bool common_params_handle_models(
common_params & params,
llama_example curr_ex,
const common_params_handle_models_params & handle_params);
// initialize downloading opts and hf_plan if needed, but does not download anything yet
common_models_handler common_models_handler_init(const common_params & params, llama_example curr_ex);
// check if the model is a preset repo (i.e. has a preset file)
bool common_models_handler_is_preset_repo(const common_models_handler & handler);
// download and update params with the downloaded model path
void common_models_handler_apply(common_models_handler & handler, common_params & params, common_download_callback * callback = nullptr);
// initialize argument parser context - used by test-arg-parser and preset
common_params_context common_params_parser_init(common_params & params, llama_example ex, void(*print_usage)(int, char **) = nullptr);
+4
View File
@@ -2758,5 +2758,9 @@ common_chat_msg common_chat_peg_parse(const common_peg_arena & src_pars
std::map<std::string, bool> common_chat_templates_get_caps(const common_chat_templates * chat_templates) {
GGML_ASSERT(chat_templates != nullptr);
GGML_ASSERT(chat_templates->template_default != nullptr);
if (chat_templates->template_tool_use != nullptr) {
// take the more expressive template when available
return chat_templates->template_tool_use->caps.to_map();
}
return chat_templates->template_default->caps.to_map();
}
+12 -8
View File
@@ -96,6 +96,7 @@ enum llama_example {
LLAMA_EXAMPLE_FIT_PARAMS,
LLAMA_EXAMPLE_RESULTS,
LLAMA_EXAMPLE_EXPORT_GRAPH_OPS,
LLAMA_EXAMPLE_DOWNLOAD,
LLAMA_EXAMPLE_COUNT,
};
@@ -290,13 +291,13 @@ struct common_params_sampling {
};
struct common_params_model {
std::string path = ""; // model local path // NOLINT
std::string url = ""; // model url to download // NOLINT
std::string hf_repo = ""; // HF repo // NOLINT
std::string hf_file = ""; // HF file // NOLINT
std::string docker_repo = ""; // Docker repo // NOLINT
std::string path = ""; // model local path
std::string url = ""; // model url to download
std::string hf_repo = ""; // HF repo
std::string hf_file = ""; // HF file
std::string docker_repo = ""; // Docker repo
std::string get_name() {
std::string get_name() const {
if (!hf_repo.empty()) {
return hf_repo;
}
@@ -305,6 +306,10 @@ struct common_params_model {
}
return path;
}
bool empty() const {
return get_name().empty();
}
};
// draft-model-based speculative decoding parameters
@@ -367,7 +372,7 @@ struct common_params_speculative {
common_params_speculative_ngram_cache ngram_cache;
bool has_dft() const {
return !draft.mparams.path.empty() || !draft.mparams.hf_repo.empty();
return !draft.mparams.empty();
}
uint32_t need_n_rs_seq() const {
@@ -519,7 +524,6 @@ struct common_params {
int32_t control_vector_layer_start = -1; // layer range for control vector
int32_t control_vector_layer_end = -1; // layer range for control vector
bool offline = false;
bool skip_download = false; // skip model file downloading
int32_t ppl_stride = 0; // stride for perplexity calculations. If left at 0, the pre-existing approach will be used.
int32_t ppl_output_type = 0; // = 0 -> ppl output is as usual, = 1 -> ppl output is num_tokens, ppl, one per line
+22 -118
View File
@@ -292,10 +292,6 @@ static int common_download_file_single_online(const std::string & url,
const bool file_exists = std::filesystem::exists(path);
if (!file_exists && opts.skip_download) {
return -2; // file is missing and download is disabled
}
if (file_exists && skip_etag) {
LOG_DBG("%s: using cached file: %s\n", __func__, path.c_str());
return 304; // 304 Not Modified - fake cached response
@@ -362,9 +358,6 @@ static int common_download_file_single_online(const std::string & url,
return 304; // 304 Not Modified - fake cached response
}
// pass this point, the file exists but is different from the server version, so we need to redownload it
if (opts.skip_download) {
return -2; // special code to indicate that the download was skipped due to etag mismatch
}
if (remove(path.c_str()) != 0) {
LOG_ERR("%s: unable to delete file: %s\n", __func__, path.c_str());
return -1;
@@ -691,19 +684,8 @@ static void list_available_gguf_files(const hf_cache::hf_files & files) {
}
}
struct hf_plan {
hf_cache::hf_file primary;
hf_cache::hf_files model_files;
hf_cache::hf_file mmproj;
hf_cache::hf_file mtp;
hf_cache::hf_file preset; // if set, only this file is downloaded
};
static hf_plan get_hf_plan(const common_params_model & model,
const common_download_opts & opts,
bool download_mmproj,
bool download_mtp) {
hf_plan plan;
common_download_hf_plan common_download_get_hf_plan(const common_params_model & model, const common_download_opts & opts) {
common_download_hf_plan plan;
hf_cache::hf_files all;
auto [repo, tag] = common_download_split_repo_tag(model.hf_repo);
@@ -752,127 +734,49 @@ static hf_plan get_hf_plan(const common_params_model & model,
plan.primary = primary;
plan.model_files = get_split_files(all, primary);
if (download_mmproj) {
if (opts.download_mmproj) {
plan.mmproj = find_best_mmproj(all, primary.path);
}
if (download_mtp) {
if (opts.download_mtp) {
plan.mtp = find_best_mtp(all, primary.path);
}
return plan;
}
struct download_task {
std::string url;
std::string path;
};
static std::vector<download_task> get_url_tasks(const common_params_model & model) {
auto split = get_gguf_split_info(model.url);
if (split.count <= 1) {
return {{model.url, model.path}};
}
auto filename = split.prefix;
if (auto pos = split.prefix.rfind('/'); pos != std::string::npos) {
filename = split.prefix.substr(pos + 1);
}
auto parent_path = std::filesystem::path(model.path).parent_path();
auto prefix_path = (parent_path / filename).string();
std::vector<download_task> tasks;
for (int i = 1; i <= split.count; i++) {
auto suffix = string_format("-%05d-of-%05d.gguf", i, split.count);
tasks.push_back({split.prefix + suffix, prefix_path + suffix});
}
return tasks;
}
common_download_model_result common_download_model(const common_params_model & model,
const common_download_opts & opts) {
common_download_model_result result;
std::vector<download_task> tasks;
hf_plan hf;
bool download_mmproj = opts.download_mmproj;
bool download_mtp = opts.download_mtp;
bool preset_only = opts.preset_only;
bool is_hf = !model.hf_repo.empty();
if (is_hf) {
hf = get_hf_plan(model, opts, download_mmproj, download_mtp);
if (!hf.preset.path.empty()) {
// if preset.ini exists, only download that file alone
tasks.push_back({hf.preset.url, hf.preset.local_path});
} else if (!preset_only) {
// only add other files if we're NOT in preset-only mode (normal run, non-router)
for (const auto & f : hf.model_files) {
tasks.push_back({f.url, f.local_path});
}
if (!hf.mmproj.path.empty()) {
tasks.push_back({hf.mmproj.url, hf.mmproj.local_path});
}
if (!hf.mtp.path.empty()) {
tasks.push_back({hf.mtp.url, hf.mtp.local_path});
}
}
} else if (!model.url.empty()) {
tasks = get_url_tasks(model);
} else {
result.model_path = model.path;
return result;
}
if (tasks.empty()) {
return result;
}
void common_download_run_tasks(const std::vector<common_download_task> & tasks) {
std::vector<std::future<int>> futures;
for (const auto & task : tasks) {
futures.push_back(std::async(std::launch::async,
[&task, &opts, is_hf]() {
return common_download_file_single(task.url, task.path, opts, is_hf);
[&task]() {
return common_download_file_single(task.url, task.local_path, task.opts, task.is_hf);
}
));
}
for (auto & f : futures) {
int status = f.get();
if (status == -2 && opts.skip_download) {
throw common_skip_download_exception();
}
for (size_t i = 0; i < futures.size(); ++i) {
std::string url = tasks[i].url;
int status = futures[i].get();
bool is_ok = is_http_status_ok(status);
if (!is_ok) {
return {};
throw std::runtime_error(string_format("Download '%s' failed with status code: %d", url.c_str(), status));
}
}
}
if (is_hf) {
if (!hf.preset.path.empty()) {
// if preset.ini is used, do not set other paths
result.preset_path = hf_cache::finalize_file(hf.preset);
} else {
for (const auto & f : hf.model_files) {
hf_cache::finalize_file(f);
}
result.model_path = hf.primary.final_path;
std::vector<std::string> common_download_get_all_parts(const std::string & url) {
auto split = get_gguf_split_info(url);
if (!hf.mmproj.path.empty()) {
result.mmproj_path = hf_cache::finalize_file(hf.mmproj);
}
if (!hf.mtp.path.empty()) {
result.mtp_path = hf_cache::finalize_file(hf.mtp);
}
}
} else {
result.model_path = model.path;
if (split.count <= 1) {
return {url};
}
return result;
std::vector<std::string> parts;
for (int i = 1; i <= split.count; i++) {
auto suffix = string_format("-%05d-of-%05d.gguf", i, split.count);
parts.push_back(split.prefix + suffix);
}
return parts;
}
//
+28 -43
View File
@@ -1,7 +1,10 @@
#pragma once
#include "hf-cache.h"
#include <string>
#include <vector>
#include <functional>
struct common_params_model;
@@ -47,67 +50,40 @@ struct common_cached_model_info {
}
};
// Options for common_download_model and common_download_file_single
// Options for common_download_file_single
struct common_download_opts {
std::string bearer_token;
common_header_list headers;
bool offline = false;
bool skip_download = false; // if true, only validation is performed, common_skip_download_exception may be thrown if the file is missing or invalid
bool download_mmproj = false;
bool download_mtp = false;
bool preset_only = false; // if true, only check & download remote preset (for router mode)
common_download_callback * callback = nullptr;
};
// Result of common_download_model
struct common_download_model_result {
std::string model_path;
std::string mmproj_path;
std::string mtp_path;
std::string preset_path;
struct common_download_task {
common_download_opts opts;
std::string url;
std::string local_path;
std::function<void()> on_done;
bool is_hf = false;
common_download_task() = default;
common_download_task(hf_cache::hf_file f,
const common_download_opts & opts,
std::function<void()> on_done = nullptr)
: opts(opts), url(f.url), local_path(f.local_path), on_done(on_done), is_hf(true) {}
};
// throw if the file is missing or invalid (e.g. ETag check failed)
struct common_skip_download_exception : public std::runtime_error {
common_skip_download_exception() : std::runtime_error("skip download") {}
};
void common_download_run_tasks(const std::vector<common_download_task> & tasks);
// Download model from HuggingFace repo or URL
//
// input (via model struct):
// - model.hf_repo: HF repo with optional tag, see common_download_split_repo_tag
// - model.hf_file: specific file in the repo (requires hf_repo)
// - model.url: simple download (used if hf_repo is empty)
// - model.path: local file path
//
// tag matching (for HF repos without model.hf_file):
// - if tag is specified, searches for GGUF matching that quantization
// - if no tag, searches for Q4_K_M, then Q4_0, then first available GGUF
//
// split GGUF: multi-part files like "model-00001-of-00003.gguf" are automatically
// detected and all parts are downloaded
//
// caching:
// - HF repos: uses HuggingFace cache
// - URLs: uses ETag-based caching
//
// when opts.offline=true, no network requests are made
// when download_mmproj=true, searches for mmproj in same directory as model or any parent directory
// then with the closest quantization bits
// when download_mtp=true, applies the same sibling search for an MTP-head GGUF
//
// returns result with model_path, mmproj_path and mtp_path (empty when not found / on failure)
common_download_model_result common_download_model(
const common_params_model & model,
const common_download_opts & opts = {}
);
// if url is a multi-part GGUF file, returns all parts, otherwise returns the single file
std::vector<std::string> common_download_get_all_parts(const std::string & url);
// returns list of cached models
std::vector<common_cached_model_info> common_list_cached_models();
// download single file from url to local path
// returns status code or -1 on error
// returns -2 if the download was skipped due to ETag mismatch (file outdated, skip_download=true)
// skip_etag: if true, don't read/write .etag files (for HF cache where filename is the hash)
int common_download_file_single(const std::string & url,
const std::string & path,
@@ -124,3 +100,12 @@ std::string common_docker_resolve_model(const std::string & docker);
// - if tag is present, removes only files matching that tag (and orphaned blobs)
// returns true if anything was removed
bool common_download_remove(const std::string & hf_repo_with_tag);
struct common_download_hf_plan {
hf_cache::hf_file primary;
hf_cache::hf_files model_files;
hf_cache::hf_file mmproj;
hf_cache::hf_file mtp;
hf_cache::hf_file preset; // if set, only this file is downloaded
};
common_download_hf_plan common_download_get_hf_plan(const common_params_model & model, const common_download_opts & opts);
+1
View File
@@ -136,6 +136,7 @@ TEXT_MODEL_MAP: dict[str, str] = {
"LlamaModel": "llama",
"Eagle3DraftModel": "llama",
"Eagle3Speculator": "llama",
"Eagle3LlamaForCausalLM": "llama",
"LlamaForCausalLMEagle3": "llama",
"LlavaForConditionalGeneration": "llama",
"LlavaStableLMEpochForCausalLM": "stablelm",
+1
View File
@@ -23,6 +23,7 @@ from .base import ModelBase, TextModel, gguf, logger
"LlavaForConditionalGeneration",
"VoxtralForConditionalGeneration",
"LlamaForCausalLMEagle3",
"Eagle3LlamaForCausalLM",
"Eagle3Speculator",
"Eagle3DraftModel",
"IQuestCoderForCausalLM",
+18
View File
@@ -413,6 +413,15 @@ In two device selection modes, the default SYCL backend is level_zero, you can c
|------------------|----------------------------------------|
| Single device | --split-mode none --main-gpu DEVICE_ID |
| Multiple devices | --split-mode layer (default) |
| Multiple devices | --split-mode tensor (tensor parallelism) |
`--split-mode tensor` (tensor parallelism) shards each layer across the selected
GPUs. It requires flash attention, which is auto-enabled when `--flash-attn` is
left at its default `auto`, so `--split-mode tensor` works out of the box.
Passing `--flash-attn off` together with `--split-mode tensor` is rejected at
context creation. The default `f16` KV cache is recommended. Tensor parallelism
is currently optimized for 2 GPUs; other device counts fall back to a generic
all-reduce.
Examples:
@@ -715,6 +724,15 @@ In two device selection modes, the default SYCL backend is level_zero, you can c
|------------------|----------------------------------------|
| Single device | --split-mode none --main-gpu DEVICE_ID |
| Multiple devices | --split-mode layer (default) |
| Multiple devices | --split-mode tensor (tensor parallelism) |
`--split-mode tensor` (tensor parallelism) shards each layer across the selected
GPUs. It requires flash attention, which is auto-enabled when `--flash-attn` is
left at its default `auto`, so `--split-mode tensor` works out of the box.
Passing `--flash-attn off` together with `--split-mode tensor` is rejected at
context creation. The default `f16` KV cache is recommended. Tensor parallelism
is currently optimized for 2 GPUs; other device counts fall back to a generic
all-reduce.
Examples:
+41 -1
View File
@@ -13,6 +13,45 @@ The `llama-server` application supports several implementations of speculative d
A much smaller model (called the _draft model_) generates drafts.
A draft model is the most used approach in speculative decoding.
### EAGLE-3 (`draft-eagle3`)
EAGLE-3 uses a small draft model that reads the target model's hidden states to predict the next tokens, so it
reaches higher acceptance than a standalone draft model of the same size. The draft is a one-layer transformer
trained for a specific target model; it shares the target model's tokenizer and, optionally, uses a reduced draft
vocabulary with its own `lm_head`, which is mapped back using a `d2t` table.
Convert the EAGLE-3 checkpoint with `--target-model-dir` so it inherits the target's tokenizer and the layer
indices to read. Both the SpecForge `LlamaForCausalLMEagle3` and the vLLM/AngelSlim `Eagle3LlamaForCausalLM`
checkpoint formats are supported (for example [`AngelSlim/Qwen3-4B_eagle3`](https://huggingface.co/AngelSlim/Qwen3-4B_eagle3)
for `Qwen/Qwen3-4B`):
```bash
python convert_hf_to_gguf.py AngelSlim/Qwen3-4B_eagle3 \
--target-model-dir Qwen/Qwen3-4B --outtype bf16 --outfile Qwen3-4B-eagle3.gguf
llama-server -m Qwen3-4B.gguf -md Qwen3-4B-eagle3.gguf --spec-type draft-eagle3
```
Supported EAGLE-3 draft models include:
- [yuhuili/EAGLE3-LLaMA3.1-Instruct-8B](https://huggingface.co/yuhuili/EAGLE3-LLaMA3.1-Instruct-8B)
- [yuhuili/EAGLE3-LLaMA3.3-Instruct-70B](https://huggingface.co/yuhuili/EAGLE3-LLaMA3.3-Instruct-70B)
- [RedHatAI/gemma-4-31B-it-speculator.eagle3](https://huggingface.co/RedHatAI/gemma-4-31B-it-speculator.eagle3)
- [RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3](https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3)
- [Tengyunw/qwen3_8b_eagle3](https://huggingface.co/Tengyunw/qwen3_8b_eagle3)
- [Tengyunw/qwen3_30b_moe_eagle3](https://huggingface.co/Tengyunw/qwen3_30b_moe_eagle3)
- [AngelSlim/Qwen3-1.7B_eagle3](https://huggingface.co/AngelSlim/Qwen3-1.7B_eagle3)
- [AngelSlim/Qwen3-4B_eagle3](https://huggingface.co/AngelSlim/Qwen3-4B_eagle3)
- [AngelSlim/Qwen3-8B_eagle3](https://huggingface.co/AngelSlim/Qwen3-8B_eagle3)
- [AngelSlim/Qwen3-14B_eagle3](https://huggingface.co/AngelSlim/Qwen3-14B_eagle3)
- [AngelSlim/Qwen3-32B_eagle3](https://huggingface.co/AngelSlim/Qwen3-32B_eagle3)
- [AngelSlim/Qwen3-a3B_eagle3](https://huggingface.co/AngelSlim/Qwen3-a3B_eagle3)
- [RedHatAI/gpt-oss-20b-speculator.eagle3](https://huggingface.co/RedHatAI/gpt-oss-20b-speculator.eagle3)
- [lmsys/EAGLE3-gpt-oss-120b-bf16](https://huggingface.co/lmsys/EAGLE3-gpt-oss-120b-bf16)
- [nvidia/gpt-oss-120b-Eagle3-long-context](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3-long-context)
For the full and up-to-date list of supported models, see #18039.
### n-gram Cache (`ngram-cache`)
An n-gram is a sequence of n tokens. The n-gram cache implementation maintains statistics about short n-gram sequences.
@@ -108,7 +147,7 @@ If a draft model is combined with a draftless decoding the draftless decoding ha
### General Speculative Parameters
```
--spec-type [none|draft-simple|draft-mtp|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]
--spec-type [none|draft-simple|draft-eagle3|draft-mtp|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]
comma-separated list of types of speculative decoding to use
(default: none)
(env: LLAMA_ARG_SPEC_TYPE)
@@ -247,6 +286,7 @@ Specifies a comma-separated list of speculative decoding types to use.
|------|-------------|
| `none` | No speculative decoding (default) |
| `draft-simple` | Use a simple draft model for speculation |
| `draft-eagle3` | Use an EAGLE-3 draft model that reads the target's hidden states |
| `draft-mtp` | Use Multi Token Prediction (MTP) heads from the main model |
| `ngram-cache` | Use n-gram cache lookup |
| `ngram-simple` | Use simple n-gram pattern matching |
+8
View File
@@ -27,6 +27,14 @@ GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_sycl_buffer_type(int de
// split tensor buffer that splits matrices by rows across multiple devices
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_sycl_split_buffer_type(const float * tensor_split);
// Tensor parallelism (--split-mode tensor): comm_init/free/allreduce_tensor
// trio queried by the meta-backend via ggml_backend_reg_get_proc_address.
// See typedefs in ggml/include/ggml-backend.h. Mirrors the CUDA backend's
// pattern (ggml_backend_cuda_comm_*).
GGML_BACKEND_API void * ggml_backend_sycl_comm_init(ggml_backend_t * backends, size_t n_backends);
GGML_BACKEND_API void ggml_backend_sycl_comm_free(void * comm_ctx);
GGML_BACKEND_API bool ggml_backend_sycl_comm_allreduce_tensor(void * comm_ctx, struct ggml_tensor ** tensors);
// pinned host buffer for use with the CPU backend for faster copies between CPU and GPU
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_sycl_host_buffer_type(void);
+90 -46
View File
@@ -34,26 +34,26 @@ template <float (*bin_op)(const float, const float),
static __global__ void k_bin_bcast(const src0_t * src0,
const src1_t * src1,
dst_t * dst,
const int ne0,
const int ne1,
const int ne2,
const uint32_t ne0,
const uint32_t ne1,
const uint32_t ne2,
const uint3 ne3,
const uint3 ne10,
const uint3 ne11,
const uint3 ne12,
const uint3 ne13,
/*const int s0,*/
const int s1,
const int s2,
const int s3,
const int s00,
const int s01,
const int s02,
const int s03,
const int s10,
const int s11,
const int s12,
const int s13,
/*const uint32_t s0,*/
const uint32_t s1,
const uint32_t s2,
const uint32_t s3,
const uint32_t s00,
const uint32_t s01,
const uint32_t s02,
const uint32_t s03,
const uint32_t s10,
const uint32_t s11,
const uint32_t s12,
const uint32_t s13,
src1_ptrs... src1s) {
ggml_cuda_pdl_lc();
const uint32_t i0s = blockDim.x * blockIdx.x + threadIdx.x;
@@ -61,7 +61,7 @@ static __global__ void k_bin_bcast(const src0_t * src0,
const uint32_t i2 = fastdiv((blockDim.z * blockIdx.z + threadIdx.z), ne3);
const uint32_t i3 = (blockDim.z * blockIdx.z + threadIdx.z) - (i2 * ne3.z);
if (i0s >= (uint32_t)ne0 || i1 >= (uint32_t)ne1 || i2 >= (uint32_t)ne2 || i3 >= ne3.z) {
if (i0s >= ne0 || i1 >= ne1 || i2 >= ne2 || i3 >= ne3.z) {
return;
}
@@ -69,25 +69,32 @@ static __global__ void k_bin_bcast(const src0_t * src0,
const uint32_t i12 = fastmodulo(i2, ne12);
const uint32_t i13 = fastmodulo(i3, ne13);
const size_t i_src0 = i3*s03 + i2*s02 + i1*s01;
const size_t i_src1 = i13*s13 + i12*s12 + i11*s11;
const size_t i_dst = i3*s3 + i2*s2 + i1*s1;
const size_t i_src0 = size_t( i3)*s03 + size_t( i2)*s02 + size_t( i1)*s01;
const size_t i_src1 = size_t(i13)*s13 + size_t(i12)*s12 + size_t(i11)*s11;
const size_t i_dst = size_t( i3)*s3 + size_t( i2)*s2 + size_t( i1)*s1;
const src0_t * src0_row = src0 ? (src0 + i_src0) : nullptr;
dst_t * dst_row = dst + i_dst;
const uint32_t s0 = blockDim.x * gridDim.x;
ggml_cuda_pdl_sync();
for (int i0 = i0s; i0 < ne0; i0 += blockDim.x * gridDim.x) {
for (uint32_t i0 = i0s; i0 < ne0; i0 += s0) {
const uint32_t i10 = fastmodulo(i0, ne10);
float result = src0_row ? (float) src0_row[i0*s00] : 0.0f;
float result = src0_row ? (float) src0_row[size_t(i0)*s00] : 0.0f;
if constexpr (sizeof...(src1_ptrs) > 0) {
result = (..., (result = bin_op(result, (float)src1s[i_src1 + i10*s10])));
result = (..., (result = bin_op(result, (float)src1s[i_src1 + size_t(i10)*s10])));
} else {
result = bin_op(result, (float)src1[i_src1 + i10*s10]);
result = bin_op(result, (float)src1[i_src1 + size_t(i10)*s10]);
}
dst_row[i0] = (dst_t) result;
// protect i0 from overflow
if (ne0 - i0 <= s0) {
break;
}
}
}
@@ -110,19 +117,19 @@ static __global__ void k_bin_bcast_unravel(const src0_t * src0,
const uint3 ne12,
const uint3 ne13,
/*const int s0,*/
const int s1,
const int s2,
const int s3,
const int s00,
const int s01,
const int s02,
const int s03,
const int s10,
const int s11,
const int s12,
const int s13,
const uint32_t s1,
const uint32_t s2,
const uint32_t s3,
const uint32_t s00,
const uint32_t s01,
const uint32_t s02,
const uint32_t s03,
const uint32_t s10,
const uint32_t s11,
const uint32_t s12,
const uint32_t s13,
src1_ptrs... src1s) {
const int i = blockDim.x*blockIdx.x + threadIdx.x;
const uint32_t i = blockDim.x*blockIdx.x + threadIdx.x;
const uint32_t i3 = fastdiv(i, prod_012);
const uint32_t i2 = fastdiv(i - i3 * prod_012.z, prod_01);
@@ -133,25 +140,25 @@ static __global__ void k_bin_bcast_unravel(const src0_t * src0,
return;
}
const int i11 = fastmodulo(i1, ne11);
const int i12 = fastmodulo(i2, ne12);
const int i13 = fastmodulo(i3, ne13);
const uint32_t i11 = fastmodulo(i1, ne11);
const uint32_t i12 = fastmodulo(i2, ne12);
const uint32_t i13 = fastmodulo(i3, ne13);
const size_t i_src0 = i3*s03 + i2*s02 + i1*s01;
const size_t i_src1 = i13*s13 + i12*s12 + i11*s11;
const size_t i_dst = i3*s3 + i2*s2 + i1*s1;
const size_t i_src0 = size_t( i3)*s03 + size_t( i2)*s02 + size_t( i1)*s01;
const size_t i_src1 = size_t(i13)*s13 + size_t(i12)*s12 + size_t(i11)*s11;
const size_t i_dst = size_t( i3)*s3 + size_t( i2)*s2 + size_t( i1)*s1;
const src0_t * src0_row = src0 ? (src0 + i_src0) : nullptr;
dst_t * dst_row = dst + i_dst;
const int i10 = fastmodulo(i0, ne10);
const uint32_t i10 = fastmodulo(i0, ne10);
ggml_cuda_pdl_sync();
float result = src0_row ? (float) src0_row[i0*s00] : 0.0f;
float result = src0_row ? (float) src0_row[size_t(i0)*s00] : 0.0f;
if constexpr (sizeof...(src1_ptrs) > 0) {
result = (..., (result = bin_op(result, (float)src1s[i_src1 + i10*s10])));
result = (..., (result = bin_op(result, (float)src1s[i_src1 + size_t(i10)*s10])));
} else {
result = bin_op(result, (float)src1[i_src1 + i10*s10]);
result = bin_op(result, (float)src1[i_src1 + size_t(i10)*s10]);
}
dst_row[i0] = (dst_t) result;
@@ -248,6 +255,31 @@ static void launch_bin_bcast_pack(const ggml_tensor * src0, const ggml_tensor *
size_t s02 = nb02 / sizeof(src0_t);
size_t s03 = nb03 / sizeof(src0_t);
GGML_ASSERT(ne0 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(ne1 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(ne2 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(ne3 <= std::numeric_limits<uint32_t>::max());
//GGML_ASSERT(s0 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s1 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s2 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s3 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s00 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s01 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s02 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s03 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s10 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s11 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s12 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(s13 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(cne1[0] <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(cne1[1] <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(cne1[2] <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(cne1[3] <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(nb0 % sizeof(dst_t) == 0);
GGML_ASSERT(nb1 % sizeof(dst_t) == 0);
GGML_ASSERT(nb2 % sizeof(dst_t) == 0);
@@ -263,6 +295,8 @@ static void launch_bin_bcast_pack(const ggml_tensor * src0, const ggml_tensor *
GGML_ASSERT(nb12 % sizeof(src1_t) == 0);
GGML_ASSERT(nb13 % sizeof(src1_t) == 0);
GGML_ASSERT(ne2 * ne3 <= std::numeric_limits<unsigned int>::max());
const int block_size = 128;
int64_t hne0 = std::max(ne0 / 2LL, 1LL);
@@ -281,7 +315,13 @@ static void launch_bin_bcast_pack(const ggml_tensor * src0, const ggml_tensor *
const uint3 ne13 = init_fastdiv_values((uint32_t) cne1[3]);
if (block_nums.z > 65535 || block_nums.y > 65535) {
int block_num = (ne0 * ne1 * ne2 * ne3 + block_size - 1) / block_size;
int64_t block_num = (ne0 * ne1 * ne2 * ne3 + block_size - 1) / block_size;
GGML_ASSERT(block_num <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(block_num * block_size <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(ne0 * ne1 <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(ne0 * ne1 * ne2 <= std::numeric_limits<uint32_t>::max());
const uint3 prod_012 = init_fastdiv_values((uint32_t) (ne0 * ne1 * ne2));
const uint3 prod_01 = init_fastdiv_values((uint32_t) (ne0 * ne1));
const uint3 ne0_fastdiv = init_fastdiv_values((uint32_t) ne0);
@@ -298,6 +338,10 @@ static void launch_bin_bcast_pack(const ggml_tensor * src0, const ggml_tensor *
s10, s11, s12, s13, (const src1_t *) dst->src[I + 1]->data...);
}
} else {
GGML_ASSERT(int64_t(block_nums.x) * block_dims.x <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(int64_t(block_nums.y) * block_dims.y <= std::numeric_limits<uint32_t>::max());
GGML_ASSERT(int64_t(block_nums.z) * block_dims.z <= std::numeric_limits<uint32_t>::max());
const uint3 ne3_fastdiv = init_fastdiv_values((uint32_t) ne3);
{
const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(block_nums, block_dims, 0, stream);
+35 -29
View File
@@ -53,10 +53,10 @@ static __global__ void cpy_scalar_transpose(const char * cx, char * cdst, const
const int64_t nmat = ne / (ne00 * ne01);
const int64_t n = ne00 * ne01;
const int x = blockIdx.x * CUDA_CPY_TILE_DIM_2D + threadIdx.x;
const int y = blockIdx.y * CUDA_CPY_TILE_DIM_2D + threadIdx.y;
const int tx = blockIdx.y * CUDA_CPY_TILE_DIM_2D + threadIdx.x; // transpose block offset
const int ty = blockIdx.x * CUDA_CPY_TILE_DIM_2D + threadIdx.y;
const int64_t x = (int64_t) blockIdx.x * CUDA_CPY_TILE_DIM_2D + threadIdx.x;
const int64_t y = (int64_t) blockIdx.y * CUDA_CPY_TILE_DIM_2D + threadIdx.y;
const int64_t tx = (int64_t) blockIdx.y * CUDA_CPY_TILE_DIM_2D + threadIdx.x; // transpose block offset
const int64_t ty = (int64_t) blockIdx.x * CUDA_CPY_TILE_DIM_2D + threadIdx.y;
__shared__ float tile[2][CUDA_CPY_TILE_DIM_2D][CUDA_CPY_TILE_DIM_2D+1];
int cur_tile_buf = 0;
@@ -197,7 +197,7 @@ static void ggml_cpy_scalar_contiguous_cuda(
cudaStream_t stream) {
const int64_t num_blocks = (ne + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params((dim3)num_blocks, CUDA_CPY_BLOCK_SIZE, 0, stream);
ggml_cuda_kernel_launch(cpy_scalar_contiguous<src_t, dst_t>, launch_params, cx, cdst, ne);
}
@@ -208,6 +208,14 @@ static void ggml_cpy_scalar_cuda(
const int64_t ne00, const int64_t ne01, const int64_t ne02, const int64_t nb00, const int64_t nb01, const int64_t nb02,
const int64_t nb03, const int64_t ne10, const int64_t ne11, const int64_t ne12, const int64_t nb10, const int64_t nb11, const int64_t nb12, const int64_t nb13, cudaStream_t stream) {
const auto launch_scalar_generic = [&]() {
const int64_t num_blocks = (ne + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE;
GGML_ASSERT(num_blocks <= INT_MAX);
const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params((dim3)num_blocks, CUDA_CPY_BLOCK_SIZE, 0, stream);
ggml_cuda_kernel_launch(cpy_scalar<cpy_1_scalar<src_t, dst_t>>, launch_params,
cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
};
if (transposed) {
GGML_ASSERT(ne == ne00*ne01*ne02); // ne[3] is 1 assumed
int64_t ne00n, ne01n, ne02n;
@@ -224,20 +232,18 @@ static void ggml_cpy_scalar_cuda(
int64_t grid_x = (ne01n + CUDA_CPY_TILE_DIM_2D - 1) / CUDA_CPY_TILE_DIM_2D;
int64_t grid_y = (ne00n + CUDA_CPY_TILE_DIM_2D - 1) / CUDA_CPY_TILE_DIM_2D;
int64_t grid_z = (ne/(ne01n*ne00n) + CUDA_CPY_BLOCK_NM - 1) / CUDA_CPY_BLOCK_NM;
GGML_ASSERT(grid_x < UINT_MAX);
GGML_ASSERT(grid_y < USHRT_MAX);
GGML_ASSERT(grid_z < USHRT_MAX);
dim3 dimGrid(grid_x, grid_y, grid_z);
dim3 dimBlock(CUDA_CPY_TILE_DIM_2D, CUDA_CPY_BLOCK_ROWS, 1);
const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(dimGrid, dimBlock, 0, stream);
ggml_cuda_kernel_launch(cpy_scalar_transpose<dst_t>, launch_params,
cx, cdst, ne, ne00n, ne01n, ne02n, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
GGML_ASSERT(grid_x <= INT_MAX);
if (grid_y > USHRT_MAX || grid_z > USHRT_MAX) {
launch_scalar_generic();
} else {
dim3 dimGrid(grid_x, grid_y, grid_z);
dim3 dimBlock(CUDA_CPY_TILE_DIM_2D, CUDA_CPY_BLOCK_ROWS, 1);
const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(dimGrid, dimBlock, 0, stream);
ggml_cuda_kernel_launch(cpy_scalar_transpose<dst_t>, launch_params,
cx, cdst, ne, ne00n, ne01n, ne02n, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
}
} else {
const int64_t num_blocks = (ne + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE;
GGML_ASSERT(num_blocks < UINT_MAX);
const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params((dim3)num_blocks, CUDA_CPY_BLOCK_SIZE, 0, stream);
ggml_cuda_kernel_launch(cpy_scalar<cpy_1_scalar<src_t, dst_t>>, launch_params,
cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
launch_scalar_generic();
}
}
@@ -248,7 +254,7 @@ static void ggml_cpy_f32_q8_0_cuda(
GGML_ASSERT(ne % QK8_0 == 0);
const int64_t num_blocks = ne / QK8_0;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_f32_q<cpy_blck_f32_q8_0, QK8_0><<<num_blocks, 1, 0, stream>>>
(cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
}
@@ -259,7 +265,7 @@ static void ggml_cpy_q8_0_f32_cuda(
const int64_t nb03, const int64_t ne10, const int64_t ne11, const int64_t ne12, const int64_t nb10, const int64_t nb11, const int64_t nb12, const int64_t nb13, cudaStream_t stream) {
const int64_t num_blocks = ne;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_q_f32<cpy_blck_q8_0_f32, QK8_0><<<num_blocks, 1, 0, stream>>>
(cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
}
@@ -271,7 +277,7 @@ static void ggml_cpy_f32_q4_0_cuda(
GGML_ASSERT(ne % QK4_0 == 0);
const int64_t num_blocks = ne / QK4_0;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_f32_q<cpy_blck_f32_q4_0, QK4_0><<<num_blocks, 1, 0, stream>>>
(cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
}
@@ -284,7 +290,7 @@ static void ggml_cpy_q4_0_f32_cuda(
const int64_t nb10, const int64_t nb11, const int64_t nb12, const int64_t nb13,
cudaStream_t stream) {
const int64_t num_blocks = ne;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_q_f32<cpy_blck_q_f32<dequantize_q4_0, QK4_0>, QK4_0><<<num_blocks, 1, 0, stream>>>(
cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03,
ne10, ne11, ne12, nb10, nb11, nb12, nb13);
@@ -297,7 +303,7 @@ static void ggml_cpy_f32_q4_1_cuda(
GGML_ASSERT(ne % QK4_1 == 0);
const int64_t num_blocks = ne / QK4_1;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_f32_q<cpy_blck_f32_q4_1, QK4_1><<<num_blocks, 1, 0, stream>>>
(cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
}
@@ -310,7 +316,7 @@ static void ggml_cpy_q4_1_f32_cuda(
const int64_t nb10, const int64_t nb11, const int64_t nb12, const int64_t nb13,
cudaStream_t stream) {
const int64_t num_blocks = ne;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_q_f32<cpy_blck_q_f32<dequantize_q4_1, QK4_1>, QK4_1><<<num_blocks, 1, 0, stream>>>(
cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03,
ne10, ne11, ne12, nb10, nb11, nb12, nb13);
@@ -323,7 +329,7 @@ static void ggml_cpy_f32_q5_0_cuda(
GGML_ASSERT(ne % QK5_0 == 0);
const int64_t num_blocks = ne / QK5_0;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_f32_q<cpy_blck_f32_q5_0, QK5_0><<<num_blocks, 1, 0, stream>>>
(cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
}
@@ -336,7 +342,7 @@ static void ggml_cpy_q5_0_f32_cuda(
const int64_t nb10, const int64_t nb11, const int64_t nb12, const int64_t nb13,
cudaStream_t stream) {
const int64_t num_blocks = ne;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_q_f32<cpy_blck_q_f32<dequantize_q5_0, QK5_0>, QK5_0><<<num_blocks, 1, 0, stream>>>(
cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03,
ne10, ne11, ne12, nb10, nb11, nb12, nb13);
@@ -349,7 +355,7 @@ static void ggml_cpy_f32_q5_1_cuda(
GGML_ASSERT(ne % QK5_1 == 0);
const int64_t num_blocks = ne / QK5_1;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_f32_q<cpy_blck_f32_q5_1, QK5_1><<<num_blocks, 1, 0, stream>>>
(cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
}
@@ -362,7 +368,7 @@ static void ggml_cpy_q5_1_f32_cuda(
const int64_t nb10, const int64_t nb11, const int64_t nb12, const int64_t nb13,
cudaStream_t stream) {
const int64_t num_blocks = ne;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_q_f32<cpy_blck_q_f32<dequantize_q5_1, QK5_1>, QK5_1><<<num_blocks, 1, 0, stream>>>(
cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03,
ne10, ne11, ne12, nb10, nb11, nb12, nb13);
@@ -375,7 +381,7 @@ static void ggml_cpy_f32_iq4_nl_cuda(
GGML_ASSERT(ne % QK4_NL == 0);
const int64_t num_blocks = ne / QK4_NL;
GGML_ASSERT(num_blocks < UINT_MAX);
GGML_ASSERT(num_blocks <= INT_MAX);
cpy_f32_q<cpy_blck_f32_iq4_nl, QK4_NL><<<num_blocks, 1, 0, stream>>>
(cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
}
+9 -13
View File
@@ -850,6 +850,7 @@ struct ggml_backend_opencl_context {
ref_count--;
if (ref_count == 0) {
#ifdef GGML_OPENCL_PROFILING
flush_profiling_batch();
write_profiling_info();
profiling_results.clear();
#endif
@@ -10152,14 +10153,8 @@ static void ggml_cl_norm(ggml_backend_t backend, const ggml_tensor * src0, const
float eps;
memcpy(&eps, dst->op_params, sizeof(float));
const int ne00 = src0 ? src0->ne[0] : 0;
const int ne01 = src0 ? src0->ne[1] : 0;
const int ne02 = src0 ? src0->ne[2] : 0;
const int ne03 = src0 ? src0->ne[3] : 0;
const cl_ulong nb01 = src0 ? src0->nb[1] : 0;
const cl_ulong nb02 = src0 ? src0->nb[2] : 0;
const cl_ulong nb03 = src0 ? src0->nb[3] : 0;
GGML_TENSOR_LOCALS(int, ne0, src0, ne);
GGML_TENSOR_LOCALS(cl_ulong, nb0, src0, nb);
const int nth = MIN(64, ne00);
@@ -10173,11 +10168,12 @@ static void ggml_cl_norm(ggml_backend_t backend, const ggml_tensor * src0, const
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(int), &ne02));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(int), &ne03));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(cl_ulong), &nb01));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(cl_ulong), &nb02));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(cl_ulong), &nb03));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(float), &eps));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(float)*nth, NULL));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(cl_ulong), &nb00));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(cl_ulong), &nb01));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(cl_ulong), &nb02));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(cl_ulong), &nb03));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(float), &eps));
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(float)*nth, NULL));
size_t global_work_size[] = {(size_t)ne01*nth, (size_t)ne02, (size_t)ne03};
size_t local_work_size[] = {(size_t)nth, 1, 1};
+5 -2
View File
@@ -24,6 +24,7 @@ kernel void kernel_norm(
int ne01,
int ne02,
int ne03,
ulong nb00,
ulong nb01,
ulong nb02,
ulong nb03,
@@ -43,7 +44,8 @@ kernel void kernel_norm(
// parallel sum
sum[get_local_id(0)] = 0.0f;
for (int i00 = get_local_id(0); i00 < ne00; i00 += get_local_size(0)) {
sum[get_local_id(0)] += x[i00];
// this kernel handles float, nb00/4 translates byte offset to element offset
sum[get_local_id(0)] += x[i00*nb00/4];
}
// reduce
barrier(CLK_LOCAL_MEM_FENCE);
@@ -60,7 +62,8 @@ kernel void kernel_norm(
global float * y = dst + i03*ne02*ne01*ne00 + i02*ne01*ne00 + i01*ne00;
sum[get_local_id(0)] = 0.0f;
for (int i00 = get_local_id(0); i00 < ne00; i00 += get_local_size(0)) {
y[i00] = x[i00] - mean;
// this kernel handles float, nb00/4 translates byte offset to element offset
y[i00] = x[i00*nb00/4] - mean;
sum[get_local_id(0)] += y[i00] * y[i00];
}
+11 -5
View File
@@ -103,8 +103,8 @@ void ggml_sycl_op_conv_3d(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
// allocate packed arrays: A_packed (k x m), B_packed (k x n)
ggml_sycl_pool_alloc<float> A_packed_alloc(ctx.pool());
ggml_sycl_pool_alloc<float> B_packed_alloc(ctx.pool());
A_packed_alloc.alloc((size_t) knl_n_total * patch_total * sizeof(float));
B_packed_alloc.alloc((size_t) knl_n_total * oc * sizeof(float));
A_packed_alloc.alloc((size_t) knl_n_total * patch_total);
B_packed_alloc.alloc((size_t) knl_n_total * oc);
float * A_packed = A_packed_alloc.get();
float * B_packed = B_packed_alloc.get();
@@ -115,10 +115,16 @@ void ggml_sycl_op_conv_3d(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
// Combined kernel: im2col -> pack A, and pack B simultaneously
const char * src1_base = (const char *) src1->data;
const char * src0_base = (const char *) src0->data;
const int64_t src1_nb0 = src1->nb[0];
const int64_t src1_nb1 = src1->nb[1];
const int64_t src1_nb2 = src1->nb[2];
const int64_t src1_nb3 = src1->nb[3];
const int64_t src1_w = src1->ne[0];
const int64_t src1_h = src1->ne[1];
const int64_t src1_d = src1->ne[2];
const bool src0_is_f32 = (src0->type == GGML_TYPE_F32);
// Compute correct strides for src0 as (knl_n_total, oc) matrix
const int64_t src0_packed_nb0 = kernel_type_size;
@@ -165,7 +171,7 @@ void ggml_sycl_op_conv_3d(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
const int64_t sz = dst_z * s2 + kz * d2 - p2;
float val = 0.0f;
if (sx >= 0 && sx < src1->ne[0] && sy >= 0 && sy < src1->ne[1] && sz >= 0 && sz < src1->ne[2]) {
if (sx >= 0 && sx < src1_w && sy >= 0 && sy < src1_h && sz >= 0 && sz < src1_d) {
const int64_t channel_idx = batch_idx * c + ic;
const char * ptr = src1_base + sx * src1_nb0 + sy * src1_nb1 + sz * src1_nb2 + channel_idx * src1_nb3;
val = *(const float *) ptr;
@@ -184,9 +190,9 @@ void ggml_sycl_op_conv_3d(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
const int64_t row = t % k;
const int64_t col = t / k;
const char * src_ptr = (const char *) src0->data + row * src0_packed_nb0 + col * src0_packed_nb1;
const char * src_ptr = src0_base + row * src0_packed_nb0 + col * src0_packed_nb1;
float v;
if (src0->type == GGML_TYPE_F32) {
if (src0_is_f32) {
v = *(const float *) src_ptr;
} else {
v = sycl::vec<sycl::half, 1>(*(const sycl::half *) src_ptr).convert<float, sycl::rounding_mode::automatic>()[0];
+255
View File
@@ -5859,6 +5859,250 @@ static ggml_backend_dev_t ggml_backend_sycl_reg_get_device(ggml_backend_reg_t re
return ctx->devices[index];
}
// ==========================================================================
// Tensor parallelism (--split-mode tensor) for the SYCL backend.
//
// The meta-backend invokes these three entry points via get_proc_address:
// * ggml_backend_sycl_comm_init - one-time per-graph setup
// * ggml_backend_sycl_comm_allreduce_tensor - per-allreduce step
// * ggml_backend_sycl_comm_free - tear-down
//
// For N=2 (dual-GPU), this is a degenerate ring allreduce with dual paths
// chosen by tensor size:
//
// * Small (nelem < 32K): FP32 direct memcpy + per-device ADD
// kernel. The kernel depends_on() its corresponding memcpy event
// so it doesn't read partial data. Both devices run in parallel.
//
// * Large (nelem >= 32K): BF16-compressed. Each device compresses
// its FP32 partial to BF16 locally, cross-device memcpys
// to the peer (half the PCI bandwidth), where it is decompressed
// and added into the local FP32 partial. 6 SYCL submissions per
// allreduce (2 compress + 2 memcpy + 2 decompress-add) vs the
// 4 for the small path, but the bandwidth saving > 6 GB/s PCIe x 2
// dominates for larger tensors.
//
// Storage: A persistent uint8_t buffer per device, sized to
// 4 * nelem bytes. Both paths reinterpret the same bytes (small path
// as nelem floats; large path as outbox + inbox = 2*nelem uint16_t
// each, using the full 4*nelem byte budget either way). Single
// alloc+free per device keeps the SYCL pool's strict-LIFO invariant
// trivial.
//
// For non-(N=2 FP32 contiguous) cases, comm_init or comm_allreduce_tensor
// returns null/false, causing the meta-backend to use its generic
// butterfly all-reduce fallback.
// ==========================================================================
struct ggml_backend_sycl_comm_context {
std::vector<ggml_backend_t> backends;
// ONE persistent per-device byte buffer, 4*nelem bytes. Both the
// FP32 small-tensor path and the BF16 large-tensor path share it
// by reinterpreting.
std::unique_ptr<ggml_sycl_pool_alloc<uint8_t>> buf0;
std::unique_ptr<ggml_sycl_pool_alloc<uint8_t>> buf1;
int64_t buf_nelem = 0;
};
void * ggml_backend_sycl_comm_init(ggml_backend_t * backends, size_t n_backends) try {
for (size_t i = 0; i < n_backends; ++i) {
if (!ggml_backend_is_sycl(backends[i])) {
return nullptr;
}
}
// Initial version: N=2 only. For N!=2, returning null makes the
// meta-backend skip this backend-specific allreduce entirely.
if (n_backends != 2) {
return nullptr;
}
auto * ctx = new ggml_backend_sycl_comm_context;
ctx->backends.assign(backends, backends + n_backends);
auto * sctx0 = (ggml_backend_sycl_context *) backends[0]->context;
auto * sctx1 = (ggml_backend_sycl_context *) backends[1]->context;
ctx->buf0 = std::make_unique<ggml_sycl_pool_alloc<uint8_t>>(sctx0->pool());
ctx->buf1 = std::make_unique<ggml_sycl_pool_alloc<uint8_t>>(sctx1->pool());
return ctx;
}
catch (const sycl::exception &) { return nullptr; }
catch (...) { return nullptr; }
void ggml_backend_sycl_comm_free(void * comm_ctx_v) {
auto * comm_ctx = static_cast<ggml_backend_sycl_comm_context *>(comm_ctx_v);
if (comm_ctx == nullptr) {
return;
}
// Sync both per-device queues so the pool_alloc destructors don't
// return memory still in use by the last kernel.
if (comm_ctx->backends.size() == 2) {
auto * sctx0 = (ggml_backend_sycl_context *) comm_ctx->backends[0]->context;
auto * sctx1 = (ggml_backend_sycl_context *) comm_ctx->backends[1]->context;
try {
sctx0->stream()->wait();
sctx1->stream()->wait();
} catch (...) { /* best effort during shutdown */ }
}
delete comm_ctx;
}
bool ggml_backend_sycl_comm_allreduce_tensor(void * comm_ctx_v, struct ggml_tensor ** tensors) try {
if (comm_ctx_v == nullptr) {
return false;
}
auto * comm_ctx = static_cast<ggml_backend_sycl_comm_context *>(comm_ctx_v);
const size_t n_backends = comm_ctx->backends.size();
// Fast path: N=2, F32/F16, contiguous, matching shapes.
if (n_backends != 2) {
return false;
}
// Accept F32 or F16 inputs natively (types must match). F16 takes the
// direct 2-byte memcpy + add path below; other types return false so the
// meta-backend uses its generic all-reduce.
if (tensors[0]->type != tensors[1]->type) {
return false;
}
if (tensors[0]->type != GGML_TYPE_F32 && tensors[0]->type != GGML_TYPE_F16) {
return false;
}
if (!ggml_is_contiguous(tensors[0]) || !ggml_is_contiguous(tensors[1])) {
return false;
}
if (ggml_nelements(tensors[0]) != ggml_nelements(tensors[1])) {
return false;
}
const int64_t nelem = ggml_nelements(tensors[0]);
const size_t nbytes = ggml_nbytes(tensors[0]);
if (nelem == 0) {
return true;
}
auto * ctx0 = (ggml_backend_sycl_context *) comm_ctx->backends[0]->context;
auto * ctx1 = (ggml_backend_sycl_context *) comm_ctx->backends[1]->context;
queue_ptr q0 = ctx0->stream();
queue_ptr q1 = ctx1->stream();
// Grow per-device byte buffers if needed (4 * nelem bytes each).
if (comm_ctx->buf_nelem < nelem) {
comm_ctx->buf0->realloc(nelem * 4);
comm_ctx->buf1->realloc(nelem * 4);
comm_ctx->buf_nelem = nelem;
}
uint8_t * buf0 = comm_ctx->buf0->get();
uint8_t * buf1 = comm_ctx->buf1->get();
// F16 native path: direct 2-byte cross-device copy + add, skipping the
// F32 round-trip the meta-backend fallback would force. Cross-device copies
// go through dev2dev_memcpy because the two devices are in separate SYCL
// contexts (a raw peer-USM q->memcpy would be a silent no-op).
if (tensors[0]->type == GGML_TYPE_F16) {
sycl::half * f16_out0 = (sycl::half *) tensors[0]->data;
sycl::half * f16_out1 = (sycl::half *) tensors[1]->data;
sycl::half * f16_tmp0 = (sycl::half *) buf0;
sycl::half * f16_tmp1 = (sycl::half *) buf1;
q0->wait();
q1->wait();
dev2dev_memcpy(ctx0->device, *q0, ctx1->device, *q1, f16_tmp0, tensors[1]->data, nbytes);
dev2dev_memcpy(ctx1->device, *q1, ctx0->device, *q0, f16_tmp1, tensors[0]->data, nbytes);
q0->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
f16_out0[i] = (sycl::half) ((float) f16_out0[i] + (float) f16_tmp0[i]);
});
});
q1->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
f16_out1[i] = (sycl::half) ((float) f16_out1[i] + (float) f16_tmp1[i]);
});
});
return true;
}
float * out0 = (float *) tensors[0]->data;
float * out1 = (float *) tensors[1]->data;
// BF16 threshold: above this, the PCIe savings from halving the
// cross-device bytes outweigh the 2 extra compress kernels.
// Below: stay on the FP32 fast path. Threshold mirrors the CUDA
// NCCL allreduce pattern for n_backends=2.
static constexpr int64_t BF16_THRESHOLD = 32768;
if (nelem < BF16_THRESHOLD) {
// FP32 small path: 4 SYCL submissions per allreduce.
float * tmp0 = (float *) buf0;
float * tmp1 = (float *) buf1;
// COMM-D2D-FIX: the two devices are in SEPARATE SYCL contexts, so a raw
// q->memcpy of a peer USM pointer is a silent no-op. Route cross-device
// copies through dev2dev_memcpy (L0 direct copy / host staging). It is
// synchronous, so wait for the local partials to be produced first.
q0->wait();
q1->wait();
dev2dev_memcpy(ctx0->device, *q0, ctx1->device, *q1, tmp0, tensors[1]->data, nbytes);
dev2dev_memcpy(ctx1->device, *q1, ctx0->device, *q0, tmp1, tensors[0]->data, nbytes);
q0->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
out0[i] += tmp0[i];
});
});
q1->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
out1[i] += tmp1[i];
});
});
return true;
}
// BF16 large path: 6 SYCL submissions per allreduce, but the
// cross-device memcpy is HALF the bytes. Pure bit-shift
// conversion (no rounding) — matches ggml's truncating fp32->bf16.
uint16_t * outbox0 = (uint16_t *) buf0;
uint16_t * inbox0 = outbox0 + nelem;
uint16_t * outbox1 = (uint16_t *) buf1;
uint16_t * inbox1 = outbox1 + nelem;
// Phase A: compress each device's local partial in parallel.
sycl::event c0 = q0->parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
outbox0[i] = (uint16_t) (sycl::bit_cast<uint32_t>(out0[i]) >> 16);
});
sycl::event c1 = q1->parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
outbox1[i] = (uint16_t) (sycl::bit_cast<uint32_t>(out1[i]) >> 16);
});
// Phase B: COMM-D2D-FIX-BF16 cross-device copy of compressed bytes via
// dev2dev_memcpy (separate SYCL contexts; sync copy after compress).
const size_t bf16_bytes = nelem * sizeof(uint16_t);
c0.wait();
c1.wait();
dev2dev_memcpy(ctx0->device, *q0, ctx1->device, *q1, inbox0, outbox1, bf16_bytes);
dev2dev_memcpy(ctx1->device, *q1, ctx0->device, *q0, inbox1, outbox0, bf16_bytes);
// Phase C: decompress + add into local FP32 partial.
q0->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
out0[i] += sycl::bit_cast<float>(((uint32_t) inbox0[i]) << 16);
});
});
q1->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
out1[i] += sycl::bit_cast<float>(((uint32_t) inbox1[i]) << 16);
});
});
return true;
}
catch (const sycl::exception &) { return false; }
catch (...) { return false; }
static void *ggml_backend_sycl_reg_get_proc_address(ggml_backend_reg_t reg, const char *name) {
GGML_UNUSED(reg);
@@ -5866,6 +6110,17 @@ static void *ggml_backend_sycl_reg_get_proc_address(ggml_backend_reg_t reg, cons
return (void *)ggml_backend_sycl_split_buffer_type;
}
// Tensor parallelism (--split-mode tensor) entry points.
if (strcmp(name, "ggml_backend_comm_init") == 0) {
return (void *)ggml_backend_sycl_comm_init;
}
if (strcmp(name, "ggml_backend_comm_free") == 0) {
return (void *)ggml_backend_sycl_comm_free;
}
if (strcmp(name, "ggml_backend_comm_allreduce_tensor") == 0) {
return (void *)ggml_backend_sycl_comm_allreduce_tensor;
}
// SYCL doesn't support registering host memory, left here for reference
// "ggml_backend_register_host_buffer"
// "ggml_backend_unregister_host_buffer"
+1
View File
@@ -700,6 +700,7 @@ const char * llm_type_name(llm_type type) {
case LLM_TYPE_160M: return "160M";
case LLM_TYPE_190M: return "190M";
case LLM_TYPE_220M: return "220M";
case LLM_TYPE_230M: return "230M";
case LLM_TYPE_250M: return "250M";
case LLM_TYPE_256M: return "256M";
case LLM_TYPE_270M: return "270M";
+1
View File
@@ -36,6 +36,7 @@ enum llm_type {
LLM_TYPE_160M,
LLM_TYPE_190M,
LLM_TYPE_220M,
LLM_TYPE_230M,
LLM_TYPE_250M,
LLM_TYPE_256M,
LLM_TYPE_270M,
+1 -1
View File
@@ -847,7 +847,7 @@ static void init_quantize_state_counters(quantize_state_impl & qs, std::vector<t
qs.has_tied_embeddings = false;
}
}
qs.n_ffn_down = qs.n_ffn_gate = qs.n_ffn_up = (int)qs.model.hparams.n_layer();
qs.n_ffn_down = qs.n_ffn_gate = qs.n_ffn_up = (int)qs.model.hparams.n_layer_all;
}
//
+1
View File
@@ -13,6 +13,7 @@ void llama_model_lfm2::load_arch_hparams(llama_model_loader & ml) {
hparams.n_layer_dense_lead = hparams.n_layer();
switch (hparams.n_ff()) {
case 2560: type = LLM_TYPE_230M; break;
case 4608: type = LLM_TYPE_350M; break;
case 6912: type = LLM_TYPE_700M; break;
case 8192: type = LLM_TYPE_1_2B; break;
+2
View File
@@ -8176,6 +8176,8 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
test_cases.emplace_back(new test_cpy(GGML_TYPE_I32, GGML_TYPE_I32, {256, 4, 1, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_I32, GGML_TYPE_I32, {256, 1, 4, 1}, {-1,-1,-1,-1}, {1, 2, 0, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {256, 1, 4, 1}, {-1,-1,-1,-1}, {1, 2, 0, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {2, 2097121, 1, 1}, {-1,-1,-1,-1}, {1, 0, 2, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {2, 2, 524281, 1}, {-1,-1,-1,-1}, {1, 0, 2, 3}));
// CPY - different src/dst shapes (reshaping via CPY)
// Use permutations of {3, 5, 7, 32}. Total elements: 3*5*7*32 = 3360.
+2
View File
@@ -146,6 +146,8 @@ int main(int argc, char ** argv) {
}
LOG_INF("Model %d/%d, Context %d/%d: %s\n\n", m + 1, num_models, c + 1, num_contexts, result.c_str());
llama_synchronize(ctx.get());
});
}
}
+11 -13
View File
@@ -1035,25 +1035,23 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
if (!params.hf_repo.empty()) {
for (size_t i = 0; i < params.hf_repo.size(); i++) {
common_params_model model;
if (params.hf_file.empty() || params.hf_file[i].empty()) {
model.hf_repo = params.hf_repo[i];
} else {
model.hf_repo = params.hf_repo[i];
model.hf_file = params.hf_file[i];
common_params p;
p.hf_token = params.hf_token;
p.offline = params.offline;
p.model.hf_repo = params.hf_repo[i];
if (!params.hf_file.empty() && !params.hf_file[i].empty()) {
p.model.hf_file = params.hf_file[i];
}
common_download_opts opts;
opts.bearer_token = params.hf_token;
opts.offline = params.offline;
auto download_result = common_download_model(model, opts);
if (download_result.model_path.empty()) {
// only the text model file is needed
common_models_handler models_handler = common_models_handler_init(p, LLAMA_EXAMPLE_BENCH);
common_models_handler_apply(models_handler, p);
if (p.model.path.empty()) {
fprintf(stderr, "error: failed to download model from HuggingFace\n");
exit(1);
}
params.model.push_back(download_result.model_path);
params.model.push_back(p.model.path);
}
}
+23 -17
View File
@@ -115,22 +115,28 @@ if (TARGET mtmd)
endif()
endif()
add_executable(llama-llava-cli deprecation-warning.cpp)
add_executable(llama-gemma3-cli deprecation-warning.cpp)
add_executable(llama-minicpmv-cli deprecation-warning.cpp)
add_executable(llama-qwen2vl-cli deprecation-warning.cpp)
# Gate CLI binaries on LLAMA_BUILD_TOOLS so that standalone library-only
# builds (LLAMA_BUILD_MTMD=ON with LLAMA_BUILD_TOOLS=OFF — e.g. Apple
# XCFramework packaging) skip the executables entirely. LLAMA_BUILD_COMMON
# defaults to ON in standalone builds, so we cannot rely on it for gating.
if (LLAMA_BUILD_TOOLS)
add_executable(llama-llava-cli deprecation-warning.cpp)
add_executable(llama-gemma3-cli deprecation-warning.cpp)
add_executable(llama-minicpmv-cli deprecation-warning.cpp)
add_executable(llama-qwen2vl-cli deprecation-warning.cpp)
set(TARGET llama-mtmd-cli)
add_executable (${TARGET} mtmd-cli.cpp)
set_target_properties (${TARGET} PROPERTIES OUTPUT_NAME llama-mtmd-cli)
if(LLAMA_TOOLS_INSTALL)
install(TARGETS ${TARGET} RUNTIME)
set(TARGET llama-mtmd-cli)
add_executable (${TARGET} mtmd-cli.cpp)
set_target_properties (${TARGET} PROPERTIES OUTPUT_NAME llama-mtmd-cli)
if(LLAMA_TOOLS_INSTALL)
install(TARGETS ${TARGET} RUNTIME)
endif()
target_link_libraries (${TARGET} PRIVATE llama-common mtmd Threads::Threads)
target_compile_features(${TARGET} PRIVATE cxx_std_17)
# mtmd-debug tool
add_executable(llama-mtmd-debug debug/mtmd-debug.cpp)
set_target_properties(llama-mtmd-debug PROPERTIES OUTPUT_NAME llama-mtmd-debug)
target_link_libraries(llama-mtmd-debug PRIVATE llama-common mtmd Threads::Threads)
target_compile_features(llama-mtmd-debug PRIVATE cxx_std_17)
endif()
target_link_libraries (${TARGET} PRIVATE llama-common mtmd Threads::Threads)
target_compile_features(${TARGET} PRIVATE cxx_std_17)
# mtmd-debug tool
add_executable(llama-mtmd-debug debug/mtmd-debug.cpp)
set_target_properties(llama-mtmd-debug PROPERTIES OUTPUT_NAME llama-mtmd-debug)
target_link_libraries(llama-mtmd-debug PRIVATE llama-common mtmd Threads::Threads)
target_compile_features(llama-mtmd-debug PRIVATE cxx_std_17)
+5 -1
View File
@@ -40,6 +40,7 @@ struct debug_options {
bool enable_reasoning = true;
bool debug_jinja = false;
bool force_tool_call = false;
bool parallel_tool_calls = true;
output_mode mode = output_mode::BOTH;
input_message_type input_message = input_message_type::NONE;
};
@@ -87,6 +88,7 @@ static void print_usage(const char * program_name) {
LOG_ERR("\nOptions:\n");
LOG_ERR(" --no-tools Disable tool definitions\n");
LOG_ERR(" --force-tool-call Set tool calls to forced\n");
LOG_ERR(" --parallel-tool-calls=0|1 Set parallel_tool_calls (default: 1)\n");
LOG_ERR(" --generation-prompt=0|1 Set add_generation_prompt (default: 1)\n");
LOG_ERR(" --enable-reasoning=0|1 Enable reasoning parsing (default: 1)\n");
LOG_ERR(" --output=MODE Output mode: analysis, template, both (default: both)\n");
@@ -121,6 +123,8 @@ static bool parse_options(int argc, char ** argv, debug_options & opts) {
opts.debug_jinja = true;
} else if (arg == "--no-tools") {
opts.with_tools = false;
} else if (arg.rfind("--parallel-tool-calls=", 0) == 0) {
opts.parallel_tool_calls = parse_bool_option(arg.substr(22));
} else if (arg.rfind("--generation-prompt=", 0) == 0) {
opts.generation_prompt = parse_bool_option(arg.substr(20));
} else if (arg.rfind("--enable-reasoning=", 0) == 0) {
@@ -349,7 +353,7 @@ static autoparser::generation_params prepare_params(const debug_options & opts,
params.tools = json();
params.tool_choice = COMMON_CHAT_TOOL_CHOICE_NONE;
}
params.parallel_tool_calls = false;
params.parallel_tool_calls = opts.parallel_tool_calls;
return params;
}
+9 -19
View File
@@ -223,8 +223,8 @@ void server_model_meta::update_caps() {
"LLAMA_ARG_HF_REPO_FILE",
});
params.offline = true;
// params.skip_download = true; // TODO: ideally, we should validate the model here, but it takes too much time
common_params_handle_models(params, LLAMA_EXAMPLE_SERVER, {});
common_models_handler handler = common_models_handler_init(params, LLAMA_EXAMPLE_SERVER);
common_models_handler_apply(handler, params); // note: this won't download the model because offline=true
if (params.mmproj.path.empty()) {
multimodal = { false, false };
} else {
@@ -1393,9 +1393,8 @@ struct server_download_state : public common_download_callback {
bool run(common_params & params) {
try {
common_params_handle_models_params p;
p.callback = this;
common_params_handle_models(params, LLAMA_EXAMPLE_SERVER, p);
common_models_handler handler = common_models_handler_init(params, LLAMA_EXAMPLE_SERVER);
common_models_handler_apply(handler, params, this);
is_ok = true;
} catch (const std::exception & e) {
auto model_name = params.model.get_name();
@@ -1768,23 +1767,14 @@ void server_models_routes::init_routes() {
throw std::invalid_argument("model must be a non-empty string");
}
common_params_model model;
common_download_opts opts;
common_params p;
p.model.hf_repo = name;
p.hf_token = params.hf_token;
model.hf_repo = name;
opts.bearer_token = params.hf_token;
// note: we only check main model, no need sidecar here
opts.download_mmproj = false;
opts.download_mtp = false;
// first, only check if the model is valid and can be downloaded
opts.skip_download = true;
// validate by fetching metadata
bool ok = false;
try {
auto validation = common_download_model(model, opts);
ok = !validation.model_path.empty();
} catch (const common_skip_download_exception &) {
// model is valid and will be downloaded
common_models_handler_init(p, LLAMA_EXAMPLE_SERVER);
ok = true;
} catch (...) {
SRV_ERR("unknown error while validating model '%s'\n", name.c_str());
+35 -9
View File
@@ -89,15 +89,16 @@ int llama_server(int argc, char ** argv) {
llama_backend_init();
llama_numa_init(params.numa);
// note: router mode also accepts -hf remote-preset, so we need to check that first
if (!params.model.hf_repo.empty()) {
try {
common_params_handle_models_params handle_params;
handle_params.preset_only = true;
common_params_handle_models(params, LLAMA_EXAMPLE_SERVER, handle_params);
} catch (const std::exception & e) {
// ignored for now
common_models_handler models_handler;
try {
models_handler = common_models_handler_init(params, LLAMA_EXAMPLE_SERVER);
if (common_models_handler_is_preset_repo(models_handler)) {
// apply the preset and start the server in router mode
common_models_handler_apply(models_handler, params);
}
} catch (const std::exception & e) {
SRV_ERR("failed to fetch model metadata: %s\n", e.what());
return 1;
}
// router server never loads a model and must not touch the GPU
@@ -241,6 +242,19 @@ int llama_server(int argc, char ** argv) {
// Google Cloud Platform (Vertex AI) compat
ctx_http.register_gcp_compat();
// return 403 for disabled features
server_http_context::handler_t res_403 = [](const server_http_req &) {
auto res = std::make_unique<server_http_res>();
res->status = 403;
res->data = safe_json_to_str({
{"error", {
{"message", "this feature is disabled"},
{"type", "feature_disabled"},
}}
});
return res;
};
// CORS proxy (EXPERIMENTAL, only used by the Web UI for MCP)
if (params.ui_mcp_proxy) {
SRV_WRN("%s", "-----------------\n");
@@ -249,7 +263,11 @@ int llama_server(int argc, char ** argv) {
SRV_WRN("%s", "-----------------\n");
ctx_http.get ("/cors-proxy", ex_wrapper(proxy_handler_get));
ctx_http.post("/cors-proxy", ex_wrapper(proxy_handler_post));
} else {
ctx_http.get ("/cors-proxy", ex_wrapper(res_403));
ctx_http.post("/cors-proxy", ex_wrapper(res_403));
}
// EXPERIMENTAL built-in tools
if (!params.server_tools.empty()) {
try {
@@ -264,6 +282,9 @@ int llama_server(int argc, char ** argv) {
SRV_WRN("%s", "-----------------\n");
ctx_http.get ("/tools", ex_wrapper(tools.handle_get));
ctx_http.post("/tools", ex_wrapper(tools.handle_post));
} else {
ctx_http.get ("/tools", ex_wrapper(res_403));
ctx_http.post("/tools", ex_wrapper(res_403));
}
//
@@ -274,7 +295,12 @@ int llama_server(int argc, char ** argv) {
return child.run_download(params);
} else if (!is_router_server) {
// single-model mode (NOT spawned by router)
common_params_handle_models(params, LLAMA_EXAMPLE_SERVER, {});
try {
common_models_handler_apply(models_handler, params);
} catch (const std::exception & e) {
SRV_ERR("failed to download model: %s\n", e.what());
return 1;
}
}
//
+1 -1
View File
@@ -16,7 +16,7 @@ def test_mcp_no_proxy():
server.start()
res = server.make_request("GET", "/cors-proxy")
assert res.status_code == 404
assert res.status_code == 403
def test_mcp_proxy():
@@ -14,6 +14,7 @@
import { useKeyboardShortcuts } from '$lib/hooks/use-keyboard-shortcuts.svelte';
import { conversationsStore, conversations } from '$lib/stores/conversations.svelte';
import { chatStore } from '$lib/stores/chat.svelte';
import { config } from '$lib/stores/settings.svelte';
import { RouterService } from '$lib/services/router.service';
import { isMobile } from '$lib/stores/viewport.svelte';
import { TooltipSide } from '$lib/enums';
@@ -34,6 +35,14 @@
const isStripExpanded = $derived(isExpandedMode || hoveredTooltip !== null);
const isOnMobile = $derived(isMobile.current);
const alwaysShowOnDesktop = $derived(config().alwaysShowSidebarOnDesktop as boolean);
// Keep the sidebar expanded on desktop when the user pins it open
$effect(() => {
if (alwaysShowOnDesktop && !isOnMobile) {
isExpandedMode = true;
}
});
function toggleExpandedMode() {
isExpandedMode = !isExpandedMode;
@@ -183,7 +192,7 @@
/>
</div>
{#if isExpandedMode || isOnMobile}
{#if isOnMobile || (isExpandedMode && !alwaysShowOnDesktop)}
<div
class="flex items-center transition-all duration-150 ease-out {isMobile.current &&
!isExpandedMode
+6 -3
View File
@@ -392,11 +392,14 @@ class ToolsStore {
} catch (err) {
const errorMessage = err instanceof Error ? err.message : String(err);
this._error = errorMessage;
// 404 from /tools means the server was started without --tools
if (errorMessage.includes('404') || errorMessage.toLowerCase().includes('not found')) {
// 403 from /tools means the server was started without --tools
// TODO: check status code instead of relying on message
if (errorMessage.includes('this feature is disabled')) {
this._toolsEndpointUnreachable = true;
console.info('[ToolsStore] Built-in tools are disabled on the server');
} else {
console.error('[ToolsStore] Failed to fetch built-in tools:', err);
}
console.error('[ToolsStore] Failed to fetch built-in tools:', err);
} finally {
this._loading = false;
}
-8
View File
@@ -33,8 +33,6 @@
import { SETTINGS_KEYS } from '$lib/constants';
let { children } = $props();
let alwaysShowSidebarOnDesktop = $derived(config().alwaysShowSidebarOnDesktop);
let isDesktop = $derived(!isMobile.current);
let innerHeight = $state<number | undefined>();
let innerWidth = $state(browser ? window.innerWidth : 0);
@@ -164,12 +162,6 @@
updateFavicon();
});
$effect(() => {
if (alwaysShowSidebarOnDesktop && isDesktop) {
return;
}
});
// Initialize server properties on app load (run once)
$effect(() => {
// Only fetch if we don't already have props