Compare commits

..

86 Commits

Author SHA1 Message Date
Georgi Gerganov 5dcb711666 speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988)
* speculative : add common_speculative_n_max helper function

Extract the speculative max-draft-size logic from server_n_outputs_max
into a reusable common_speculative_n_max() function in common/speculative.

Assisted-by: llama.cpp:local pi

* cont : draft context always has n_parallel outputs

* llama : log n_outputs_max

* speculative : remove draft-simple auto-enable

* ci : enable server tests on PRs
2026-06-01 22:26:58 +03:00
Christian Hoener zu Siederdissen 5aa3a64596 nix : add nix-nodejs facilities to build Web UI (#23846)
* nix: add nix-nodejs facilities to build Web UI

Build the Web UI locally using standard Nix systems for building NodeJS
packages.

- Create derivation for the web UI
- npm dependencies are imported via buildNodeModules. Does not require
  setting any shasum.
- Copy build artifacts to the correct folders.
- Prevents having to download from huggingface.co

Fixes #23067

* nix: simplify webui derivation using LLAMA_UI_OUT_DIR

- Move npm build to installPhase with LLAMA_UI_OUT_DIR=$out to write
  output directly to the Nix store
- Copy built assets to tools/ui/dist (source tree) instead of
  build/tools/ui/dist so CMake's copy_src_dist() finds them
2026-06-01 14:01:26 -04:00
shaofeiqi 27d9ed8397 opencl: add basic support for q5_0 and q5_1 (#23548)
* opencl: add general q5_0 support

* opencl: add general q5_1 support

* opencl: support non-uniform workgrp size

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-06-01 10:06:50 -07:00
Adrien Gallouët 335abed17d vendor : update cpp-httplib to 0.46.1 (#23980)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-06-01 19:40:10 +03:00
Aman Gupta de6f727aae llama: limit max outputs of llama_context (#23861)
* llama: save more VRAM by reserving n_outputs == n_seqs when possible

* add n_outputs_per_seq

* move n_outputs_max to server-context

* change ubatch to batch everywhere
2026-06-01 18:01:38 +03:00
Shrivas Shankar 95b8b8ec1a metal: template GLU kernels to support f16/f32 (#23882)
Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.
2026-06-01 15:40:28 +03:00
Jeff Bolz 55ac0909e5 vulkan: don't hold the device mutex while compiling pipelines (#23641)
* vulkan: don't hold the device mutex while compiling pipelines

We need to hold a lock while we traverse all pipelines and lazily initialize
them, but we don't need to hold it while the pipeline is being compiled. And
it doesn't need to be the same lock as the device mutex. We call load_shaders
each time a pipeline is needed, so we only need to compile that one pipeline
(and, for example, don't want to end up compiling a pipeline that another
thread should be compiling).

* remove 'needed'
2026-06-01 14:04:01 +02:00
Winston Ma bef69f1306 vulkan: reduce host memory lock contention (#23376)
* vulkan: reduces lock contention

* replace unique_lock with lock_guard
2026-06-01 14:03:32 +02:00
o7si 5aba5364d9 vocab: add normalizer.lowercase support to WPM (#23899)
* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer)

* vocab : add normalizer.lowercase support to WPM

* vocab : default normalizer.lowercase to false for whitespace pre-tokenizer
2026-06-01 14:26:47 +03:00
Johannes Gäßler 8e6fff84de TP: quantized KV cache support (#23792)
* TP: quantized KV cache support

* fix partial view

* remove overly strict assert
2026-06-01 12:30:10 +02:00
Georgi Gerganov 02a57017f6 security : disable private disclosures (#23963) 2026-06-01 13:14:12 +03:00
Junwon Hwang 48b88c3b00 model: Add EXAONE 4.5 implementations (#21733)
* Add EXAONE 4.5 and Add GQA for MMproj

* mtmd: EXAONE 4.5 vision markers and projector path

EXAONE 4.5 uses <vision> and </vision> for image boundaries; Qwen keeps
<|vision_start|> and <|vision_end|>.

Route EXAONE 4.5 through the Qwen2.5-VL-style encode path (window attention
pattern, optional mmproj input norm). Update exaone4_5 projector weights and
convert_hf_to_gguf for mmproj export.

* mtmd: load EXAONE4 nextn tensors correctly

Align EXAONE4 tensor registration with EXAONE_MOE for NextN/MTP slots and avoid skip-flag propagation on duplicated rope_freqs so model loading succeeds for EXAONE 4.5 GGUF.

* Minor fixes

* Address PR feedback

* Address PR feedback

* Fix EXAONE after merge

* Fix EXAONE 4.5 conversion

* Address PR feedback

* Refactor EXAONE 4.5 conversion

* Address PR feedback

* Fix unintended deletion

* Minor fix

---------

Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai>
2026-06-01 11:48:53 +02:00
Matt Corallo 19620004f5 vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (#23056)
Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even
though they're only 2-byte aligned, and Q3_K still wins on
NVIDIA as well.

mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, we can do
subtraction directly on a full int32_t rather than an i8vec4
with bit twiddling because the high bit is always free to start.

On Intel BMG on mesa, the switch to MMVQ provides an immediate
~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and
~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

The futher switch to block loads leads to a ~24% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA
override for K quants on Xe2 as well.
2026-06-01 11:46:48 +02:00
Winston Ma f8c0a19d46 vulkan: Removed unused functions (#23175) 2026-06-01 11:46:23 +02:00
Aldehir Rojas 5254a7994d common : support manually triggering the reasoning budget end sequence (#23949) 2026-06-01 11:37:11 +02:00
Georgi Gerganov e22b0de60d ci : add missing Linux label to cpu-x64-high-perf runner (#23958)
Fixes: https://github.com/ggml-org/llama.cpp/pull/23927#discussion_r3332213086

The cpu-x64-high-perf job was missing the Linux label in its runs-on
specification, causing the runner to not be discovered. All other
self-hosted Linux jobs include this label.

Assisted-by: llama.cpp:local pi
2026-06-01 10:39:59 +03:00
Neo Zhang a51142497a [SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (#23812)
* support Q4_1, Q5_0, Q5_1

* update ut case
2026-06-01 09:53:53 +03:00
Neo Zhang 4162522688 [SYCL] Add more types in GET_ROWS OP (#23710)
* add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP

* correct the link
2026-06-01 09:53:04 +03:00
Neo Zhang 44e211cecf sycl : Optimize Q3_K mul_mat by reorder (#23725) 2026-06-01 09:50:55 +03:00
Eve af6528e6df ci: remove redundant or duplicate jobs (#23927)
* remove redundant apple job

openvino gpu and cpu test can share the same build and machine

Update build-rpc.yml

Update build-openvino.yml

cpu any doesnt make sense as we have an arm job already, so do high perf on both x86 and arm

remove duplicate x86 vulkan

combine backend sampling

Update server.yml

run server on arm as windows is x86

* emdawn on one machine only

* fix openvino, remove cpu tag as we dont have many x64 machines with that tag
2026-06-01 06:32:17 +03:00
Eric Zhang 6f165c1c64 server : handle If-None-Match weak ETags (#23916) 2026-05-31 16:21:08 -05:00
Georgi Gerganov 399739d5c5 ci : limit trigger paths for the CPU workflow (#23938) 2026-05-31 19:02:47 +03:00
o7si d4c8e2c29c vocab : add tokenizer support for jina-embeddings-v2-base-zh (#18756)
* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer)

* lowercase defaults to true

* type fix

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-31 12:37:35 +02:00
Eric Zhang 3292da09f6 ui: fix ETag truncation with MSVC compiler (#23917) 2026-05-31 11:21:23 +02:00
Vladislav e6123e2080 docs : update ZenDNN docs for Q8 support (#23791)
* docs zendnn added information about Q8 support

* docs zendnn rm unnecessary data

* docs update, links to ZenDNN docs provided

* docs zenDNN update: clarified explanation

* docs zenDNN update: one more explanation clarified

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
2026-05-31 10:26:42 +02:00
Ruben Ortlam 22cadc1944 llama: only use one iGPU device by default (#23897) 2026-05-31 08:17:47 +02:00
Pascal d749821db3 webui: add custom CSS injection via config (#23904)
* webui: add custom CSS injection via config

register a customCSS setting in the Developer section under Custom JSON,
syncable so it rides the existing ui-config pass through. inject the value
into a single style element in the head, reactive on the setting. lets an
operator theme a prebuilt binary through --ui-config without rebuilding,
and lets a user set it from the settings panel.

* ui: address review from @niutech and @allozaur, rename custom JSON key and CSS field

* ui: address review from @allozaur, move custom CSS injection to a style tag in svelte:head

* ui: inject custom CSS through a svelte action instead of a bound element

move the textContent write into a use: action on the head style node.
the action is the idiomatic way to touch a node, so the no-dom-manipulating
lint rule is satisfied without a disable. value stays text through
textContent, never parsed as HTML.

* Update tools/ui/src/lib/constants/settings-keys.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: address review from @allozaur, rename custom config key to customJson with migration

rename the custom config key to customJson across the type, the chat
request builder, the settings save check and the custom tools reader,
keeping the custom API param name unchanged. add a non destructive
migration that copies the legacy custom key to customJson at startup.
only render the head style tag when custom CSS is set.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-05-30 23:49:31 +02:00
Gaurav Garg aa46bda89b Support -fa auto in llama-bench (#23714)
* Support `-fa auto` in llama-bench

Make the default value of `-ngl` -1, similar to other tools.

Update README with latest usage and examples

* Address review comments
2026-05-31 02:03:57 +05:30
lhez d6588daa80 opencl: support bf16 by converting to f16 (#23839) 2026-05-30 10:17:47 -07:00
Pascal d38d50e7ff ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910) 2026-05-30 16:50:54 +02:00
Johannes Gäßler 8b0e0db606 TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (#23843)
* TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs

* fix afmoe TP
2026-05-30 16:48:00 +03:00
Georgi Gerganov 2d9b7c8e98 metal : restore im2col implementation for large kernels (#23901) 2026-05-30 15:26:13 +03:00
Xuan-Son Nguyen e674b1279b test: (test-llama-archs) log the config name first (#23885) 2026-05-30 12:22:38 +02:00
Georgi Gerganov 4c4e91b799 ci : update ios-xcode release job to macos-26 (#23906)
* ci : disable libcommon build from xcframework

* ocd : fix name

* ci : ios-xcode change to macos-26

* cont : pin xcode

* cont : pin xcode to minor version
2026-05-30 13:21:46 +03:00
Jinyang He d48a56effb ggml : add some lsx support (#23798)
* loongarch : optimize LSX fp16 load/store with native intrinsics

Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in
__lsx_f16x4_load and __lsx_f16x4_store.

* loongarch : add LSX implementation for q8_0 dot product

* loongarch : add LSX implementation for q6_K dot product

* loongarch : add LSX implementation for iq4_xs dot product

* Improve reduce ops when sun int16 pairs to int32
2026-05-30 11:53:26 +03:00
Ruben Ortlam 6e093b80ea vulkan: add Flash Attention support for BFloat16 KV cache (#23420)
* vulkan: add flash attention bf16 kv support

* vulkan: bf16 FA coopmat1 support

* vulkan: bf16 FA coopmat2 support

* fix FA bf16 f32 fallback

* fix FA bf16 coopmat1 shader

* fix FA bf16 coopmat2 shader

* code cleanup

* cleanup comment change

* address feedback

* add O_TYPE for cm2 FA

* use O_TYPE for gqaStore function

* reduce BFLOAT16 ifdefs
2026-05-30 10:39:31 +02:00
Georgi Gerganov 337528571d ci : fix s390x release job (#23898)
* ci : fix s390x release job

* ci : multi-thread build for `ios-xcode`

* ocd : names
2026-05-30 09:21:38 +03:00
Georgi Gerganov d4204b03a5 ci : clear cache instead of "no timestamp" keys + fix macos (#23895)
* ci : ios use macos-15 again

* ci : add and test ccache-clear

* cont : fix

* cont : set permission

* cont : another permission

* cont : token

* cont : print key

* cont : bring back perms

* cont : test windows

* cont : add token

* cont : cleanup

* ci : make release jobs clean-up their ccache
2026-05-30 08:52:30 +03:00
Radoslav Gerganov 1738129bee llama : do not skip iGPU when only RPC devices are present (#23868)
After #23007 reclassified integrated CUDA/HIP devices as IGPU, the device
selection logic dropped the local iGPU whenever any RPC server was added,
because RPC devices made `model->devices` non-empty. On systems where the
"iGPU" is the main compute device (e.g. Strix Halo with 128 GiB of unified
memory), this caused all tensors to be allocated on the RPC peer alone and
model loading to fail.

Gate the iGPU inclusion on `gpus.empty()` instead, so RPC peers no longer
suppress the local iGPU.

closes: #23858
2026-05-30 07:48:22 +03:00
Xuan-Son Nguyen 0821c5fcfd server: in SSE mode, send HTTP headers when slot starts (#23884)
* server: in SSE mode, send HTTP headers when slot starts

* ref to pr

* stream should be false by default
2026-05-30 00:06:29 +02:00
Reese Levine 151f3a98e9 ggml-webgpu: Check earlier for WebGPU required features (#23879) 2026-05-29 14:16:05 -07:00
Reese Levine b22da25889 ggml-webgpu: add q4_0/q8_0 SET_ROWS (#23760)
* Add q8_0 and q4_0 set_rows

* Add fast(er) quantization set_rows path

* formatting/naming

* a little more naming

* Remove unused constant

* Don't override other override

* Avoid bitcast

* Narrow relaxation
2026-05-29 14:14:11 -07:00
Ruixiang Wang 689a9a470e server-bench : add speed-bench for speculative decoding benchmarking (#23869)
* spec: add speed-bench support for benchmarking

* speed-bench : add trailing newline to requirements.txt

* speed-bench : bump datasets to 4.8.0 to fix ty check

* server-bench : remove now-unused type: ignore after datasets bump
2026-05-29 23:09:47 +02:00
Pascal 5a46b46acd app: add llama update self updater (#23865)
* wip: llama update POC

* cleaning: llama update

* llama-gen-docs

* app: delegate llama update to the install script

* app: spawn the installer detached so llama update can replace a running binary

* cleaning: inline llama update into llama.cpp, drop app-update.{cpp,h}

* app: make llama_update static

Address review from @angt
2026-05-29 23:02:40 +02:00
ValdikSS 22d66b567e ui: handle audio/vnd.wave as audio WAV file (#23754)
Firefox on Linux uses this MIME type
2026-05-29 21:41:35 +02:00
Tarek Dakhran 2084434e66 vocab : support tokenizer for LFM2.5-8B-A1B (#23826)
* vocab: Support tokenizer for LFM2.5-8B-A1B

* Keep liquid6 tokenizer in models
2026-05-29 20:25:43 +02:00
Sigbjørn Skjæret 764f1e64a1 graph : ensure DS32 kq_mask_lid is F32 (#23864) 2026-05-29 19:55:14 +02:00
Xuan-Son Nguyen b5f52280fb server: remove obsolete scripts (#23870) 2026-05-29 19:47:30 +02:00
Georgi Gerganov dc71236b6c ci : update macos release to use macos-26 runner (#23878) 2026-05-29 20:41:57 +03:00
Xuan-Son Nguyen 06d26dfdff download: add option to skip_download (#23059)
* download: add option to skip_download

* fix

* fix 2

* if file doesn't exist, respect skip_download flag
2026-05-29 16:30:55 +02:00
Saba Fallah da3f990a47 mtmd: Add DeepSeekOCR 2 Support (#20975)
* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution

* introduced clip_image_f32::add_viewsep

* address PR review

- drop redundant ggml_cpy ops in both deepseekocr versions build
- drop no-op ggml_cont in build_sam
- assert num_image_tokens deepseekocr2
- view_seperator as (1, n_embd) at conversion (for both versions)
- drop redundant ggml_reshape_2d

* Update tools/mtmd/models/deepseekocr2.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-05-29 16:13:51 +02:00
Oliver Simons 6ed481eea4 CUDA: Check PTX version on host side to guard PDL dispatch (#23530)
* CUDA: Check PTX version on host side to guard PDL dispatch

Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).

Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
 https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

Magic constants were taken from boost:
https://github.com/boostorg/container_hash/blob/2698b43803c012601e6bb1a6116e83767b97986c/include/boost/container_hash/detail/hash_mix.hpp#L19-L65

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero

* Apply code-formatting

* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 12:28:18 +02:00
Xuan-Son Nguyen cb47092b00 server: bump timeout to 3600s (#23842)
* server: bump timeout to 3600s

* nits: change wording
2026-05-29 10:23:17 +02:00
fairydreaming 1f0aa2a696 model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (#23346)
* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)

* convert : handle DeepseekV32ForCausalLM architecture

* ggml : support for f16 GGML_OP_FILL

* memory : separate hparams argument in llama_kv_cache constructor

* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)

* llama : support for LLM_ARCH_DEEPSEEK32

* model : llama_model_deepseek32 implementation

* model : merge two scale operations into one in DSA lightning indexer implementation

* chore : remove unused code

* model : support NVFP4 in DeepSeek V3.2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-05-29 10:15:17 +02:00
Aman Gupta 031ddb2e08 llama: use f16 mask for FA to save VRAM (#23764)
* llama: use f16 mask for FA

* review: add llama_cast + formatting

* simplify
2026-05-29 15:44:43 +08:00
Georgi Gerganov fe12e422ad sync : ggml 2026-05-29 09:56:08 +03:00
Georgi Gerganov ea02bc37f5 ggml : bump version to 0.13.1 (ggml/1523) 2026-05-29 09:56:08 +03:00
Omid Azizi b000431a0b ngram-mod : Add missing include (#23857)
[no release]

Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>
2026-05-29 09:21:37 +03:00
Aman Gupta eef59a7642 llama: add llm_graph_input_mtp (#23643)
* llama: add llm_graph_input_mtp

* rename input_mtp -> input_token_embd

* add TODO about mtmd embedding

* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-29 09:17:32 +03:00
Adrien Gallouët 98e480a32e app : move licences to llama-app (#23824)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-29 07:46:11 +02:00
Andreas Kieslinger 241cbd41d2 cuda : disables launch_fattn PDL enrollment due to compiler bug (#23825) 2026-05-29 07:46:10 +03:00
Matt Corallo 33c718db1f meta : Add missing buffer set in allreduce fallback !COMPUTE clear (#23480)
Without this at least the vulkan backend will skip the `* 0` for
!COMPUTE tensors, causing corrupt output.
2026-05-29 06:30:24 +03:00
Max Krasnyansky 19e92c33ef hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (#23835)
Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
2026-05-28 14:05:54 -07:00
Xuan-Son Nguyen 751ebd17a5 mtmd-debug: add color and rainbow mode (#23829)
* mtmd-debug: add color and rainbow mode

* fix M_PI

* max_dist
2026-05-28 20:59:14 +02:00
Xuan-Son Nguyen c8914ad4f4 mtmd: fix gemma 4 projector pre_norm (#23822) 2026-05-28 20:58:55 +02:00
lhez 408ae2b9e5 opencl: move backend info printing into its own function (#23702)
* opencl: move backend info print into its own function

* opencl: move new log line

* opencl: fix for non adreno path
2026-05-28 11:05:42 -07:00
Sigbjørn Skjæret 3ef2369551 ci : run ui publish on ubuntu-slim (#23818)
* run ui publish on self-hosted fast

* run on ubuntu-slim
2026-05-28 20:58:32 +03:00
ValdikSS 2f6c815dc4 ui: fix audio and video modality detection (#23756)
When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.
2026-05-28 17:36:10 +02:00
Georgi Gerganov 445b7cef62 ci : releases use Github-hosted builds for the UI (#23823)
* ci : releases use Github-hosted builds for the UI

* cont : fix name
2026-05-28 17:50:32 +03:00
Adrien Gallouët 479a9a1b03 app : improve help output (#23805)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-28 16:45:06 +02:00
Saba Fallah 0b56d283bf mtmd: n_head_kv defaults to n_head (#23782)
removed AI-generated comment
2026-05-28 16:44:36 +02:00
Xuan-Son Nguyen d6be3158e1 mtmd: fix gemma 4 audio rms norm eps (#23815)
* mtmd: fix gemma 4 audio rms norm eps

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-28 16:31:37 +02:00
Georgi Gerganov dd1557907a ci : change Vulkan builds to Release to reduce ccache (#23820)
* ci : disable all CPU variant builds for Vulkan workflow

* cont : change cache key

* cont : change build type
2026-05-28 17:29:11 +03:00
Mikolaj Kucharski 7fb1e70b59 arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167) 2026-05-28 16:25:40 +02:00
Johannes Gäßler d374e71e55 test-llama-archs: fix table format [no release] (#23810) 2026-05-28 15:53:54 +02:00
fl0rianr 30af6e2b98 ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007) 2026-05-28 15:01:14 +02:00
redfox d7be46189f mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729)
* mmvq Optim:  add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING

* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-28 14:51:14 +02:00
Jaden_Mach bc81d47aba CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (#23227)
* CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware

The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8)
to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled
GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs
substantially by quant family because the per-row GEMV cost is dominated
by dequantisation, not the dot-product itself: K-quants pay a heavier
super-block decode and so MMQ wins sooner; legacy and IQ quants have
lean decode and stay ahead until the batch fully populates an MFMA tile.

This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool,
mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant
thresholds on amd_mfma_available(cc):

  Q3_K, Q4_K, Q5_K  : MMVQ <= 3   (MMQ wins from batch=4: +5% .. +76%)
  Q2_K, Q6_K        : MMVQ <= 5   (MMQ wins from batch=6: +8% .. +35%)
  others            : MMVQ <= 8   (legacy & IQ regress under MMQ; unchanged)

Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical
to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold
for A/B testing.

Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct,
llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps.
Full table in PR description.

  Selected pp512 throughput (tok/s, ub=8):
    Q4_K_S:  559 -> 940  (+68%)
    Q5_K_S:  503 -> 884  (+76%)
    Q3_K_S:  629 -> 879  (+40%)
    Q2_K  :  615 -> 809  (+32%)
    Q6_K  :  582 -> 776  (+33%)

  Selected pp512 throughput (tok/s, ub=4):
    Q4_K_S:  444 -> 480  (+ 8%)
    Q4_0  :  682 -> 685  (+ 0%)   (no regression - retains MMVQ)
    IQ4_XS:  706 -> 698  (- 1%)   (no regression - retains MMVQ)

* CUDA: address review — inline MMVQ batch table, drop env hatch & doc block

* tune kernel selection logic for CDNA1

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-28 14:50:25 +02:00
Funtowicz Morgan 0b246862b9 server: minor tweaks to use more cpp features (#23785)
* misc(server): add default port to impl RAII

* misc(server): register_gcp_compat() can be const

* misc(server): use proper cpp const/auto methods

* misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe
2026-05-28 14:00:25 +02:00
Max Krasnyansky a919001134 hexagon: minor refresh for HMX FA and MM (#23796)
* hex-fa: clean up qf32/fp32 handling and stride handling

* hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79

* hex-fa: vectorize leftover handling

* hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity

* hmx-mm: remove dead code

* hmx-mm: use fastdiv in x4x2 dequant

* hmx-mm: sandwich dequant and scatter to improve perf

* hmx-mm: fixed rebase conflicts

* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv

* hmx-mm: an even earlier dispatch for per-type dequant

* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

This is a bit faster than LUT.

* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-05-28 04:49:11 -07:00
Jeff Bolz 48e7078ee0 vulkan: fast path for walsh-hadamard transform (#23687)
* vulkan: fast path for walsh-hadamard transform

* disable for intel due to segfault
2026-05-28 13:18:43 +02:00
Jesus Talavera bb771cbd2b chat : add Granite 4.1 chat template (#23518) 2026-05-28 13:13:33 +02:00
Winston Ma 7c48fb81ce vulkan: fix wrong index variable in inner loop (#23665) 2026-05-28 12:48:34 +02:00
Winston Ma 91eb8f4fa0 vulkan: Fix memory logger unsafe iterator access (#23667) 2026-05-28 12:46:07 +02:00
Markus Tavenrath d205df6812 server, ui : Add support for HTTP ETags in llama-server (#23701)
* allow caching of ui elements in llama-server

* use fnv_hash

* Update tools/server/server-http.cpp

etag has to be set always

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-05-28 12:21:24 +02:00
Sachin Sharma e8d2567429 docker : add ZenDNN Dockerfile (#23716) 2026-05-28 11:40:49 +02:00
201 changed files with 11926 additions and 3818 deletions
+28 -1
View File
@@ -3,6 +3,7 @@
glibc,
config,
stdenv,
stdenvNoCC,
runCommand,
cmake,
ninja,
@@ -19,6 +20,8 @@
openssl,
shaderc,
spirv-headers,
nodejs,
importNpmLock,
useBlas ?
builtins.all (x: !x) [
useCuda
@@ -130,7 +133,31 @@ effectiveStdenv.mkDerivation (finalAttrs: {
src = lib.cleanSource ../../.;
};
postPatch = ''
# Builds the webui locally, taking care not to require updating any sha256 hash.
webui = stdenvNoCC.mkDerivation {
pname = "webui";
version = llamaVersion;
src = lib.cleanSource ../../tools/ui;
nativeBuildInputs = [
nodejs
importNpmLock.linkNodeModulesHook
];
# no sha256 required when using buildNodeModules
npmDeps = importNpmLock.buildNodeModules {
npmRoot = ../../tools/ui;
inherit nodejs;
};
installPhase = ''
LLAMA_UI_OUT_DIR=$out npm run build --offline
'';
};
postPatch = lib.optionalString useWebUi ''
cp -r ${finalAttrs.webui} tools/ui/dist
chmod -R u+w tools/ui/dist
'';
# With PR#6015 https://github.com/ggml-org/llama.cpp/pull/6015,
+101
View File
@@ -0,0 +1,101 @@
ARG UBUNTU_VERSION=24.04
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
FROM ubuntu:$UBUNTU_VERSION AS build
RUN apt-get update && \
apt-get install -y gcc-13 g++-13 build-essential git cmake libssl-dev libomp-dev libnuma-dev python3 ca-certificates
ENV CC=gcc-13 CXX=g++-13
WORKDIR /app
COPY . .
RUN cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_TESTS=OFF -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_ZENDNN=ON && \
cmake --build build -j $(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so*" -exec cp -P {} /app/lib \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
## Base image
FROM ubuntu:$UBUNTU_VERSION AS base
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE
RUN apt-get update \
&& apt-get install -y libgomp1 libnuma1 curl \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
### Full
FROM base AS full
COPY --from=build /app/full /app
WORKDIR /app
RUN apt-get update \
&& apt-get install -y \
git \
python3 \
python3-pip \
python3-wheel \
&& pip install --break-system-packages --upgrade setuptools \
&& pip install --break-system-packages -r requirements.txt \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app/full/llama-completion /app
WORKDIR /app
ENTRYPOINT [ "/app/llama-cli" ]
### Server, Server only
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
+22
View File
@@ -0,0 +1,22 @@
name: "ccache-clear"
description: "Delete all GitHub Actions caches matching a key prefix"
inputs:
key:
description: "Cache key prefix to match and delete"
required: true
runs:
using: "composite"
steps:
- name: Clear caches
shell: bash
run: |
CACHES=$(gh cache list --key "ccache-${{ inputs.key }}" --json id,key --jq '.[] | "\(.id) \(.key)"' 2>/dev/null)
if [ -z "$CACHES" ]; then
echo "No caches found with key prefix: ${{ inputs.key }}"
exit 0
fi
while read -r id key; do
echo "Deleting cache: $id ($key)"
gh cache delete "$id"
done <<< "$CACHES"
-34
View File
@@ -109,40 +109,6 @@ jobs:
cd build
ctest -L main --verbose --timeout 900
macos-latest-ios:
runs-on: macos-latest
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
# TODO: this likely does not do anything - if yes, remove it
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: apple-ios
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
sysctl -a
cmake -B build -G Xcode \
-DGGML_METAL_USE_BF16=ON \
-DGGML_METAL_EMBED_LIBRARY=ON \
-DLLAMA_BUILD_APP=OFF \
-DLLAMA_BUILD_COMMON=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_TOOLS=OFF \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-DCMAKE_SYSTEM_NAME=iOS \
-DCMAKE_OSX_DEPLOYMENT_TARGET=14.0 \
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu) -- CODE_SIGNING_ALLOWED=NO
macos-latest-ios-xcode:
runs-on: macos-latest
+1 -17
View File
@@ -14,14 +14,6 @@ on:
'**/*.hpp',
'**/*.c',
'**/*.cpp',
'**/*.cu',
'**/*.cuh',
'**/*.swift',
'**/*.m',
'**/*.metal',
'**/*.comp',
'**/*.glsl',
'**/*.wgsl'
]
pull_request:
@@ -34,15 +26,7 @@ on:
'**/*.h',
'**/*.hpp',
'**/*.c',
'**/*.cpp',
'**/*.cu',
'**/*.cuh',
'**/*.swift',
'**/*.m',
'**/*.metal',
'**/*.comp',
'**/*.glsl',
'**/*.wgsl'
'**/*.cpp'
]
concurrency:
+18 -2
View File
@@ -13,6 +13,7 @@ concurrency:
queue: max
env:
GH_TOKEN: ${{ github.token }}
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_ARG_LOG_COLORS: 1
@@ -23,6 +24,9 @@ jobs:
cuda:
runs-on: windows-2022
permissions:
actions: write
strategy:
matrix:
cuda: ['12.4', '13.3']
@@ -36,7 +40,6 @@ jobs:
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-windows-2022-x64-cuda-${{ matrix.cuda }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Install Cuda Toolkit
uses: ./.github/actions/windows-setup-cuda
@@ -67,9 +70,17 @@ jobs:
cmake --build build --config Release -j %NINJA_JOBS% -t ggml
cmake --build build --config Release
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
key: release-windows-2022-x64-cuda-${{ matrix.cuda }}
hip:
runs-on: windows-2022
permissions:
actions: write
env:
# Make sure this is in sync with build-cache.yml
HIPSDK_INSTALLER_VERSION: "26.Q1"
@@ -125,7 +136,6 @@ jobs:
# to populate the ccache for the release with manual runs of this workflow
#key: release-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
key: cuda-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Build
id: cmake_build
@@ -144,3 +154,9 @@ jobs:
-DGPU_TARGETS="gfx1100" `
-DGGML_RPC=ON
cmake --build build -j ${env:NUMBER_OF_PROCESSORS}
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
#key: release-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
key: cuda-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
+12 -36
View File
@@ -35,24 +35,12 @@ env:
jobs:
ubuntu-24-openvino:
name: ubuntu-24-openvino-${{ matrix.openvino_device }}
runs-on: [self-hosted, Linux, Intel, OpenVINO]
concurrency:
group: openvino-${{ matrix.variant }}-${{ github.head_ref || github.ref }}
group: openvino-gpu-${{ github.head_ref || github.ref }}
cancel-in-progress: false
strategy:
matrix:
include:
- variant: cpu
runner: '"ubuntu-24.04"'
openvino_device: "CPU"
- variant: gpu
runner: '["self-hosted","Linux","Intel","OpenVINO"]'
openvino_device: "GPU"
runs-on: ${{ fromJSON(matrix.runner) }}
env:
# Sync versions in build-openvino.yml, build-self-hosted.yml, release.yml, build-cache.yml, .devops/openvino.Dockerfile
OPENVINO_VERSION_MAJOR: "2026.0"
@@ -63,14 +51,6 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: ccache
if: runner.environment == 'github-hosted'
uses: ggml-org/ccache-action@v1.2.21
with:
key: openvino-ubuntu-24.04-${{ matrix.variant }}-no-preset-v1
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
@@ -78,16 +58,7 @@ jobs:
sudo apt-get install -y build-essential libssl-dev libtbb12 cmake ninja-build python3-pip
sudo apt-get install -y ocl-icd-opencl-dev opencl-headers opencl-clhpp-headers intel-opencl-icd
- name: Use OpenVINO Toolkit Cache
if: runner.environment == 'github-hosted'
uses: actions/cache@v5
id: cache-openvino
with:
path: ./openvino_toolkit
key: cache-gha-openvino-toolkit-v${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
- name: Setup OpenVINO Toolkit
if: steps.cache-openvino.outputs.cache-hit != 'true'
uses: ./.github/actions/linux-setup-openvino
with:
path: ./openvino_toolkit
@@ -109,12 +80,17 @@ jobs:
-DGGML_OPENVINO=ON
time cmake --build build/ReleaseOV --config Release -j $(nproc)
- name: Test
id: cmake_test
- name: Test (CPU)
id: cmake_test_cpu
# TODO: fix and re-enable the `test-llama-archs` test below
run: |
cd ${{ github.workspace }}
if [ "${{ matrix.openvino_device }}" = "GPU" ]; then
export GGML_OPENVINO_DEVICE=GPU
fi
ctest --test-dir build/ReleaseOV -L main -E "test-llama-archs" --verbose --timeout 2000
- name: Test (GPU)
id: cmake_test_gpu
# TODO: fix and re-enable the `test-llama-archs` test below
run: |
cd ${{ github.workspace }}
export GGML_OPENVINO_DEVICE=GPU
ctest --test-dir build/ReleaseOV -L main -E "test-llama-archs" --verbose --timeout 2000
+2 -2
View File
@@ -34,8 +34,8 @@ env:
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-latest-rpc:
runs-on: ubuntu-latest
ubuntu-24-rpc:
runs-on: ${{ 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
continue-on-error: true
+7 -20
View File
@@ -210,7 +210,7 @@ jobs:
GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
gpu-vulkan:
gpu-vulkan-apple:
runs-on: [self-hosted, macOS, ARM64]
steps:
@@ -261,7 +261,7 @@ jobs:
# a valid python environment for testing
LLAMA_FATAL_WARNINGS=OFF GG_BUILD_NINJA=1 GG_BUILD_VULKAN=1 GG_BUILD_LOW_PERF=1 ./ci/run.sh ./results/llama.cpp ./mnt/llama.cpp
cpu-openvino-low-perf:
gpu-openvino-low-perf:
runs-on: [self-hosted, Linux, Intel, OpenVINO]
concurrency:
@@ -297,8 +297,8 @@ jobs:
source ./openvino_toolkit/setupvars.sh
GG_BUILD_OPENVINO=1 GGML_OPENVINO_DEVICE=GPU GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
cpu-any-low-perf:
runs-on: [self-hosted, CPU]
cpu-x64-high-perf:
runs-on: [self-hosted, Linux, X64]
steps:
- name: Clone
@@ -308,22 +308,9 @@ jobs:
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
cpu-any-high-perf:
runs-on: [self-hosted, CPU]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_NO_SVE=1 GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
cpu-arm64-graviton4:
cpu-arm64-high-perf-graviton4:
runs-on: ah-ubuntu_22_04-c8g_8x
steps:
@@ -360,7 +347,7 @@ jobs:
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
cpu-arm64-graviton4-kleidiai:
runs-on: ah-ubuntu_22_04-c8g_8x
+18 -28
View File
@@ -36,30 +36,14 @@ env:
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu:
strategy:
matrix:
include:
- build: 'x64'
os: ubuntu-24.04
- build: 'arm64'
os: ubuntu-24.04-arm
runs-on: ${{ matrix.os }}
ubuntu-arm64:
runs-on: ubuntu-24.04-arm
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: vulkan-${{ matrix.os }}
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
@@ -68,14 +52,20 @@ jobs:
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: vulkan-ubuntu-24.04-arm-new
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Configure
id: cmake_configure
run: |
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DGGML_BACKEND_DL=ON \
-DGGML_CPU_ALL_VARIANTS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_VULKAN=ON
- name: Build
@@ -91,13 +81,6 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: vulkan-ubuntu-24.04-llvmpipe
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
@@ -124,6 +107,13 @@ jobs:
path: ./vulkan_sdk
version: ${{ env.VULKAN_SDK_VERSION }}
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: vulkan-ubuntu-24.04-llvmpipe
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
+2 -10
View File
@@ -130,15 +130,7 @@ jobs:
ctest -L main -E test-backend-ops --verbose --timeout 900
ubuntu-wasm:
strategy:
matrix:
include:
- build: 'x64'
os: ubuntu-24.04
- build: 'arm64'
os: ubuntu-24.04-arm
runs-on: ${{ matrix.os }}
runs-on: ubuntu-24.04-arm
steps:
- name: Clone
@@ -148,7 +140,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: webgpu-${{ matrix.os }}-wasm
key: webgpu-ubuntu-24.04-arm-wasm
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
+151 -81
View File
@@ -28,6 +28,7 @@ on:
]
env:
GH_TOKEN: ${{ github.token }}
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
CMAKE_ARGS: "-DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON"
@@ -37,7 +38,7 @@ concurrency:
queue: max
jobs:
check_release:
check-release:
runs-on: ubuntu-slim
outputs:
@@ -59,14 +60,14 @@ jobs:
fi
macos-cpu:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
strategy:
matrix:
include:
- build: 'arm64'
arch: 'arm64'
os: macos-14
os: macos-26
defines: "-DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON"
# TODO: this build is disabled to save Github Actions resources (https://github.com/ggml-org/llama.cpp/pull/23780)
# in order to enable it again, we have to provision dedicated runners to run it
@@ -83,6 +84,9 @@ jobs:
runs-on: ${{ matrix.os }}
permissions:
actions: write
steps:
- name: Clone
id: checkout
@@ -101,7 +105,6 @@ jobs:
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-${{ matrix.os }}-${{ matrix.arch }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Build
id: cmake_build
@@ -116,6 +119,11 @@ jobs:
${{ env.CMAKE_ARGS }}
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
key: release-${{ matrix.os }}-${{ matrix.arch }}
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
@@ -133,8 +141,8 @@ jobs:
name: llama-bin-macos-${{ matrix.build }}.tar.gz
ubuntu-cpu:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
strategy:
matrix:
include:
@@ -147,6 +155,9 @@ jobs:
runs-on: ${{ matrix.os }}
permissions:
actions: write
steps:
- name: Clone
id: checkout
@@ -161,13 +172,6 @@ jobs:
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: ccache
if: ${{ matrix.build != 's390x' }}
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-${{ matrix.os }}-cpu
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Dependencies
id: depends
run: |
@@ -181,6 +185,12 @@ jobs:
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
- name: ccache
if: ${{ matrix.build != 's390x' }}
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-${{ matrix.os }}-cpu
- name: Build
id: cmake_build
run: |
@@ -194,6 +204,12 @@ jobs:
${{ env.CMAKE_ARGS }}
cmake --build build --config Release -j $(nproc)
- name: ccache-clear
if: ${{ matrix.build != 's390x' }}
uses: ./.github/actions/ccache-clear
with:
key: release-${{ matrix.os }}-cpu
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
@@ -211,8 +227,8 @@ jobs:
name: llama-bin-ubuntu-${{ matrix.build }}.tar.gz
ubuntu-vulkan:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
strategy:
matrix:
@@ -224,6 +240,9 @@ jobs:
runs-on: ${{ matrix.os }}
permissions:
actions: write
steps:
- name: Clone
id: checkout
@@ -238,12 +257,6 @@ jobs:
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-${{ matrix.os }}-vulkan
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Dependencies
id: depends
run: |
@@ -259,6 +272,11 @@ jobs:
echo "CXX=g++-14" >> "$GITHUB_ENV"
fi
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-${{ matrix.os }}-vulkan
- name: Build
id: cmake_build
run: |
@@ -272,6 +290,11 @@ jobs:
${{ env.CMAKE_ARGS }}
cmake --build build --config Release -j $(nproc)
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
key: release-${{ matrix.os }}-vulkan
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
@@ -289,11 +312,14 @@ jobs:
name: llama-bin-ubuntu-vulkan-${{ matrix.build }}.tar.gz
android-arm64:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
runs-on: ubuntu-latest
#permissions:
# actions: write
env:
NDK_VERSION: "29.0.14206865"
@@ -311,18 +337,6 @@ jobs:
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
# note : disabled to spare some cache space (https://github.com/ggml-org/llama.cpp/pull/23789)
# for some reason, the ccache does not improve the build time in this case
# example:
# cache off: https://github.com/ggerganov/tmp2/actions/runs/26534713799/job/78160400831
# cache on: https://github.com/ggerganov/tmp2/actions/runs/26534713799/job/78224189394
#
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: release-android-arm64
# append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Set up JDK
uses: actions/setup-java@v5
with:
@@ -339,6 +353,17 @@ jobs:
sdkmanager "ndk;${{ env.NDK_VERSION }}"
echo "ANDROID_NDK=${ANDROID_SDK_ROOT}/ndk/${{ env.NDK_VERSION }}" >> $GITHUB_ENV
# note : disabled to spare some cache space (https://github.com/ggml-org/llama.cpp/pull/23789)
# for some reason, the ccache does not improve the build time in this case
# example:
# cache off: https://github.com/ggerganov/tmp2/actions/runs/26534713799/job/78160400831
# cache on: https://github.com/ggerganov/tmp2/actions/runs/26534713799/job/78224189394
#
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: release-android-arm64
- name: Build
id: cmake_build
run: |
@@ -357,6 +382,11 @@ jobs:
${{ env.CMAKE_ARGS }}
cmake --build build --config Release -j $(nproc)
#- name: ccache-clear
# uses: ./.github/actions/ccache-clear
# with:
# key: release-android-arm64
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
@@ -374,11 +404,14 @@ jobs:
name: llama-bin-android-arm64.tar.gz
ubuntu-24-openvino:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
runs-on: ubuntu-24.04
permissions:
actions: write
outputs:
openvino_version: ${{ steps.openvino_version.outputs.value }}
@@ -409,7 +442,6 @@ jobs:
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-ubuntu-24.04-openvino-release-no-preset-v1
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Dependencies
run: |
@@ -447,6 +479,11 @@ jobs:
-DGGML_OPENVINO=ON
cmake --build build/ReleaseOV --config Release -j $(nproc)
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
key: release-ubuntu-24.04-openvino-release-no-preset-v1
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
@@ -464,11 +501,14 @@ jobs:
name: llama-bin-ubuntu-openvino-${{ env.OPENVINO_VERSION_MAJOR }}-x64.tar.gz
windows-cpu:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
runs-on: windows-2025
permissions:
actions: write
strategy:
matrix:
include:
@@ -488,15 +528,14 @@ jobs:
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: Install Ninja
run: |
choco install ninja
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-windows-2025-${{ matrix.arch }}-cpu
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Install Ninja
run: |
choco install ninja
- name: Build
shell: cmd
@@ -512,6 +551,11 @@ jobs:
${{ env.CMAKE_ARGS }}
cmake --build build --config Release
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
key: release-windows-2025-${{ matrix.arch }}-cpu
- name: Pack artifacts
id: pack_artifacts
run: |
@@ -525,11 +569,14 @@ jobs:
name: llama-bin-win-cpu-${{ matrix.arch }}.zip
windows:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
runs-on: windows-2025
permissions:
actions: write
env:
OPENBLAS_VERSION: 0.3.23
VULKAN_VERSION: 1.4.313.2
@@ -558,12 +605,6 @@ jobs:
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-windows-2025-${{ matrix.arch }}-${{ matrix.backend }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Install Vulkan SDK
id: get_vulkan
if: ${{ matrix.backend == 'vulkan' }}
@@ -578,6 +619,11 @@ jobs:
run: |
choco install ninja
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-windows-2025-${{ matrix.arch }}-${{ matrix.backend }}
- name: Install OpenCL Headers and Libs
id: install_opencl
if: ${{ matrix.backend == 'opencl-adreno' && matrix.arch == 'arm64' }}
@@ -604,6 +650,11 @@ jobs:
cmake -S . -B build ${{ matrix.defines }} -DGGML_NATIVE=OFF -DGGML_CPU=OFF -DGGML_BACKEND_DL=ON -DLLAMA_BUILD_BORINGSSL=ON
cmake --build build --config Release --target ${{ matrix.target }}
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
key: release-windows-2025-${{ matrix.arch }}-${{ matrix.backend }}
- name: Pack artifacts
id: pack_artifacts
run: |
@@ -616,11 +667,14 @@ jobs:
name: llama-bin-win-${{ matrix.backend }}-${{ matrix.arch }}.zip
windows-cuda:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
runs-on: windows-2022
permissions:
actions: write
strategy:
matrix:
cuda: ['12.4', '13.3']
@@ -637,12 +691,6 @@ jobs:
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-windows-2022-x64-cuda-${{ matrix.cuda }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Install Cuda Toolkit
uses: ./.github/actions/windows-setup-cuda
with:
@@ -653,6 +701,11 @@ jobs:
run: |
choco install ninja
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-windows-2022-x64-cuda-${{ matrix.cuda }}
- name: Build
id: cmake_build
shell: cmd
@@ -669,6 +722,11 @@ jobs:
set /A NINJA_JOBS=%NUMBER_OF_PROCESSORS%-1
cmake --build build --config Release -j %NINJA_JOBS% --target ggml-cuda
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
key: release-windows-2022-x64-cuda-${{ matrix.cuda }}
- name: Pack artifacts
id: pack_artifacts
run: |
@@ -748,7 +806,6 @@ jobs:
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: release-windows-2022-x64-sycl
# append-timestamp: false # note: use this only with non-concurrent jobs!
#
# - name: Build
# id: cmake_build
@@ -869,7 +926,6 @@ jobs:
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: release-ubuntu-24.04-sycl
# append-timestamp: false # note: use this only with non-concurrent jobs!
#
# - name: Build
# id: cmake_build
@@ -903,11 +959,14 @@ jobs:
# name: llama-bin-ubuntu-sycl-${{ matrix.build }}-x64.tar.gz
ubuntu-22-rocm:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
runs-on: ubuntu-22.04
permissions:
actions: write
strategy:
matrix:
include:
@@ -938,7 +997,6 @@ jobs:
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-ubuntu-22.04-rocm-${{ matrix.ROCM_VERSION }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Dependencies
id: depends
@@ -996,6 +1054,11 @@ jobs:
${{ env.CMAKE_ARGS }}
cmake --build build --config Release -j $(nproc)
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
key: release-ubuntu-22.04-rocm-${{ matrix.ROCM_VERSION }}
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
@@ -1016,11 +1079,14 @@ jobs:
name: llama-bin-ubuntu-rocm-${{ env.ROCM_VERSION_SHORT }}-${{ matrix.build }}.tar.gz
windows-hip:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
runs-on: windows-2022
permissions:
actions: write
env:
HIPSDK_INSTALLER_VERSION: "26.Q1"
@@ -1060,7 +1126,6 @@ jobs:
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Install ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'
@@ -1120,6 +1185,11 @@ jobs:
cp "${env:HIP_PATH}\bin\rocblas\library\*" "build\bin\rocblas\library\"
cp "${env:HIP_PATH}\bin\hipblaslt\library\*" "build\bin\hipblaslt\library\"
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
key: release-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
- name: Pack artifacts
id: pack_artifacts
run: |
@@ -1131,10 +1201,10 @@ jobs:
path: llama-bin-win-hip-${{ matrix.name }}-x64.zip
name: llama-bin-win-hip-${{ matrix.name }}-x64.zip
ios-xcode-build:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
runs-on: macos-15
ios-xcode:
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
runs-on: macos-26
steps:
- name: Checkout code
@@ -1144,7 +1214,7 @@ jobs:
- name: Setup Xcode
run: |
sudo xcode-select -s /Applications/Xcode_16.4.app
sudo xcode-select -s /Applications/Xcode_26.4.app
- name: Build
id: cmake_build
@@ -1160,7 +1230,7 @@ jobs:
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-DCMAKE_SYSTEM_NAME=iOS \
-DCMAKE_OSX_DEPLOYMENT_TARGET=14.0 \
-DCMAKE_OSX_DEPLOYMENT_TARGET=16.0 \
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu) -- CODE_SIGNING_ALLOWED=NO
@@ -1281,9 +1351,9 @@ jobs:
# path: llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}${{ matrix.use_acl_graph == 'on' && '-aclgraph' || '' }}.tar.gz
# name: llama-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}${{ matrix.use_acl_graph == 'on' && '-aclgraph' || '' }}.tar.gz
ui-build:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
ui:
needs: [check-release]
if: ${{ needs.check-release.outputs.should_release == 'true' }}
uses: ./.github/workflows/ui-build.yml
release:
@@ -1309,9 +1379,9 @@ jobs:
#- ubuntu-24-sycl
- android-arm64
- macos-cpu
- ios-xcode-build
- ios-xcode
#- openEuler-cann
- ui-build
- ui
outputs:
tag_name: ${{ steps.tag.outputs.name }}
+20 -23
View File
@@ -55,21 +55,7 @@ concurrency:
jobs:
ubuntu:
runs-on: ubuntu-24.04
name: ubuntu (${{ matrix.wf_name }})
strategy:
matrix:
build_type: [Release]
wf_name: ["default"]
include:
- build_type: Release
extra_args: ""
wf_name: "default"
- build_type: Release
extra_args: "LLAMA_ARG_BACKEND_SAMPLING=1"
wf_name: "backend-sampling"
fail-fast: false
runs-on: ubuntu-24.04-arm
steps:
- name: Dependencies
@@ -96,7 +82,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: server-ubuntu-24.04-x64
key: server-ubuntu-24.04-arm
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -105,7 +91,7 @@ jobs:
run: |
cmake -B build \
-DGGML_SCHED_NO_REALLOC=ON
cmake --build build --config ${{ matrix.build_type }} -j $(nproc) --target llama-server
cmake --build build --config Release -j $(nproc) --target llama-server
- name: Python setup
id: setup_python
@@ -116,18 +102,30 @@ jobs:
- name: Tests
id: server_integration_tests
if: ${{ (!matrix.disabled_on_pr || !github.event.pull_request) }}
run: |
cd tools/server/tests
export ${{ matrix.extra_args }}
pytest -v -x -m "not slow"
- name: Slow tests
id: server_integration_tests_slow
if: ${{ (github.event.schedule || github.event.inputs.slow_tests == 'true') && matrix.build_type == 'Release' }}
if: ${{ github.event.schedule || github.event.inputs.slow_tests == 'true' }}
run: |
cd tools/server/tests
export ${{ matrix.extra_args }}
SLOW_TESTS=1 pytest -v -x
- name: Tests (Backend sampling)
id: server_integration_tests_backend_sampling
run: |
cd tools/server/tests
export LLAMA_ARG_BACKEND_SAMPLING=1
pytest -v -x -m "not slow"
- name: Slow tests (Backend sampling)
id: server_integration_tests_slow_backend_sampling
if: ${{ github.event.schedule || github.event.inputs.slow_tests == 'true' }}
run: |
cd tools/server/tests
export LLAMA_ARG_BACKEND_SAMPLING=1
SLOW_TESTS=1 pytest -v -x
windows:
@@ -169,7 +167,6 @@ jobs:
- name: Tests
id: server_integration_tests
if: ${{ !matrix.disabled_on_pr || !github.event.pull_request }}
run: |
cd tools/server/tests
$env:PYTHONIOENCODING = ":replace"
@@ -177,7 +174,7 @@ jobs:
- name: Slow tests
id: server_integration_tests_slow
if: ${{ (github.event.schedule || github.event.inputs.slow_tests == 'true') && matrix.build_type == 'Release' }}
if: ${{ github.event.schedule || github.event.inputs.slow_tests == 'true' }}
run: |
cd tools/server/tests
$env:SLOW_TESTS = "1"
@@ -0,0 +1,43 @@
name: UI Build (self-hosted)
on:
workflow_call:
jobs:
build:
runs-on: [self-hosted, fast]
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
steps:
- name: Checkout code
uses: actions/checkout@v6
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: "24"
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: Install dependencies
run: npm ci
working-directory: tools/ui
- name: Build application
run: npm run build
working-directory: tools/ui
- name: Generate checksums
run: |
cd tools/ui/dist
for f in *; do
sha256sum "$f" | awk '{print $1, $2}' >> checksums.txt
done
- name: Upload built UI
uses: actions/upload-artifact@v6
with:
name: ui-build
path: tools/ui/dist/
retention-days: 1
+1 -1
View File
@@ -5,7 +5,7 @@ on:
jobs:
build:
runs-on: [self-hosted, fast]
runs-on: ubuntu-slim
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
+1 -1
View File
@@ -20,7 +20,7 @@ jobs:
publish:
name: Publish UI Static Output
needs: build
runs-on: ubuntu-24.04-arm
runs-on: ubuntu-slim
permissions:
contents: read
+3 -3
View File
@@ -16,7 +16,7 @@ on:
- master
paths: [
'.github/workflows/ui-self-hosted.yml',
'.github/workflows/ui-build.yml',
'.github/workflows/ui-build-self-hosted.yml',
'tools/ui/**.*',
'tools/server/tests/**.*'
]
@@ -24,7 +24,7 @@ on:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/ui-self-hosted.yml',
'.github/workflows/ui-build.yml',
'.github/workflows/ui-build-self-hosted.yml',
'tools/ui/**.*',
'tools/server/tests/**.*'
]
@@ -42,7 +42,7 @@ concurrency:
jobs:
ui-build:
name: Build static output
uses: ./.github/workflows/ui-build.yml
uses: ./.github/workflows/ui-build-self-hosted.yml
ui-checks:
name: Checks
-13
View File
@@ -222,19 +222,6 @@ if (LLAMA_BUILD_APP)
add_subdirectory(app)
endif()
# Automatically add all files from the 'licenses' directory
file(GLOB EXTRA_LICENSES "${CMAKE_SOURCE_DIR}/licenses/LICENSE-*")
foreach(FILE_PATH ${EXTRA_LICENSES})
get_filename_component(FILE_NAME "${FILE_PATH}" NAME)
string(REGEX REPLACE "^LICENSE-" "" NAME "${FILE_NAME}")
license_add_file("${NAME}" "${FILE_PATH}")
endforeach()
if (LLAMA_BUILD_COMMON)
license_generate(llama-common)
endif()
#
# install
#
+5 -5
View File
@@ -12,16 +12,16 @@
## Reporting a vulnerability
> [!IMPORTANT]
> The private security disclosure program is disabled until further notice. Please submit patches with fixes directly to the repo as public PRs. Emails will be ignored.
If you have discovered a security vulnerability in this project that falls inside the [covered topics](#covered-topics), please report it privately. **Do not disclose it as a public issue.** This gives us time to work with you to fix the issue before public exposure, reducing the chance that the exploit will be used before a patch is released.
Please disclose it as a private [security advisory](https://github.com/ggml-org/llama.cpp/security/advisories/new).
A team of volunteers on a reasonable-effort basis maintains this project. As such, please give us at least 90 days to work on a fix before public exposure.
> [!IMPORTANT]
> For collaborators: if you are interested in helping out with reviewing private security disclosures, please see: https://github.com/ggml-org/llama.cpp/discussions/18080
## Requirements
### Requirements
Before submitting your report, ensure you meet the following requirements:
@@ -31,7 +31,7 @@ Before submitting your report, ensure you meet the following requirements:
Maintainers reserve the right to close the report if these requirements are not fulfilled.
## Covered Topics
### Covered Topics
Only vulnerabilities that fall within these parts of the project are considered valid. For problems falling outside of this list, please report them as issues.
+11
View File
@@ -15,6 +15,17 @@ target_link_libraries(${TARGET} PRIVATE
)
target_compile_features(${TARGET} PRIVATE cxx_std_17)
# Automatically add all files from the 'licenses' directory
file(GLOB EXTRA_LICENSES "${CMAKE_SOURCE_DIR}/licenses/LICENSE-*")
foreach(FILE_PATH ${EXTRA_LICENSES})
get_filename_component(FILE_NAME "${FILE_PATH}" NAME)
string(REGEX REPLACE "^LICENSE-" "" NAME "${FILE_NAME}")
license_add_file("${NAME}" "${FILE_PATH}")
endforeach()
license_generate(${TARGET})
if(LLAMA_TOOLS_INSTALL)
install(TARGETS ${TARGET} RUNTIME)
endif()
+39 -7
View File
@@ -5,6 +5,9 @@
#include <string>
#include <vector>
// embedded data generated by cmake
extern const char * LICENSES[];
// visible
int llama_server(int argc, char ** argv);
int llama_cli(int argc, char ** argv);
@@ -17,8 +20,23 @@ int llama_fit_params(int argc, char ** argv);
int llama_quantize(int argc, char ** argv);
int llama_perplexity(int argc, char ** argv);
// hands the update over to the install script, which downloads and swaps the binary
static int llama_update(int argc, char ** argv) {
(void) argc;
(void) argv;
#if defined(_WIN32)
return system("powershell -NoProfile -ExecutionPolicy Bypass -Command \"irm https://llama.app/install.ps1 | iex\"");
#else
return system("curl -fsSL https://llama.app/install.sh | sh");
#endif
}
static const char * progname;
static int help(int argc, char ** argv);
static int version(int argc, char ** argv);
static int licenses(int argc, char ** argv);
struct command {
const char * name;
@@ -31,14 +49,16 @@ struct command {
static const command cmds[] = {
{"serve", "HTTP API server", {"server"}, false, llama_server },
{"cli", "Command-line interactive interface", {"client"}, false, llama_cli },
{"update", "Update llama to the latest release", {}, false, llama_update },
{"completion", "Text completion", {"complete"}, true, llama_completion },
{"bench", "Benchmark prompt processing and text generation", {}, true, llama_bench },
{"batched-bench", "Benchmark batched decoding performance", {}, true, llama_batched_bench},
{"fit-params", "Compute parameters to fit a model in device memory", {}, true, llama_fit_params },
{"quantize", "Quantize a model", {}, true, llama_quantize },
{"perplexity", "Compute model perplexity and KL divergence", {}, true, llama_perplexity },
{"version", "Show version", {}, true, version },
{"help", "Show available commands", {}, true, help },
{"version", "Show version", {}, false, version },
{"licenses", "Show third-party licenses", {"credits"}, false, licenses },
{"help", "Show available commands", {}, false, help },
};
static int version(int argc, char ** argv) {
@@ -46,17 +66,29 @@ static int version(int argc, char ** argv) {
return 0;
}
static int licenses(int argc, char ** argv) {
for (int i = 0; LICENSES[i]; ++i) {
printf("%s\n", LICENSES[i]);
}
return 0;
}
static int help(int argc, char ** argv) {
const bool show_all = argc >= 2 && std::string(argv[1]) == "all";
printf("Usage: llama <command> [options]\n\nAvailable commands:\n");
printf("Usage: %s <command> [options]\n\nAvailable commands:\n", progname);
for (const auto & cmd : cmds) {
if (show_all || !cmd.hidden) {
printf(" %-15s %s\n", cmd.name, cmd.desc);
}
}
printf("\nRun 'llama <command> --help' for command-specific usage.\n");
printf("\n");
if (!show_all) {
printf("Run '%s help all' to show additional commands.\n", progname);
}
printf("Run '%s <command> --help' for command-specific usage.\n", progname);
return 0;
}
@@ -74,13 +106,13 @@ static bool matches(const std::string & arg, const command & cmd) {
}
int main(int argc, char ** argv) {
progname = argv[0];
const std::string arg = argc >= 2 ? argv[1] : "help";
for (const auto & cmd : cmds) {
if (matches(arg, cmd)) {
// router spawns children through this same binary, it needs the
// subcommand to relaunch as 'llama serve' and not bare options
// keep cmd.name so the router's child processes re-invoke correctly
#ifdef _WIN32
_putenv_s("LLAMA_APP_CMD", cmd.name);
#else
+9 -7
View File
@@ -8,6 +8,7 @@ TVOS_MIN_OS_VERSION=16.4
BUILD_SHARED_LIBS=OFF
LLAMA_BUILD_APP=OFF
LLAMA_BUILD_COMMON=OFF
LLAMA_BUILD_EXAMPLES=OFF
LLAMA_BUILD_TOOLS=OFF
LLAMA_BUILD_TESTS=OFF
@@ -33,6 +34,7 @@ COMMON_CMAKE_ARGS=(
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
-DBUILD_SHARED_LIBS=${BUILD_SHARED_LIBS}
-DLLAMA_BUILD_APP=${LLAMA_BUILD_APP}
-DLLAMA_BUILD_COMMON=${LLAMA_BUILD_COMMON}
-DLLAMA_BUILD_EXAMPLES=${LLAMA_BUILD_EXAMPLES}
-DLLAMA_BUILD_TOOLS=${LLAMA_BUILD_TOOLS}
-DLLAMA_BUILD_TESTS=${LLAMA_BUILD_TESTS}
@@ -416,7 +418,7 @@ cmake -B build-ios-sim -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-S .
cmake --build build-ios-sim --config Release -- -quiet
cmake --build build-ios-sim --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
echo "Building for iOS devices..."
cmake -B build-ios-device -G Xcode \
@@ -430,7 +432,7 @@ cmake -B build-ios-device -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-S .
cmake --build build-ios-device --config Release -- -quiet
cmake --build build-ios-device --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
echo "Building for macOS..."
cmake -B build-macos -G Xcode \
@@ -441,7 +443,7 @@ cmake -B build-macos -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-S .
cmake --build build-macos --config Release -- -quiet
cmake --build build-macos --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
echo "Building for visionOS..."
cmake -B build-visionos -G Xcode \
@@ -456,7 +458,7 @@ cmake -B build-visionos -G Xcode \
-DLLAMA_OPENSSL=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-S .
cmake --build build-visionos --config Release -- -quiet
cmake --build build-visionos --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
echo "Building for visionOS simulator..."
cmake -B build-visionos-sim -G Xcode \
@@ -471,7 +473,7 @@ cmake -B build-visionos-sim -G Xcode \
-DLLAMA_OPENSSL=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-S .
cmake --build build-visionos-sim --config Release -- -quiet
cmake --build build-visionos-sim --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
# Add tvOS builds (might need the same u_int definitions as watchOS and visionOS)
echo "Building for tvOS simulator..."
@@ -487,7 +489,7 @@ cmake -B build-tvos-sim -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-S .
cmake --build build-tvos-sim --config Release -- -quiet
cmake --build build-tvos-sim --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
echo "Building for tvOS devices..."
cmake -B build-tvos-device -G Xcode \
@@ -502,7 +504,7 @@ cmake -B build-tvos-device -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-S .
cmake --build build-tvos-device --config Release -- -quiet
cmake --build build-tvos-device --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
# Setup frameworks and copy binaries and headers
echo "Setting up framework structures..."
+42 -50
View File
@@ -50,8 +50,6 @@
#define LLAMA_MAX_URL_LENGTH 2084 // Maximum URL Length in Chrome: 2083
extern const char * LICENSES[];
using json = nlohmann::ordered_json;
using namespace common_arg_utils;
@@ -342,9 +340,7 @@ struct handle_model_result {
};
static handle_model_result common_params_handle_model(struct common_params_model & model,
const std::string & bearer_token,
bool offline,
bool search_mtp = false) {
const common_download_opts & opts) {
handle_model_result result;
if (!model.docker_repo.empty()) {
@@ -356,10 +352,9 @@ static handle_model_result common_params_handle_model(struct common_params_model
model.hf_file = model.path;
model.path = "";
}
common_download_opts opts;
opts.bearer_token = bearer_token;
opts.offline = offline;
auto download_result = common_download_model(model, opts, true, search_mtp);
common_download_opts hf_opts = opts;
hf_opts.download_mmproj = true; // also look for mmproj when downloading hf model
auto download_result = common_download_model(model, hf_opts);
if (download_result.model_path.empty()) {
throw std::runtime_error("failed to download model from Hugging Face");
@@ -384,9 +379,6 @@ static handle_model_result common_params_handle_model(struct common_params_model
model.path = fs_get_cache_file(string_split<std::string>(f, '/').back());
}
common_download_opts opts;
opts.bearer_token = bearer_token;
opts.offline = offline;
auto download_result = common_download_model(model, opts);
if (download_result.model_path.empty()) {
throw std::runtime_error("failed to download model from " + model.url);
@@ -443,35 +435,49 @@ static bool parse_bool_value(const std::string & value) {
// CLI argument parsing functions
//
void common_params_handle_models(common_params & params, llama_example curr_ex) {
bool common_params_handle_models(common_params & params, llama_example curr_ex) {
const bool spec_type_draft_mtp = std::find(params.speculative.types.begin(),
params.speculative.types.end(),
COMMON_SPECULATIVE_TYPE_DRAFT_MTP) != params.speculative.types.end();
auto res = common_params_handle_model(params.model, params.hf_token, params.offline, spec_type_draft_mtp);
if (params.no_mmproj) {
params.mmproj = {};
} else if (res.found_mmproj && params.mmproj.path.empty() && params.mmproj.url.empty()) {
// optionally, handle mmproj model when -hf is specified
params.mmproj = res.mmproj;
}
// only download mmproj if the current example is using it
for (const auto & ex : mmproj_examples) {
if (curr_ex == ex) {
common_params_handle_model(params.mmproj, params.hf_token, params.offline);
break;
common_download_opts opts;
opts.bearer_token = params.hf_token;
opts.offline = params.offline;
opts.skip_download = params.skip_download;
opts.download_mtp = spec_type_draft_mtp;
try {
auto res = common_params_handle_model(params.model, opts);
if (params.no_mmproj) {
params.mmproj = {};
} else if (res.found_mmproj && params.mmproj.path.empty() && params.mmproj.url.empty()) {
// optionally, handle mmproj model when -hf is specified
params.mmproj = res.mmproj;
}
// only download mmproj if the current example is using it
for (const auto & ex : mmproj_examples) {
if (curr_ex == ex) {
common_params_handle_model(params.mmproj, opts);
break;
}
}
// when --spec-type mtp is set and no draft model was provided explicitly,
// fall back to the MTP head discovered alongside the -hf model
if (spec_type_draft_mtp && res.found_mtp &&
params.speculative.draft.mparams.path.empty() &&
params.speculative.draft.mparams.hf_repo.empty() &&
params.speculative.draft.mparams.url.empty()) {
params.speculative.draft.mparams.path = res.mtp.path;
}
common_params_handle_model(params.speculative.draft.mparams, opts);
common_params_handle_model(params.vocoder.model, opts);
return true;
} catch (const common_skip_download_exception &) {
return false;
} catch (const std::exception &) {
throw;
}
// when --spec-type mtp is set and no draft model was provided explicitly,
// fall back to the MTP head discovered alongside the -hf model
if (spec_type_draft_mtp && res.found_mtp &&
params.speculative.draft.mparams.path.empty() &&
params.speculative.draft.mparams.hf_repo.empty() &&
params.speculative.draft.mparams.url.empty()) {
params.speculative.draft.mparams.path = res.mtp.path;
}
common_params_handle_model(params.speculative.draft.mparams, params.hf_token, params.offline);
common_params_handle_model(params.vocoder.model, params.hf_token, params.offline);
}
static bool common_params_parse_ex(int argc, char ** argv, common_params_context & ctx_arg) {
@@ -1035,11 +1041,9 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
// we define here to make sure it's included in llama-gen-docs
if (ex == LLAMA_EXAMPLE_COMPLETION) {
params.use_jinja = false; // disable jinja by default
} else if (ex == LLAMA_EXAMPLE_MTMD) {
params.use_jinja = false; // disable jinja by default
params.sampling.temp = 0.2; // lower temp by default for better quality
} else if (ex == LLAMA_EXAMPLE_SERVER) {
params.n_parallel = -1; // auto by default
}
@@ -1060,7 +1064,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
sampler_type_names.pop_back(); // remove last semicolon
}
/**
* filter options by example
* rules:
@@ -1074,7 +1077,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
};
add_opt(common_arg(
{"-h", "--help", "--usage"},
"print usage and exit",
@@ -1091,16 +1093,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
exit(0);
}
));
add_opt(common_arg(
{"--license"},
"show source code license and dependencies",
[](common_params &) {
for (int i = 0; LICENSES[i]; ++i) {
printf("%s\n", LICENSES[i]);
}
exit(0);
}
));
add_opt(common_arg(
{"-cl", "--cache-list"},
"show list of models in cache",
@@ -2998,7 +2990,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
key_file.close();
}
).set_examples({LLAMA_EXAMPLE_SERVER}));
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_API_KEY_FILE"));
add_opt(common_arg(
{"--ssl-key-file"}, "FNAME",
"path to file a PEM-encoded SSL private key",
+5 -2
View File
@@ -129,8 +129,11 @@ bool common_params_to_map(int argc, char ** argv, llama_example ex, std::map<com
// see: https://github.com/ggml-org/llama.cpp/issues/18163
void common_params_add_preset_options(std::vector<common_arg> & args);
// Populate model paths (main model, mmproj, etc) from -hf if necessary
void common_params_handle_models(common_params & params, llama_example curr_ex);
// populate model paths (main model, mmproj, etc) from -hf if necessary
// return true if the model is ready to use
// throw an exception if there is an error that prevents the model from being used (e.g. network error, model not found, etc)
// if params.skip_download is true, no downloads will be attempted. return false if the model is invalid or missing (e.g. ETag check failed)
bool common_params_handle_models(common_params & params, llama_example curr_ex);
// initialize argument parser context - used by test-arg-parser and preset
common_params_context common_params_parser_init(common_params & params, llama_example ex, void(*print_usage)(int, char **) = nullptr);
+1
View File
@@ -1563,6 +1563,7 @@ struct llama_context_params common_context_params_to_llama(const common_params &
cparams.n_ctx = params.n_ctx;
cparams.n_seq_max = params.n_parallel;
cparams.n_rs_seq = params.speculative.need_n_rs_seq();
cparams.n_outputs_max = std::max(params.n_outputs_max, 0);
cparams.n_batch = params.n_batch;
cparams.n_ubatch = params.n_ubatch;
cparams.n_threads = params.cpuparams.n_threads;
+4 -2
View File
@@ -431,6 +431,7 @@ struct common_params {
int32_t n_chunks = -1; // max number of chunks to process (-1 = unlimited)
int32_t n_parallel = 1; // number of parallel sequences to decode
int32_t n_sequences = 1; // number of sequences to decode
int32_t n_outputs_max = 0; // max outputs in a batch (0 = n_batch)
int32_t grp_attn_n = 1; // group-attention factor
int32_t grp_attn_w = 512; // group-attention width
int32_t n_print = -1; // print token count every n tokens (-1 = disabled)
@@ -479,7 +480,7 @@ struct common_params {
std::set<std::string> model_alias; // model aliases // NOLINT
std::set<std::string> model_tags; // model tags (informational, not used for routing) // NOLINT
std::string hf_token = ""; // HF token // NOLINT
std::string hf_token = ""; // HF token (aka bearer token) // NOLINT
std::string prompt = ""; // NOLINT
std::string system_prompt = ""; // NOLINT
std::string prompt_file = ""; // store the external prompt file name // NOLINT
@@ -507,6 +508,7 @@ struct common_params {
int32_t control_vector_layer_start = -1; // layer range for control vector
int32_t control_vector_layer_end = -1; // layer range for control vector
bool offline = false;
bool skip_download = false; // skip model file downloading
int32_t ppl_stride = 0; // stride for perplexity calculations. If left at 0, the pre-existing approach will be used.
int32_t ppl_output_type = 0; // = 0 -> ppl output is as usual, = 1 -> ppl output is num_tokens, ppl, one per line
@@ -587,7 +589,7 @@ struct common_params {
// server params
int32_t port = 8080; // server listens on this network port
bool reuse_port = false; // allow multiple sockets to bind to the same port
int32_t timeout_read = 600; // http read timeout in seconds
int32_t timeout_read = 3600; // http read timeout in seconds
int32_t timeout_write = timeout_read; // http write timeout in seconds
int32_t n_threads_http = -1; // number of threads to process HTTP requests (TODO: support threadpool)
int32_t n_cache_reuse = 0; // min chunk size to reuse from the cache via KV shifting
+19 -7
View File
@@ -292,6 +292,10 @@ static int common_download_file_single_online(const std::string & url,
const bool file_exists = std::filesystem::exists(path);
if (!file_exists && opts.skip_download) {
return -2; // file is missing and download is disabled
}
if (file_exists && skip_etag) {
LOG_DBG("%s: using cached file: %s\n", __func__, path.c_str());
return 304; // 304 Not Modified - fake cached response
@@ -357,6 +361,10 @@ static int common_download_file_single_online(const std::string & url,
LOG_DBG("%s: using cached file (same etag): %s\n", __func__, path.c_str());
return 304; // 304 Not Modified - fake cached response
}
// pass this point, the file exists but is different from the server version, so we need to redownload it
if (opts.skip_download) {
return -2; // special code to indicate that the download was skipped due to etag mismatch
}
if (remove(path.c_str()) != 0) {
LOG_ERR("%s: unable to delete file: %s\n", __func__, path.c_str());
return -1;
@@ -775,13 +783,13 @@ static std::vector<download_task> get_url_tasks(const common_params_model & mode
}
common_download_model_result common_download_model(const common_params_model & model,
const common_download_opts & opts,
bool download_mmproj,
bool download_mtp) {
const common_download_opts & opts) {
common_download_model_result result;
std::vector<download_task> tasks;
hf_plan hf;
bool download_mmproj = opts.download_mmproj;
bool download_mtp = opts.download_mtp;
bool is_hf = !model.hf_repo.empty();
if (is_hf) {
@@ -806,18 +814,22 @@ common_download_model_result common_download_model(const common_params_model &
return result;
}
std::vector<std::future<bool>> futures;
std::vector<std::future<int>> futures;
for (const auto & task : tasks) {
futures.push_back(std::async(std::launch::async,
[&task, &opts, is_hf]() {
int status = common_download_file_single(task.url, task.path, opts, is_hf);
return is_http_status_ok(status);
return common_download_file_single(task.url, task.path, opts, is_hf);
}
));
}
for (auto & f : futures) {
if (!f.get()) {
int status = f.get();
if (status == -2 && opts.skip_download) {
throw common_skip_download_exception();
}
bool is_ok = is_http_status_ok(status);
if (!is_ok) {
return {};
}
}
+10 -3
View File
@@ -52,6 +52,9 @@ struct common_download_opts {
std::string bearer_token;
common_header_list headers;
bool offline = false;
bool skip_download = false; // if true, only validation is performed, common_skip_download_exception may be thrown if the file is missing or invalid
bool download_mmproj = false;
bool download_mtp = false;
common_download_callback * callback = nullptr;
};
@@ -62,6 +65,11 @@ struct common_download_model_result {
std::string mtp_path;
};
// throw if the file is missing or invalid (e.g. ETag check failed)
struct common_skip_download_exception : public std::runtime_error {
common_skip_download_exception() : std::runtime_error("skip download") {}
};
// Download model from HuggingFace repo or URL
//
// input (via model struct):
@@ -89,9 +97,7 @@ struct common_download_model_result {
// returns result with model_path, mmproj_path and mtp_path (empty when not found / on failure)
common_download_model_result common_download_model(
const common_params_model & model,
const common_download_opts & opts = {},
bool download_mmproj = false,
bool download_mtp = false
const common_download_opts & opts = {}
);
// returns list of cached models
@@ -99,6 +105,7 @@ std::vector<common_cached_model_info> common_list_cached_models();
// download single file from url to local path
// returns status code or -1 on error
// returns -2 if the download was skipped due to ETag mismatch (file outdated, skip_download=true)
// skip_etag: if true, don't read/write .etag files (for HF cache where filename is the hash)
int common_download_file_single(const std::string & url,
const std::string & path,
+2
View File
@@ -1,5 +1,7 @@
#include "ngram-mod.h"
#include <algorithm>
//
// common_ngram_mod
//
+21
View File
@@ -247,3 +247,24 @@ common_reasoning_budget_state common_reasoning_budget_get_state(const struct lla
}
return ((const common_reasoning_budget_ctx *)smpl->ctx)->state;
}
bool common_reasoning_budget_force(struct llama_sampler * smpl) {
if (!smpl) {
return false;
}
auto * ctx = (common_reasoning_budget_ctx *) smpl->ctx;
// only a sampler that is actively counting down the budget may be forced;
// any other state (idle, already forcing/waiting, or done) is left untouched
if (ctx->state != REASONING_BUDGET_COUNTING) {
return false;
}
ctx->state = REASONING_BUDGET_FORCING;
ctx->force_pos = 0;
ctx->end_matcher.reset();
LOG_INF("reasoning-budget: forced into forcing state (manual transition)\n");
return true;
}
+4
View File
@@ -40,3 +40,7 @@ struct llama_sampler * common_reasoning_budget_init(
common_reasoning_budget_state initial_state = REASONING_BUDGET_IDLE);
common_reasoning_budget_state common_reasoning_budget_get_state(const struct llama_sampler * smpl);
// Manually transition the reasoning budget sampler into the FORCING state.
// Returns true if the transition occurred.
bool common_reasoning_budget_force(struct llama_sampler * smpl);
+8
View File
@@ -661,6 +661,14 @@ uint32_t common_sampler_get_seed(const struct common_sampler * gsmpl) {
return llama_sampler_get_seed(gsmpl->chain);
}
bool common_sampler_reasoning_budget_force(struct common_sampler * gsmpl) {
if (!gsmpl) {
return false;
}
return common_reasoning_budget_force(gsmpl->rbudget);
}
// helpers
llama_token_data_array * common_sampler_get_candidates(struct common_sampler * gsmpl, bool do_sort) {
+3
View File
@@ -87,6 +87,9 @@ std::vector<llama_token> common_sampler_sample_and_accept_n(struct common_sample
uint32_t common_sampler_get_seed(const struct common_sampler * gsmpl);
// force the reasoning budget sampler (if any) to begin forcing its end sequence now.
bool common_sampler_reasoning_budget_force(struct common_sampler * gsmpl);
// helpers
// access the internal list of current candidate tokens
+34 -12
View File
@@ -1317,6 +1317,40 @@ static uint32_t common_get_enabled_speculative_configs(const std::vector<common_
return result;
}
int32_t common_speculative_n_max(const common_params_speculative * spec) {
int32_t n_max = 0;
for (const auto type : spec->types) {
switch (type) {
case COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE:
case COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3:
case COMMON_SPECULATIVE_TYPE_DRAFT_MTP:
n_max = std::max(n_max, std::max(0, spec->draft.n_max));
break;
case COMMON_SPECULATIVE_TYPE_NGRAM_SIMPLE:
n_max = std::max(n_max, (int32_t) spec->ngram_simple.size_m);
break;
case COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K:
n_max = std::max(n_max, (int32_t) spec->ngram_map_k.size_m);
break;
case COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K4V:
n_max = std::max(n_max, (int32_t) spec->ngram_map_k4v.size_m);
break;
case COMMON_SPECULATIVE_TYPE_NGRAM_MOD:
n_max = std::max(n_max, std::max(0, spec->ngram_mod.n_max));
break;
case COMMON_SPECULATIVE_TYPE_NGRAM_CACHE:
n_max = std::max(n_max, (int32_t) 8);
break;
case COMMON_SPECULATIVE_TYPE_NONE:
case COMMON_SPECULATIVE_TYPE_COUNT:
break;
}
}
return n_max;
}
// initialization of the speculative decoding system
//
common_speculative * common_speculative_init(common_params_speculative & params, uint32_t n_seq) {
@@ -1325,8 +1359,6 @@ common_speculative * common_speculative_init(common_params_speculative & params,
{
uint32_t enabled_configs = common_get_enabled_speculative_configs(params.types);
bool has_draft_model_path = !params.draft.mparams.path.empty();
bool has_draft_simple = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE));
bool has_draft_eagle3 = false; // TODO PR-18039: if params.speculative.eagle3
bool has_mtp = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_MTP)) && params.draft.ctx_dft != nullptr;
@@ -1359,16 +1391,6 @@ common_speculative * common_speculative_init(common_params_speculative & params,
if (has_ngram_cache) {
configs.push_back(common_speculative_config(COMMON_SPECULATIVE_TYPE_NGRAM_CACHE, params));
}
if (has_draft_simple) {
if (!has_draft_model_path) {
LOG_WRN("%s: draft model is not specified - cannot use 'draft' type\n", __func__);
has_draft_simple = false;
}
} else if (has_draft_model_path && !has_mtp && !has_draft_eagle3) {
LOG_WRN("%s: draft model is specified but 'draft' speculative type is not explicitly enabled - enabling it\n", __func__);
has_draft_simple = true;
}
if (has_draft_simple) {
configs.push_back(common_speculative_config(COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE, params));
}
+3
View File
@@ -20,6 +20,9 @@ enum common_speculative_type common_speculative_type_from_name(const std::string
// convert type to string
std::string common_speculative_type_to_str(enum common_speculative_type type);
// return the max number of draft tokens based on the speculative parameters
int32_t common_speculative_n_max(const common_params_speculative * spec);
common_speculative * common_speculative_init(common_params_speculative & params, uint32_t n_seq);
void common_speculative_free(common_speculative * spec);
+4
View File
@@ -47,6 +47,7 @@ TEXT_MODEL_MAP: dict[str, str] = {
"DeepseekForCausalLM": "deepseek",
"DeepseekV2ForCausalLM": "deepseek",
"DeepseekV3ForCausalLM": "deepseek",
"DeepseekV32ForCausalLM": "deepseek",
"DistilBertForMaskedLM": "bert",
"DistilBertForSequenceClassification": "bert",
"DistilBertModel": "bert",
@@ -57,6 +58,7 @@ TEXT_MODEL_MAP: dict[str, str] = {
"Ernie4_5_ForCausalLM": "ernie",
"Ernie4_5_MoeForCausalLM": "ernie",
"EuroBertModel": "bert",
"Exaone4_5_ForConditionalGeneration": "exaone",
"Exaone4ForCausalLM": "exaone",
"ExaoneForCausalLM": "exaone",
"ExaoneMoEForCausalLM": "exaone",
@@ -236,8 +238,10 @@ TEXT_MODEL_MAP: dict[str, str] = {
MMPROJ_MODEL_MAP: dict[str, str] = {
"AudioFlamingo3ForConditionalGeneration": "ultravox",
"CogVLMForCausalLM": "cogvlm",
"DeepseekOCR2ForCausalLM": "deepseek",
"DeepseekOCRForCausalLM": "deepseek",
"DotsOCRForCausalLM": "dotsocr",
"Exaone4_5_ForConditionalGeneration": "exaone",
"Gemma3ForConditionalGeneration": "gemma",
"Gemma3nForConditionalGeneration": "gemma",
"Gemma4ForConditionalGeneration": "gemma",
+18 -3
View File
@@ -915,6 +915,8 @@ class ModelBase:
gguf.MODEL_TENSOR.SSM_CONV1D_Q,
gguf.MODEL_TENSOR.SSM_CONV1D_K,
gguf.MODEL_TENSOR.SSM_CONV1D_V,
# DSA indexer weights should be F32
gguf.MODEL_TENSOR.INDEXER_PROJ,
)
)
or new_name[-7:] not in (".weight", ".lora_a", ".lora_b")
@@ -1138,7 +1140,7 @@ class TextModel(ModelBase):
# Skip multimodal tensors
if name.startswith(("mlp", "vit.", "vpm.", "siglip2.", "conformer.", "merger.", "resampler.", "sound_encoder.", "sound_projection.", "speech_embeddings.")) \
or "visual." in name or "vision." in name or "audio." in name or "talker." in name \
or "vision_" in name or "audio_" in name or "sam_model" in name \
or "vision_" in name or "audio_" in name \
or "token2wav." in name or "code2wav." in name \
or "projector." in name or "pre_mm_projector_norm" in name \
or "image_newline" in name or "view_seperator" in name \
@@ -1445,6 +1447,9 @@ class TextModel(ModelBase):
if chkhsh == "0fe1cf6eda062318a1af7270f3331a85c539a01778ff948e24388e949c5282f4":
# ref: https://huggingface.co/evilfreelancer/ruGPT3XL
res = "gpt-2"
if chkhsh == "9e454714343b69b99b71795c1d27a68c2a1d15dab111f4d353109f966af29da7":
# ref: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B
res = "lfm2"
if chkhsh == "0ef9807a4087ebef797fc749390439009c3b9eda9ad1a097abbe738f486c01e5":
# ref: https://huggingface.co/meta-llama/Meta-Llama-3-8B
res = "llama-bpe"
@@ -1596,7 +1601,7 @@ class TextModel(ModelBase):
# ref: https://huggingface.co/K-intelligence/Midm-2.0-Base-Instruct
res = "midm-2.0"
if chkhsh == "169bf0296a13c4d9b7672313f749eb36501d931022de052aad6e36f2bf34dd51":
# ref: https://huggingface.co/LiquidAI/LFM2-Tokenizer
# ref: https://huggingface.co/LiquidAI/LFM2.5-350M
res = "lfm2"
if chkhsh == "2085e1638f6c377a0aa4ead21b27bb4cb941bf800df86ed391011769c1758dfb":
# ref: https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B
@@ -1687,6 +1692,16 @@ class TextModel(ModelBase):
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
special_vocab.add_to_gguf(self.gguf_writer)
def _set_vocab_whitespace(self) -> None:
tokens, toktypes, _ = self.get_vocab_base()
self.gguf_writer.add_tokenizer_model("whitespace")
self.gguf_writer.add_tokenizer_pre("whitespace") # pinned, not hash-detected: chktxt hash collides with jina-v1-en
self.gguf_writer.add_token_list(tokens)
self.gguf_writer.add_token_types(toktypes)
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
special_vocab.add_to_gguf(self.gguf_writer)
def _set_vocab_hybriddna(self):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
@@ -2578,7 +2593,7 @@ def get_model_architecture(hparams: dict[str, Any], model_type: ModelType) -> st
# Step3-VL keeps text config under text_config but uses a custom top-level architecture.
# For text conversion we route to a dedicated text-only class.
# TODO: refactor this later to avoid adding exception here
if model_type == ModelType.TEXT and arch in ("StepVLForConditionalGeneration", "Sarashina2VisionForCausalLM"):
if model_type == ModelType.TEXT and arch in ("StepVLForConditionalGeneration", "Sarashina2VisionForCausalLM", "Exaone4_5_ForConditionalGeneration"):
return arch
# if "architectures" is found in the sub-config, use that instead
+10 -1
View File
@@ -571,7 +571,16 @@ class JinaBertV2Model(BertModel):
if tokenizer_class == 'BertTokenizer':
super().set_vocab()
elif tokenizer_class == 'RobertaTokenizer':
self._set_vocab_gpt2()
pre_tokenizer_type = None
tokenizer_json_path = self.dir_model / "tokenizer.json"
if tokenizer_json_path.is_file():
with open(tokenizer_json_path, "r", encoding="utf-8") as f:
pre_tokenizer_type = json.load(f).get("pre_tokenizer", {}).get("type")
if pre_tokenizer_type == "Whitespace":
self._set_vocab_whitespace()
else:
self._set_vocab_gpt2()
self.gguf_writer.add_token_type_count(2)
else:
raise NotImplementedError(f'Tokenizer {tokenizer_class} is not supported for JinaBertModel')
+85 -12
View File
@@ -16,10 +16,14 @@ from .qwen import QwenModel
@ModelBase.register("DeepseekOCRForCausalLM")
class DeepseekOCRVisionModel(MmprojModel):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.clip_projector_type = gguf.VisionProjectorType.DEEPSEEKOCR
def set_gguf_parameters(self):
super().set_gguf_parameters()
hparams = self.hparams
self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.DEEPSEEKOCR)
self.gguf_writer.add_clip_projector_type(self.clip_projector_type)
# default values below are taken from HF tranformers code
self.gguf_writer.add_vision_attention_layernorm_eps(hparams.get("layer_norm_eps", 1e-6))
self.gguf_writer.add_vision_use_gelu(True)
@@ -49,22 +53,27 @@ class DeepseekOCRVisionModel(MmprojModel):
raise ValueError("DeepseekOCR model requires 'vision_config' in the model configuration, but it was not found")
vision_config['sam'] = vision_config['width']['sam_vit_b']
vision_config.update(vision_config['width']['clip-l-14-224'])
vision_config['hidden_size'] = vision_config['width']
vision_config['num_heads'] = vision_config['heads']
vision_config['intermediate_size'] = vision_config['heads'] * 4
if vision_config['width'].get('clip-l-14-224') is not None:
vision_config.update(vision_config['width']['clip-l-14-224'])
if isinstance(vision_config['width'], int):
vision_config['hidden_size'] = vision_config['width']
if vision_config.get('heads') is not None:
vision_config['num_heads'] = vision_config['heads']
vision_config['intermediate_size'] = vision_config['heads'] * 4
return vision_config
def tensor_force_quant(self, name, new_name, bid, n_dims):
if ".embeddings." in name or 'pos_embed' in name:
return gguf.GGMLQuantizationType.F32
if ".rel_pos_h" in name or '.rel_pos_w' in name:
return gguf.GGMLQuantizationType.F32
if ".neck." in name or ".net_" in name:
return gguf.GGMLQuantizationType.F32
for nq_name in ('.embeddings.', 'pos_embed', '.rel_pos_h', '.rel_pos_w', '.neck.', '.net_'):
if nq_name in name:
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name.endswith("view_seperator"):
data_torch = data_torch.unsqueeze(0)
yield from super().modify_tensors(data_torch, name, bid)
@classmethod
def filter_tensors(cls, item: tuple[str, Callable[[], Tensor]]) -> tuple[str, Callable[[], Tensor]] | None:
name, gen = item
@@ -81,6 +90,33 @@ class DeepseekOCRVisionModel(MmprojModel):
return super().filter_tensors((name, gen))
@ModelBase.register("DeepseekOCR2ForCausalLM")
class DeepseekOCR2VisionModel(DeepseekOCRVisionModel):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.clip_projector_type = gguf.VisionProjectorType.DEEPSEEKOCR2
def set_gguf_parameters(self):
# the vision tower's qwen2 encoder is built from fixed defaults,
# see build_qwen2_decoder_as_encoder() in deepencoderv2.py
if self.hparams.get("patch_size") is None:
self.hparams["patch_size"] = 16
if self.hparams.get("intermediate_size") is None:
self.hparams["intermediate_size"] = 4864
if self.hparams.get("num_attention_heads") is None:
self.hparams["num_attention_heads"] = 14
super().set_gguf_parameters()
# qwen2 encoder is GQA: 14 Q heads, 2 KV heads
self.gguf_writer.add_vision_head_count_kv(2)
def get_vision_config(self) -> dict[str, Any]:
vision_config = super().get_vision_config()
vision_config['hidden_size'] = vision_config['width']['qwen2-0-5b']['dim']
if vision_config.get('layers') is None:
vision_config['layers'] = 24
return vision_config
@ModelBase.register("DeepseekForCausalLM")
class DeepseekModel(TextModel):
model_arch = gguf.MODEL_ARCH.DEEPSEEK
@@ -188,13 +224,21 @@ class DeepseekV2Model(TextModel):
self.origin_hf_arch = hparams.get('architectures', [None])[0]
# special handling for Deepseek OCR
if self.origin_hf_arch == "DeepseekOCRForCausalLM":
if self.origin_hf_arch in ("DeepseekOCRForCausalLM", "DeepseekOCR2ForCausalLM"):
self.model_arch = gguf.MODEL_ARCH.DEEPSEEK2OCR
self.gguf_writer.arch = gguf.MODEL_ARCH_NAMES[self.model_arch]
self.gguf_writer.add_architecture()
# default jinja template
self.gguf_writer.add_chat_template("{% for m in messages %}{{m['content']}}{% endfor %}")
@classmethod
def filter_tensors(cls, item: tuple[str, Callable[[], Tensor]]) -> tuple[str, Callable[[], Tensor]] | None:
name, _ = item
# DeepSeek-OCR vision encoder (SAM + DeepSeek-OCR-2 qwen2 tower)
if "sam_model" in name or "qwen2_model" in name:
return None
return super().filter_tensors(item)
def set_vocab(self):
try:
self._set_vocab_gpt2()
@@ -386,3 +430,32 @@ class DeepseekV2Model(TextModel):
experts = [k for d in self._experts for k in d.keys()]
if len(experts) > 0:
raise ValueError(f"Unprocessed experts: {experts}")
@ModelBase.register("DeepseekV32ForCausalLM")
class DeepseekV32Model(DeepseekV2Model):
model_arch = gguf.MODEL_ARCH.DEEPSEEK32
skip_mtp = False
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.block_count = self.hparams["num_hidden_layers"] + self.hparams.get("num_nextn_predict_layers", 0)
self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)
def set_vocab(self):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
assert getattr(tokenizer, "add_bos_token", False), "Change value of add_bos_token to true in tokenizer_config.json file."
self._set_vocab_gpt2()
def set_gguf_parameters(self):
super().set_gguf_parameters()
# NextN/MTP prediction layers
if (num_nextn_predict_layers := self.hparams.get("num_nextn_predict_layers")) is not None:
self.gguf_writer.add_nextn_predict_layers(num_nextn_predict_layers)
# DSA indexer parameters
self.gguf_writer.add_indexer_head_count(self.hparams["index_n_heads"])
self.gguf_writer.add_indexer_key_length(self.hparams["index_head_dim"])
self.gguf_writer.add_indexer_top_k(self.hparams["index_topk"])
+97 -2
View File
@@ -3,14 +3,15 @@ from __future__ import annotations
import math
from pathlib import Path
from typing import Iterable, TYPE_CHECKING
from typing import Callable, Iterable, TYPE_CHECKING
import torch
if TYPE_CHECKING:
from torch import Tensor
from .base import ModelBase, TextModel, gguf
from .base import MmprojModel, ModelBase, TextModel, gguf
from .qwenvl import Qwen2VLVisionModel
@ModelBase.register("ExaoneForCausalLM")
@@ -208,3 +209,97 @@ class ExaoneMoEModel(Exaone4Model):
experts = [k for d in self._experts for k in d.keys()]
if len(experts) > 0:
raise ValueError(f"Unprocessed experts: {experts}")
@ModelBase.register("Exaone4_5_ForConditionalGeneration")
class Exaone4_5_TextModel(Exaone4Model):
"""Text tower of EXAONE 4.5; Tensors match EXAONE4"""
model_arch = gguf.MODEL_ARCH.EXAONE4
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
n_nextn = int(self.hparams.get("num_nextn_predict_layers", 0) or 0)
if n_nextn > 0:
self.block_count = self.hparams["num_hidden_layers"] + n_nextn
self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)
def set_gguf_parameters(self):
super().set_gguf_parameters()
n_nextn = int(self.hparams.get("num_nextn_predict_layers", 0) or 0)
if n_nextn > 0:
self.gguf_writer.add_nextn_predict_layers(n_nextn)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name.startswith("mtp."):
n_nextn = int(self.hparams.get("num_nextn_predict_layers", 0) or 0)
if n_nextn <= 0:
return
nh = self.hparams["num_hidden_layers"]
if ".layers." in name:
share = self.hparams.get("mtp_share_layers", False)
mtp_bid = bid if bid is not None else 0
if share:
for k in range(n_nextn):
nn = name.replace(f"mtp.layers.{mtp_bid}", f"model.layers.{nh + k}")
yield from super().modify_tensors(data_torch, nn, nh + k)
return
name = name.replace(f"mtp.layers.{mtp_bid}", f"model.layers.{mtp_bid + nh}")
else:
remapper = {
"mtp.fc": gguf.MODEL_TENSOR.NEXTN_EH_PROJ,
"mtp.pre_fc_norm_embedding": gguf.MODEL_TENSOR.NEXTN_ENORM,
"mtp.pre_fc_norm_hidden": gguf.MODEL_TENSOR.NEXTN_HNORM,
"mtp.norm": gguf.MODEL_TENSOR.NEXTN_SHARED_HEAD_NORM,
}
_n = Path(name)
key = _n.stem
if key not in remapper:
return
for bid_mtp in range(nh, self.block_count):
mapped_name = self.format_tensor_name(remapper[key], bid_mtp, suffix=_n.suffix)
yield from ModelBase.modify_tensors(self, data_torch, mapped_name, bid_mtp)
return
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("Exaone4_5_ForConditionalGeneration")
class Exaone4_5VisionModel(Qwen2VLVisionModel):
"""Vision tower for EXAONE 4.5; Qwen2-VL-style ViT (GQA) + patch merger"""
@classmethod
def filter_tensors(cls, item: tuple[str, Callable[[], Tensor]]) -> tuple[str, Callable[[], Tensor]] | None:
name, gen = item
name = name.replace("model.visual.", "visual.", 1)
return super().filter_tensors((name, gen))
def set_gguf_parameters(self):
MmprojModel.set_gguf_parameters(self)
assert self.hparams_vision is not None
hparams = self.hparams_vision
self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.EXAONE4_5)
self.gguf_writer.add_vision_use_silu(True)
self.gguf_writer.add_vision_min_pixels(self.preprocessor_config["min_pixels"])
self.gguf_writer.add_vision_max_pixels(self.preprocessor_config["max_pixels"])
num_kv_head = self.find_vparam(["num_key_value_heads"], optional=True)
if num_kv_head is not None:
self.gguf_writer.add_vision_head_count_kv(num_kv_head)
eps = hparams.get("rms_norm_eps", self.global_config.get("rms_norm_eps", 1e-6))
self.gguf_writer.add_vision_attention_layernorm_eps(eps)
if (window_size := hparams.get("window_size")) is not None:
self.gguf_writer.add_vision_window_size(window_size)
fullatt_block_indexes = hparams.get("fullatt_block_indexes")
if fullatt_block_indexes:
n_wa_pattern = fullatt_block_indexes[0] + 1
for i in range(1, len(fullatt_block_indexes)):
if fullatt_block_indexes[i] - fullatt_block_indexes[i - 1] != n_wa_pattern:
raise ValueError(f"Invalid EXAONE4.5 fullatt_block_indexes: {fullatt_block_indexes}")
self.gguf_writer.add_vision_n_wa_pattern(n_wa_pattern)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if ".qkv." in name:
yield from ModelBase.modify_tensors(self, data_torch, name, bid)
return
yield from Qwen2VLVisionModel.modify_tensors(self, data_torch, name, bid)
+6 -5
View File
@@ -786,14 +786,15 @@ class Gemma4VisionAudioModel(MmprojModel):
super().set_gguf_parameters()
# vision params
assert self.hparams_vision is not None
self.gguf_writer.add_clip_vision_projector_type(gguf.VisionProjectorType.GEMMA4V)
self.gguf_writer.add_vision_attention_layernorm_eps(self.hparams.get("layer_norm_eps", 1e-6))
self.gguf_writer.add_vision_attention_layernorm_eps(self.hparams_vision.get("layer_norm_eps", 1e-6))
# audio params
if self.hparams_audio:
self.gguf_writer.add_clip_audio_projector_type(gguf.VisionProjectorType.GEMMA4A)
self.gguf_writer.add_audio_num_mel_bins(self.hparams_audio["feat_in"])
self.gguf_writer.add_audio_attention_layernorm_eps(1e-5)
assert self.hparams_audio is not None
self.gguf_writer.add_clip_audio_projector_type(gguf.VisionProjectorType.GEMMA4A)
self.gguf_writer.add_audio_num_mel_bins(self.hparams_audio["feat_in"])
self.gguf_writer.add_audio_attention_layernorm_eps(self.hparams_audio.get("layer_norm_eps", 1e-6))
def is_audio_tensor(self, name: str) -> bool:
return "audio_tower" in name or "embed_audio" in name
+3 -1
View File
@@ -139,7 +139,7 @@ models = [
{"name": "seed-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base", },
{"name": "a.x-4.0", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/skt/A.X-4.0", },
{"name": "midm-2.0", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/K-intelligence/Midm-2.0-Base-Instruct", },
{"name": "lfm2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LiquidAI/LFM2-Tokenizer"},
{"name": "lfm2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LiquidAI/LFM2.5-350M", },
{"name": "exaone4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B", },
{"name": "mellum", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/JetBrains/Mellum-4b-base", },
{"name": "modern-bert", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/answerdotai/ModernBERT-base", },
@@ -183,6 +183,8 @@ pre_computed_hashes = [
# jina-v2-de variants
{"name": "jina-v2-de", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/aari1995/German_Semantic_V3", "chkhsh": "b3d1dd861f1d4c5c0d2569ce36baf3f90fe8a102db3de50dd71ff860d91be3df"},
{"name": "gpt-2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/evilfreelancer/ruGPT3XL", "chkhsh": "0fe1cf6eda062318a1af7270f3331a85c539a01778ff948e24388e949c5282f4"},
# lfm2 variants
{"name": "lfm2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LiquidAI/LFM2.5-8B-A1B", "chkhsh": "9e454714343b69b99b71795c1d27a68c2a1d15dab111f4d353109f966af29da7"},
]
+1 -1
View File
@@ -8,7 +8,7 @@
- [Performance Reference](#performance-reference)
- [Docker](#docker)
- [Linux](#linux)
- [Windows](#windows)
- [Windows](#windows-1)
- [Environment Variable](#environment-variable)
- [Design Rule](#design-rule)
- [Known Issue](#known-issues)
+18 -1
View File
@@ -72,10 +72,13 @@ The ZenDNN backend accelerates **matrix multiplication (MUL_MAT)** and **expert-
|:----------------------:|:-------:|:---------------------------------------------:|
| FP32 | Support | Full precision floating point |
| BF16 | Support | BFloat16 (best performance on Zen 4/Zen 5) |
| Q8_0 | Support | 8-bit quantized weights via [dynamic quantization](https://github.com/amd/ZenDNN/blob/main/docs/operator/lowoha_matmul_operator.md) |
*Notes:*
- **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
- **Q8_0** is available for quantized model weights since ZenDNN supports dynamic quantization [LowOHA MatMul operator](https://github.com/amd/ZenDNN/blob/main/docs/operator/lowoha_matmul_operator.md).
- Other quantization formats fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.
## Linux
@@ -140,6 +143,15 @@ Download LLaMA 3.1 8B Instruct BF16 model:
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
```
You can also use a Q8_0 GGUF model:
```sh
# Download a Q8_0 GGUF model from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF \
Llama-3.1-8B-Instruct-Q8_0.gguf \
--local-dir models/
```
#### 2. Start Server
Run llama.cpp server with ZenDNN acceleration:
@@ -176,6 +188,10 @@ export ZENDNNL_MATMUL_ALGO=1 # Blocked AOCL DLP algo (recommended)
For more details on available algorithms, see the [ZenDNN MatMul Algorithm Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md#algorithm-details).
### Q8_0 Performance Notes
Q8_0 support is mainly beneficial for prompt processing / prefill workloads where large matrix multiplications dominate execution. Token generation performance may remain close to the standard CPU backend depending on the model, batch size, number of threads, and CPU topology.
### Profiling and Debugging
For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/logging.md).
@@ -184,6 +200,7 @@ For detailed profiling and logging options, refer to the [ZenDNN Logging Documen
- **Limited operation support**: Currently matrix multiplication (MUL_MAT) and expert-based matrix multiplication (MUL_MAT_ID) are accelerated via ZenDNN. Other operations fall back to the standard CPU backend. Future updates may expand supported operations.
- **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
- **Q8_0 support scope**: Q8_0 acceleration is available for supported matrix multiplication paths. Other quantization formats still fall back to the standard CPU backend.
- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.
## Q&A
@@ -202,7 +219,7 @@ A: ZenDNN is optimized specifically for AMD processors. While it may work on oth
**Q: Does ZenDNN support quantized models?**
A: Currently, ZenDNN primarily supports FP32 and BF16 data types. Quantized model support is not available at this time.
A: Yes. The ZenDNN backend supports Q8_0 quantized models for supported matrix multiplication operations. FP32 and BF16 are also supported. Other quantization formats may fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.
**Q: Why is my inference not faster with ZenDNN?**
+1
View File
@@ -22,6 +22,7 @@ The following sections describe how to build with different backends and options
* [HIP](#hip)
* [Vulkan](#vulkan)
* [CANN](#cann)
* [ZenDNN](#zendnn)
* [Arm® KleidiAI™](#arm-kleidiai)
* [OpenCL](#opencl)
* [Android](#android-1)
+1 -1
View File
@@ -55,7 +55,7 @@ Legend:
| GELU | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_ERF | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_QUICK | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GET_ROWS | ❌ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| GET_ROWS | ❌ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | | ✅ | 🟡 | ❌ | ❌ |
| GET_ROWS_BACK | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GROUP_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| HARDSIGMOID | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
+2004 -1555
View File
File diff suppressed because it is too large Load Diff
+5
View File
@@ -323,3 +323,8 @@ statistics ngram_map_k: #calls(b,g,a) = 6 1690 26, #gen drafts = 26, #acc drafts
- `#gen tokens`: number of tokens generated by this implementation (including rejected tokens)
- `#acc tokens`: number of tokens accepted by the main model
- `dur(b,g,a): durations of begin (new prompt), generation and accumulation (process acceptance).
## Benchmarking
To measure the end-to-end effect of speculative decoding (throughput, latency, and draft acceptance) across diverse prompts, see the SPEED-Bench client in [tools/server/bench/speed-bench](../tools/server/bench/speed-bench/README.md).
It runs against a running `llama-server` and can compare a baseline run against a speculative-decoding run.
+1 -1
View File
@@ -5,7 +5,7 @@ project("ggml" C CXX ASM)
### GGML Version
set(GGML_VERSION_MAJOR 0)
set(GGML_VERSION_MINOR 13)
set(GGML_VERSION_PATCH 0)
set(GGML_VERSION_PATCH 1)
set(GGML_VERSION_BASE "${GGML_VERSION_MAJOR}.${GGML_VERSION_MINOR}.${GGML_VERSION_PATCH}")
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/")
+7 -3
View File
@@ -381,11 +381,15 @@ extern "C" {
// - most tensors have n_segments == 1 and a contiguous slice of the tensor data
// - some tensors have an inhomogenenous data layout along the split axis,
// those tensors are divided into segments which are each individually split across devices
// - ne has one entry per segment and device that add up to ggml_tensor::ne for that axis,
// the outer/inner loops are over segments/devices like [seg0_dev0, seg0_dev1, seg1_dev0, seg1_dev1],
// - ne has one entry per segment and device and that segment repeats nr times,
// in total when accounting for repetitions the segments add up to ggml_tensor::ne for that axis,
// the outer/inner loops are over segments/devices like [seg0_dev0_r0, seg0_dev1_r0, seg0_dev0_r1, seg0_dev1_r1, seg1_dev0_r0, seg1_dev1_r0],
// - for example, a transformer may have a fused QKV matrix rather than 3 matrices, those would be 3 separate segments
// that each need to be split individually across devices so that each device gets a slice of Q, K, and V
// that each need to be split individually across devices so that each device gets a slice of Q, K, and V,
// the Q matrix can be larger than the K and V matrices so this can either be expressed as 3 segments or as 2 segments
// where the segment for K/V repeats twice
int64_t ne[16*GGML_BACKEND_META_MAX_DEVICES];
uint32_t nr[16];
uint32_t n_segments;
};
+143 -136
View File
@@ -487,6 +487,9 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(co
static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
ggml_backend_meta_simple_tensor_container & stc, const struct ggml_tensor * tensor, bool assume_sync) {
// FIXME Currently this function preserves/erases the information in n_segments and nr in an inconsistent way.
// Since the operations in question are developed specifically for llama.cpp this currently does not manifest as a bug there.
// However, in a broader ggml context with arbitrary ggml graphs this can lead to unexpected results.
const size_t n_bufs = ggml_backend_meta_buffer_n_bufs(tensor->buffer);
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) tensor->buffer->context;
@@ -497,11 +500,11 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
for (size_t j = 0; j < n_bufs; j++) {
int64_t sum_a = 0;
for (size_t s = 0; s < a.n_segments; s++) {
sum_a += a.ne[s*n_bufs + j];
sum_a += a.ne[s*n_bufs + j] * a.nr[s];
}
int64_t sum_b = 0;
for (size_t s = 0; s < b.n_segments; s++) {
sum_b += b.ne[s*n_bufs + j];
sum_b += b.ne[s*n_bufs + j] * b.nr[s];
}
if (sum_a != sum_b) {
return false;
@@ -511,7 +514,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
};
auto handle_generic = [&](const std::vector<ggml_backend_meta_split_state> & src_ss, bool scalar_only) -> ggml_backend_meta_split_state {
ggml_backend_meta_split_state ret = {GGML_BACKEND_SPLIT_AXIS_NONE, {0}, 1};
ggml_backend_meta_split_state ret = {GGML_BACKEND_SPLIT_AXIS_NONE, {0}, {1}, 1};
for (size_t i = 0; i < GGML_MAX_SRC; i++) {
if (tensor->src[i] == nullptr || tensor->src[i] == tensor) {
continue;
@@ -519,15 +522,15 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
if (ret.axis == GGML_BACKEND_SPLIT_AXIS_NONE) {
ret = src_ss[i];
} else if (!split_states_equal(src_ss[i], ret)) {
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
break;
}
}
if (ret.axis == GGML_BACKEND_SPLIT_AXIS_NONE) {
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
if (scalar_only && ret.axis >= 0 && ret.axis < GGML_MAX_DIMS) {
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
GGML_ASSERT(ret.axis != GGML_BACKEND_SPLIT_AXIS_UNKNOWN);
return ret;
@@ -571,42 +574,24 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
auto handle_mul_mat = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
return {GGML_BACKEND_SPLIT_AXIS_MIRRORED, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_MIRRORED, {0}, {1}, 1};
}
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_1 && src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
ggml_backend_meta_split_state ret = src_ss[0];
ret.axis = GGML_BACKEND_SPLIT_AXIS_0;
ret.nr[0] = 1;
ret.n_segments = 1;
return ret;
}
if (src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_1 && src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
ggml_backend_meta_split_state ret = src_ss[1];
ret.n_segments = 1;
return ret;
return src_ss[1];
}
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_0 && src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_0) {
GGML_ASSERT(split_states_equal(src_ss[0], src_ss[1]));
return {assume_sync ? GGML_BACKEND_SPLIT_AXIS_MIRRORED : GGML_BACKEND_SPLIT_AXIS_PARTIAL, {0}, 1};
return {assume_sync ? GGML_BACKEND_SPLIT_AXIS_MIRRORED : GGML_BACKEND_SPLIT_AXIS_PARTIAL, {0}, {1}, 1};
}
GGML_ABORT("fatal error");
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
};
auto handle_cpy = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (src_ss[0].axis >= 0 && src_ss[0].axis < GGML_MAX_DIMS) {
int64_t ne_split_src = tensor->src[0]->ne[0];
for (int dim = 1; dim <= src_ss[0].axis; dim++) {
ne_split_src *= tensor->src[0]->ne[dim];
}
int64_t ne_split_dst = 1;
for (int dim = 0; dim < GGML_MAX_DIMS; dim++) {
ne_split_dst *= tensor->ne[dim];
if (ne_split_dst == ne_split_src) {
return {ggml_backend_meta_split_axis(dim), {0}, 1};
}
}
}
return handle_generic(src_ss, /*scalar_only =*/ false);
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
};
auto handle_reshape = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
@@ -615,33 +600,25 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
case GGML_BACKEND_SPLIT_AXIS_1:
case GGML_BACKEND_SPLIT_AXIS_2:
case GGML_BACKEND_SPLIT_AXIS_3: {
GGML_ASSERT(!ggml_is_permuted(tensor) && !ggml_is_permuted(tensor->src[0]));
if (src_ss[0].axis == ggml_n_dims(tensor->src[0]) - 1) {
return {ggml_backend_meta_split_axis(ggml_n_dims(tensor) - 1), {0}, 1};
GGML_ASSERT(src_ss[0].n_segments == 1);
if (src_ss[0].axis == ggml_n_dims(tensor->src[0]) - 1 && src_ss[0].nr[0] == 1) {
return {ggml_backend_meta_split_axis(ggml_n_dims(tensor) - 1), {0}, {1}, 1};
}
std::vector<int64_t> base_ne_in;
base_ne_in.reserve(GGML_MAX_DIMS - src_ss[0].axis);
{
base_ne_in.push_back(1);
int dim = 0;
for (; dim <= src_ss[0].axis; dim++) {
base_ne_in[0] *= tensor->src[0]->ne[dim];
}
for (; dim <= GGML_MAX_DIMS; dim++) {
base_ne_in.push_back(base_ne_in.back() * tensor->src[0]->ne[dim]);
}
int64_t base_ne_in = tensor->src[0]->ne[0];
for (int dim = 1; dim <= src_ss[0].axis; dim++) {
base_ne_in *= tensor->src[0]->ne[dim];
}
base_ne_in /= src_ss[0].nr[0];
int64_t base_ne_out = 1;
for (int dim = 0; dim < GGML_MAX_DIMS; dim++) {
const int64_t base_ne_out_next = base_ne_out *= tensor->ne[dim];
for (const int64_t & bni : base_ne_in) {
if (bni == base_ne_out_next) {
return {ggml_backend_meta_split_axis(dim), {0}, 1};
}
if (base_ne_out_next % base_ne_in == 0) {
return {ggml_backend_meta_split_axis(dim), {0}, {uint32_t(base_ne_out_next/base_ne_in)}, 1};
}
if (base_ne_out_next > base_ne_in[0]) {
GGML_ASSERT(dim + 1 < GGML_MAX_DIMS);
return {ggml_backend_meta_split_axis(dim + 1), {0}, 1};
if (base_ne_out_next > base_ne_in) {
GGML_ASSERT(src_ss[0].n_segments == 1);
GGML_ASSERT(src_ss[0].nr[0] == 1);
return {ggml_backend_meta_split_axis(dim), {0}, {1}, 1};
}
base_ne_out = base_ne_out_next;
}
@@ -653,11 +630,18 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
}
default: {
GGML_ABORT("fatal error");
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
}
};
auto handle_cpy = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (src_ss[0].axis >= 0 && src_ss[0].axis < GGML_MAX_DIMS) {
return handle_reshape(src_ss);
}
return handle_generic(src_ss, /*scalar_only =*/ false);
};
auto handle_view = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (ggml_is_contiguous(tensor) && ggml_is_contiguous(tensor->src[0])) {
return handle_reshape(src_ss);
@@ -681,7 +665,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
if (!ggml_is_permuted(tensor) && !ggml_is_permuted(tensor->src[0]) && axis >= 0 && axis < GGML_MAX_DIMS-1) {
for (int dim = 0; dim < GGML_MAX_DIMS-1; dim++) {
if (tensor->nb[dim+1] == tensor->src[0]->nb[axis+1]) {
return {ggml_backend_meta_split_axis(dim), {0}, 1};
return {ggml_backend_meta_split_axis(dim), {0}, {1}, 1};
}
}
GGML_ABORT("fatal error");
@@ -690,7 +674,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
return src_ss[0];
}
GGML_ABORT("view of permuted tensor not implemented");
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
};
auto handle_permute = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
@@ -699,7 +683,8 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
case GGML_BACKEND_SPLIT_AXIS_1:
case GGML_BACKEND_SPLIT_AXIS_2:
case GGML_BACKEND_SPLIT_AXIS_3: {
return {ggml_backend_meta_split_axis(tensor->op_params[src_ss[0].axis]), {0}, 1};
GGML_ASSERT(src_ss[0].n_segments == 1 || src_ss[0].nr[0] == 1);
return {ggml_backend_meta_split_axis(tensor->op_params[src_ss[0].axis]), {0}, {src_ss[0].nr[0]}, 1};
}
case GGML_BACKEND_SPLIT_AXIS_MIRRORED:
case GGML_BACKEND_SPLIT_AXIS_PARTIAL: {
@@ -707,7 +692,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
}
default: {
GGML_ABORT("fatal error");
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
}
};
@@ -716,7 +701,8 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
switch (src_ss[0].axis) {
case GGML_BACKEND_SPLIT_AXIS_0:
case GGML_BACKEND_SPLIT_AXIS_1: {
return {ggml_backend_meta_split_axis(int(src_ss[0].axis) ^ 1), {0}, 1};
GGML_ASSERT(src_ss[0].n_segments == 1 || src_ss[0].nr[0] == 1);
return {ggml_backend_meta_split_axis(int(src_ss[0].axis) ^ 1), {0}, {src_ss[0].nr[0]}, 1};
}
case GGML_BACKEND_SPLIT_AXIS_2:
case GGML_BACKEND_SPLIT_AXIS_3:
@@ -726,7 +712,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
}
default: {
GGML_ABORT("fatal error");
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
}
};
@@ -764,16 +750,16 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
GGML_ASSERT( src_ss[2].axis == GGML_BACKEND_SPLIT_AXIS_2);
GGML_ASSERT(tensor->src[4] == nullptr || src_ss[3].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED);
GGML_ASSERT(tensor->src[4] == nullptr || src_ss[4].axis == GGML_BACKEND_SPLIT_AXIS_0);
return {GGML_BACKEND_SPLIT_AXIS_1, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_1, {0}, {1}, 1};
};
auto handle_ssm_conv = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (src_ss[0].axis == src_ss[1].axis) {
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_0) {
return {GGML_BACKEND_SPLIT_AXIS_1, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_1, {0}, {1}, 1};
}
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_1) {
return {GGML_BACKEND_SPLIT_AXIS_0, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_0, {0}, {1}, 1};
}
}
return handle_generic(src_ss, /*scalar_only =*/ false);
@@ -781,8 +767,8 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
auto handle_gated_delta_net = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED &&
src_ss[2].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[3].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED &&
src_ss[4].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[5].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
src_ss[2].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[3].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED &&
src_ss[4].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[5].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
return src_ss[0];
}
GGML_ASSERT(src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_1);
@@ -793,12 +779,12 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
// state shape is (S_v*S_v*H, K, n_seqs); the heads dim is nested inside axis 0,
// so a head-aligned split on the input cache reshapes to axis 0 here (not axis 2).
GGML_ASSERT(src_ss[5].axis == GGML_BACKEND_SPLIT_AXIS_2 || src_ss[5].axis == GGML_BACKEND_SPLIT_AXIS_1 || src_ss[5].axis == GGML_BACKEND_SPLIT_AXIS_0);
return {GGML_BACKEND_SPLIT_AXIS_0, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_0, {0}, {1}, 1};
};
auto calculate_split_state = [&]() -> ggml_backend_meta_split_state {
if (ggml_nelements(tensor) == 0) {
return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
if (ggml_backend_buffer_get_usage(tensor->buffer) != GGML_BACKEND_BUFFER_USAGE_COMPUTE && tensor->view_src == nullptr) {
ggml_backend_dev_t dev = ggml_backend_buft_get_device(ggml_backend_buffer_get_type(tensor->buffer));
@@ -807,19 +793,21 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
if (ret.axis >= 0 && ret.axis <= GGML_MAX_DIMS) {
const int64_t granularity = ret.axis == GGML_BACKEND_SPLIT_AXIS_0 ? ggml_blck_size(tensor->type) : 1;
int64_t ne_sum = 0;
for (size_t sj = 0; sj < ret.n_segments*n_bufs; sj++) {
GGML_ASSERT(ret.ne[sj] % granularity == 0);
ne_sum += ret.ne[sj];
for (size_t s = 0; s < ret.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
GGML_ASSERT(ret.ne[s*n_bufs + j] % granularity == 0);
ne_sum += ret.ne[s*n_bufs + j] * ret.nr[s];
}
}
GGML_ASSERT(ne_sum == tensor->ne[ret.axis]);
}
return ret;
}
std::vector<ggml_backend_meta_split_state> src_ss(GGML_MAX_SRC, {GGML_BACKEND_SPLIT_AXIS_NONE, {0}, 1});
std::vector<ggml_backend_meta_split_state> src_ss(GGML_MAX_SRC, {GGML_BACKEND_SPLIT_AXIS_NONE, {0}, {1}, 1});
for (size_t i = 0; i < GGML_MAX_SRC; i++) {
if (tensor->src[i] == nullptr || tensor->src[i] == tensor) {
src_ss[i] = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
src_ss[i] = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
continue;
}
src_ss[i] = ggml_backend_meta_get_split_state(stc, tensor->src[i], /*assume_sync =*/ true);
@@ -829,7 +817,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
ggml_backend_meta_split_state split_state;
switch (tensor->op) {
case GGML_OP_NONE: {
split_state = {GGML_BACKEND_SPLIT_AXIS_MIRRORED, {0}, 1};
split_state = {GGML_BACKEND_SPLIT_AXIS_MIRRORED, {0}, {1}, 1};
} break;
case GGML_OP_DUP: {
split_state = handle_generic(src_ss, /*scalar_only =*/ true);
@@ -1016,7 +1004,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
} break;
default: {
GGML_ABORT("ggml op not implemented: %s", ggml_op_name(tensor->op));
split_state = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
split_state = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
} break;
}
if (split_state.axis >= 0 && split_state.axis < GGML_MAX_DIMS) {
@@ -1034,23 +1022,25 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
split_state.ne[s*n_bufs + j] = 0;
}
for (size_t s = 0; s < src_ss[i].n_segments; s++) {
split_state.ne[j] += src_ss[i].ne[s*n_bufs + j];
split_state.ne[j] += src_ss[i].ne[s*n_bufs + j] * src_ss[i].nr[s];
}
split_state.ne[j] *= tensor->ne[split_state.axis];
if (split_state.ne[j] != 0 || tensor->src[i]->ne[src_ss[i].axis] != 0) {
GGML_ASSERT(split_state.ne[j] % tensor->src[i]->ne[src_ss[i].axis] == 0);
split_state.ne[j] /= tensor->src[i]->ne[src_ss[i].axis];
const int64_t div = tensor->src[i]->ne[src_ss[i].axis] * split_state.nr[0];
GGML_ASSERT(split_state.ne[j] % div == 0);
split_state.ne[j] /= div;
}
}
} else {
GGML_ASSERT(split_state.n_segments == 1);
for (size_t j = 0; j < n_bufs; j++) {
// Assert that ratio is consistent:
int64_t sum = 0;
for (size_t s = 0; s < src_ss[i].n_segments; s++) {
sum += src_ss[i].ne[s*n_bufs + j];
sum += src_ss[i].ne[s*n_bufs + j] * src_ss[i].nr[s];
}
// Assert that ratio is consistent:
GGML_ASSERT(split_state.ne[j] * tensor->src[i]->ne[src_ss[i].axis]
== sum * tensor->ne[split_state.axis]);
GGML_ASSERT(split_state.ne[j]*split_state.nr[0] * tensor->src[i]->ne[src_ss[i].axis]
== sum * tensor->ne[split_state.axis]);
}
}
first_src_split_by_axis = false;
@@ -1080,13 +1070,14 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
srcs_info += ", ";
}
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor->src[0], true);
GGML_ASSERT(split_state.n_segments == 1);
const char * axis_name = ggml_backend_meta_split_axis_name(split_state.axis);
std::string ne_info;
for (size_t j = 0; j < n_bufs; j++) {
if (!ne_info.empty()) {
ne_info += ", ";
}
ne_info += std::to_string(split_state.ne[j]);
ne_info += std::to_string(split_state.ne[j]) + "x" + std::to_string(split_state.nr[0]);
}
srcs_info += std::string(tensor->src[i]->name) + "[" + ggml_op_name(tensor->src[i]->op) + ", " + axis_name + ", {" + ne_info + "}]";
}
@@ -1095,7 +1086,8 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
if (!ne_info.empty()) {
ne_info += ", ";
}
ne_info += std::to_string(buf_ctx->split_state_cache[key].first.ne[j]);
const ggml_backend_meta_split_state & ss = buf_ctx->split_state_cache[key].first;
ne_info += std::to_string(ss.ne[j]) + "x" + std::to_string(ss.nr[0]);
}
GGML_LOG_DEBUG("SPLIT_STATE: {%s} -> %s[%s, %s, {%s}]\n", srcs_info.c_str(), tensor->name, ggml_op_name(tensor->op),
ggml_backend_meta_split_axis_name(buf_ctx->split_state_cache[key].first.axis), ne_info.c_str());
@@ -1107,8 +1099,10 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
#ifndef NDEBUG
if (ret.axis >= 0 && ret.axis < GGML_MAX_DIMS) {
int64_t ne_ret = 0;
for (size_t sj = 0; sj < ret.n_segments*n_bufs; sj++) {
ne_ret += ret.ne[sj];
for (size_t s = 0; s < ret.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
ne_ret += ret.ne[s*n_bufs + j] * ret.nr[s];
}
}
assert(ne_ret == tensor->ne[int(ret.axis)]);
}
@@ -1155,7 +1149,7 @@ static enum ggml_status ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_m
// GGML_ASSERT(ggml_is_contiguously_allocated(tensor));
ne[split_dim] = 0;
for (size_t s = 0; s < split_state.n_segments; s++) {
ne[split_dim] += split_state.ne[s*n_simple_bufs + j];
ne[split_dim] += split_state.ne[s*n_simple_bufs + j] * split_state.nr[s];
}
for (int i = 0; i < GGML_MAX_DIMS; i++) {
if (tensor->nb[i] > tensor->nb[split_dim]) {
@@ -1229,7 +1223,7 @@ static enum ggml_status ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_m
for (size_t j = 0; j < n_simple_bufs; j++) {
int64_t ne_sum = 0;
for (size_t s = 0; s < split_state_src.n_segments; s++) {
ne_sum += split_state_src.ne[s*n_simple_bufs + j];
ne_sum += split_state_src.ne[s*n_simple_bufs + j] * split_state_src.nr[s];
}
if (ne_sum == 0) {
simple_tensors[j]->flags &= ~GGML_TENSOR_FLAG_COMPUTE;
@@ -1255,8 +1249,9 @@ static void ggml_backend_meta_buffer_set_tensor(ggml_backend_buffer_t buffer, gg
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor, /*assume_sync =*/ false);
if (split_state.n_segments != 1) {
if (split_state.n_segments != 1 || split_state.nr[0] != 1) {
GGML_ASSERT(split_state.axis >= 0 && split_state.axis < GGML_MAX_DIMS);
GGML_ASSERT(split_state.nr[0] != 0);
GGML_ASSERT(tensor->ne[3] == 1);
size_t offset_data = 0;
@@ -1267,24 +1262,26 @@ static void ggml_backend_meta_buffer_set_tensor(ggml_backend_buffer_t buffer, gg
const size_t row_stride = tensor->nb[1];
GGML_ASSERT(offset % row_stride == 0);
GGML_ASSERT(size % row_stride == 0);
const int64_t r_start = offset / row_stride;
const int64_t r_count = size / row_stride;
GGML_ASSERT(r_start + r_count <= tensor->ne[1]);
const int64_t row_start = offset / row_stride;
const int64_t row_count = size / row_stride;
GGML_ASSERT(row_start + row_count <= tensor->ne[1]);
const int64_t blck_size = ggml_blck_size(tensor->type);
for (size_t s = 0; s < split_state.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
GGML_ASSERT(split_state.ne[s*n_bufs + j] % blck_size == 0);
const size_t nbytes = split_state.ne[s*n_bufs + j]/blck_size * tensor->nb[0];
ggml_backend_tensor_set_2d(simple_tensor, (const char *) data + offset_data,
simple_offsets[j] + r_start * simple_tensor->nb[1], nbytes,
r_count, simple_tensor->nb[1], tensor->nb[1]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
for (size_t r = 0; r < split_state.nr[s]; r++) {
for (size_t j = 0; j < n_bufs; j++) {
ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
GGML_ASSERT(split_state.ne[s*n_bufs + j] % blck_size == 0);
const size_t nbytes = split_state.ne[s*n_bufs + j]/blck_size * tensor->nb[0];
ggml_backend_tensor_set_2d(simple_tensor, (const char *) data + offset_data,
simple_offsets[j] + row_start * simple_tensor->nb[1], nbytes,
row_count, simple_tensor->nb[1], tensor->nb[1]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
}
}
}
GGML_ASSERT(offset_data*r_count == size);
GGML_ASSERT(offset_data*row_count == size);
return;
}
GGML_ASSERT(split_state.axis == GGML_BACKEND_SPLIT_AXIS_1);
@@ -1292,22 +1289,24 @@ static void ggml_backend_meta_buffer_set_tensor(ggml_backend_buffer_t buffer, gg
const size_t row_stride = tensor->nb[2];
GGML_ASSERT(offset % row_stride == 0);
GGML_ASSERT(size % row_stride == 0);
const int64_t r_start = offset / row_stride;
const int64_t r_count = size / row_stride;
GGML_ASSERT(r_start + r_count <= tensor->ne[2]);
const int64_t row_start = offset / row_stride;
const int64_t row_count = size / row_stride;
GGML_ASSERT(row_start + row_count <= tensor->ne[2]);
for (size_t s = 0; s < split_state.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t nbytes = split_state.ne[s*n_bufs + j] * tensor->nb[1];
ggml_backend_tensor_set_2d(simple_tensor, (const char *) data + offset_data,
simple_offsets[j] + r_start * simple_tensor->nb[2], nbytes,
r_count, simple_tensor->nb[2], tensor->nb[2]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
for (size_t r = 0; r < split_state.nr[s]; r++) {
for (size_t j = 0; j < n_bufs; j++) {
ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t nbytes = split_state.ne[s*n_bufs + j] * tensor->nb[1];
ggml_backend_tensor_set_2d(simple_tensor, (const char *) data + offset_data,
simple_offsets[j] + row_start * simple_tensor->nb[2], nbytes,
row_count, simple_tensor->nb[2], tensor->nb[2]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
}
}
}
GGML_ASSERT(offset_data*r_count == size);
GGML_ASSERT(offset_data*row_count == size);
return;
}
@@ -1365,8 +1364,9 @@ static void ggml_backend_meta_buffer_get_tensor(ggml_backend_buffer_t buffer, co
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor, /*assume_sync =*/ false);
if (split_state.n_segments != 1) {
if (split_state.n_segments != 1 || split_state.nr[0] != 1) {
GGML_ASSERT(split_state.axis >= 0 && split_state.axis < GGML_MAX_DIMS);
GGML_ASSERT(split_state.nr[0] != 0);
GGML_ASSERT(tensor->ne[3] == 1);
size_t offset_data = 0;
@@ -1377,24 +1377,26 @@ static void ggml_backend_meta_buffer_get_tensor(ggml_backend_buffer_t buffer, co
const size_t row_stride = tensor->nb[1];
GGML_ASSERT(offset % row_stride == 0);
GGML_ASSERT(size % row_stride == 0);
const int64_t r_start = offset / row_stride;
const int64_t r_count = size / row_stride;
GGML_ASSERT(r_start + r_count <= tensor->ne[1]);
const int64_t row_start = offset / row_stride;
const int64_t row_count = size / row_stride;
GGML_ASSERT(row_start + row_count <= tensor->ne[1]);
const int64_t blck_size = ggml_blck_size(tensor->type);
for (size_t s = 0; s < split_state.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
const ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
GGML_ASSERT(split_state.ne[s*n_bufs + j] % blck_size == 0);
const size_t nbytes = split_state.ne[s*n_bufs + j]/blck_size * tensor->nb[0];
ggml_backend_tensor_get_2d(simple_tensor, (char *) data + offset_data,
simple_offsets[j] + r_start * simple_tensor->nb[1], nbytes,
r_count, simple_tensor->nb[1], tensor->nb[1]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
for (size_t r = 0; r < split_state.nr[s]; r++) {
for (size_t j = 0; j < n_bufs; j++) {
const ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
GGML_ASSERT(split_state.ne[s*n_bufs + j] % blck_size == 0);
const size_t nbytes = split_state.ne[s*n_bufs + j]/blck_size * tensor->nb[0];
ggml_backend_tensor_get_2d(simple_tensor, (char *) data + offset_data,
simple_offsets[j] + row_start * simple_tensor->nb[1], nbytes,
row_count, simple_tensor->nb[1], tensor->nb[1]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
}
}
}
GGML_ASSERT(offset_data*r_count == size);
GGML_ASSERT(offset_data*row_count == size);
return;
}
GGML_ASSERT(split_state.axis == GGML_BACKEND_SPLIT_AXIS_1);
@@ -1402,22 +1404,24 @@ static void ggml_backend_meta_buffer_get_tensor(ggml_backend_buffer_t buffer, co
const size_t row_stride = tensor->nb[2];
GGML_ASSERT(offset % row_stride == 0);
GGML_ASSERT(size % row_stride == 0);
const int64_t r_start = offset / row_stride;
const int64_t r_count = size / row_stride;
GGML_ASSERT(r_start + r_count <= tensor->ne[2]);
const int64_t row_start = offset / row_stride;
const int64_t row_count = size / row_stride;
GGML_ASSERT(row_start + row_count <= tensor->ne[2]);
for (size_t s = 0; s < split_state.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
const ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t nbytes = split_state.ne[s*n_bufs + j] * tensor->nb[1];
ggml_backend_tensor_get_2d(simple_tensor, (char *) data + offset_data,
simple_offsets[j] + r_start * simple_tensor->nb[2], nbytes,
r_count, simple_tensor->nb[2], tensor->nb[2]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
for (size_t r = 0; r < split_state.nr[s]; r++) {
for (size_t j = 0; j < n_bufs; j++) {
const ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t nbytes = split_state.ne[s*n_bufs + j] * tensor->nb[1];
ggml_backend_tensor_get_2d(simple_tensor, (char *) data + offset_data,
simple_offsets[j] + row_start * simple_tensor->nb[2], nbytes,
row_count, simple_tensor->nb[2], tensor->nb[2]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
}
}
}
GGML_ASSERT(offset_data*r_count == size);
GGML_ASSERT(offset_data*row_count == size);
return;
}
@@ -1675,6 +1679,7 @@ static void ggml_backend_meta_set_tensor_async(ggml_backend_t backend, ggml_tens
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor, /*assume_sync =*/ false);
GGML_ASSERT(split_state.n_segments == 1);
GGML_ASSERT(split_state.nr[0] == 1);
switch (split_state.axis) {
case GGML_BACKEND_SPLIT_AXIS_0:
@@ -1719,6 +1724,7 @@ static void ggml_backend_meta_get_tensor_async(ggml_backend_t backend, const ggm
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor, /*assume_sync =*/ false);
GGML_ASSERT(split_state.n_segments == 1);
GGML_ASSERT(split_state.nr[0] == 1);
switch (split_state.axis) {
case GGML_BACKEND_SPLIT_AXIS_0:
@@ -2076,6 +2082,7 @@ static enum ggml_status ggml_backend_meta_graph_compute(ggml_backend_t backend,
node_zero->src[0] = node;
ggml_set_op_params_f32(node_zero, 0, 0.0f);
node_zero->data = node->data;
node_zero->buffer = node->buffer;
node_zero->flags |= GGML_TENSOR_FLAG_COMPUTE;
step_cgraphs[j] = get_cgraph_aux();
+151
View File
@@ -977,6 +977,35 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
sumf = hsum_float_8(acc);
*s = sumf;
#elif defined(__loongarch_sx)
__m128 acc = (__m128)__lsx_vldi(0);
for (; ib < nb; ++ib) {
const float d = GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d);
const __m128i qx_0 = __lsx_vld((const __m128i *)x[ib].qs, 0);
const __m128i qx_1 = __lsx_vld((const __m128i *)x[ib].qs + 1, 0);
const __m128i qy_0 = __lsx_vld((const __m128i *)y[ib].qs, 0);
const __m128i qy_1 = __lsx_vld((const __m128i *)y[ib].qs + 1, 0);
const __m128i p16_0 = lsx_maddubs_h(qx_0, qy_0);
const __m128i p16_1 = lsx_maddubs_h(qx_1, qy_1);
// Sum int16 pairs → int32
const __m128i s_0 = __lsx_vaddwev_w_h(p16_0, p16_1);
const __m128i s_1 = __lsx_vaddwod_w_h(p16_0, p16_1);
const __m128 q = __lsx_vffint_s_w(__lsx_vadd_w(s_0, s_1));
acc = __lsx_vfmadd_s(__lsx_vreplfr2vr_s(d), q, acc);
}
__m128 res = lsx_hadd_s(acc, acc);
res = lsx_hadd_s(res, res);
sumf = ((v4f32)res)[0];
*s = sumf;
#else
UNUSED(nb);
UNUSED(ib);
@@ -1443,6 +1472,99 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
*s = hsum_float_8(acc);
#elif defined(__loongarch_sx)
const __m128i m32s = __lsx_vreplgr2vr_b(32);
__m128 acc_0 = (__m128)__lsx_vldi(0);
__m128 acc_1 = (__m128)__lsx_vldi(0);
for (int i = 0; i < nb; ++i) {
const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
const uint8_t * GGML_RESTRICT q4 = x[i].ql;
const uint8_t * GGML_RESTRICT qh = x[i].qh;
const int8_t * GGML_RESTRICT q8 = y[i].qs;
const __m128i scale_i8 = __lsx_vld(x[i].scales, 0);
const __m128i scales_lo = __lsx_vsllwil_h_b(scale_i8, 0);
const __m128i scales_hi = __lsx_vsllwil_h_b(__lsx_vbsrl_v(scale_i8, 8), 0);
__m128i sumi_0 = __lsx_vldi(0);
__m128i sumi_1 = __lsx_vldi(0);
for (int j = 0; j < QK_K/128; ++j) {
const __m128i q4bitsH_0 = __lsx_vld((const __m128i*)qh, 0); qh += 16;
const __m128i q4bitsH_1 = __lsx_vld((const __m128i*)qh, 0); qh += 16;
const __m128i q4h_0 = __lsx_vslli_b(__lsx_vandi_b(q4bitsH_0, 3), 4);
const __m128i q4h_1 = __lsx_vslli_b(__lsx_vandi_b(q4bitsH_1, 3), 4);
const __m128i q4h_2 = __lsx_vslli_b(__lsx_vandi_b(q4bitsH_0, 3 << 2), 2);
const __m128i q4h_3 = __lsx_vslli_b(__lsx_vandi_b(q4bitsH_1, 3 << 2), 2);
const __m128i q4h_4 = __lsx_vandi_b(q4bitsH_0, 3 << 4);
const __m128i q4h_5 = __lsx_vandi_b(q4bitsH_1, 3 << 4);
const __m128i q4h_6 = __lsx_vsrli_b(__lsx_vandi_b(q4bitsH_0, 3 << 6), 2);
const __m128i q4h_7 = __lsx_vsrli_b(__lsx_vandi_b(q4bitsH_1, 3 << 6), 2);
const __m128i q4bits1_0 = __lsx_vld((const __m128i*)q4, 0); q4 += 16;
const __m128i q4bits1_1 = __lsx_vld((const __m128i*)q4, 0); q4 += 16;
const __m128i q4bits2_0 = __lsx_vld((const __m128i*)q4, 0); q4 += 16;
const __m128i q4bits2_1 = __lsx_vld((const __m128i*)q4, 0); q4 += 16;
const __m128i q4_0 = __lsx_vor_v(__lsx_vandi_b(q4bits1_0, 0xf), q4h_0);
const __m128i q4_1 = __lsx_vor_v(__lsx_vandi_b(q4bits1_1, 0xf), q4h_1);
const __m128i q4_2 = __lsx_vor_v(__lsx_vandi_b(q4bits2_0, 0xf), q4h_2);
const __m128i q4_3 = __lsx_vor_v(__lsx_vandi_b(q4bits2_1, 0xf), q4h_3);
const __m128i q4_4 = __lsx_vor_v(__lsx_vsrli_b(q4bits1_0, 4), q4h_4);
const __m128i q4_5 = __lsx_vor_v(__lsx_vsrli_b(q4bits1_1, 4), q4h_5);
const __m128i q4_6 = __lsx_vor_v(__lsx_vsrli_b(q4bits2_0, 4), q4h_6);
const __m128i q4_7 = __lsx_vor_v(__lsx_vsrli_b(q4bits2_1, 4), q4h_7);
const __m128i q8_0 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_1 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_2 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_3 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_4 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_5 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_6 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_7 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
__m128i p16_0 = lsx_maddubs_h(__lsx_vsub_b(q4_0, m32s), q8_0);
__m128i p16_1 = lsx_maddubs_h(__lsx_vsub_b(q4_1, m32s), q8_1);
__m128i p16_2 = lsx_maddubs_h(__lsx_vsub_b(q4_2, m32s), q8_2);
__m128i p16_3 = lsx_maddubs_h(__lsx_vsub_b(q4_3, m32s), q8_3);
__m128i p16_4 = lsx_maddubs_h(__lsx_vsub_b(q4_4, m32s), q8_4);
__m128i p16_5 = lsx_maddubs_h(__lsx_vsub_b(q4_5, m32s), q8_5);
__m128i p16_6 = lsx_maddubs_h(__lsx_vsub_b(q4_6, m32s), q8_6);
__m128i p16_7 = lsx_maddubs_h(__lsx_vsub_b(q4_7, m32s), q8_7);
const __m128i sc_vec = j == 0 ? scales_lo : scales_hi;
p16_0 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 0), p16_0);
p16_1 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 1), p16_1);
p16_2 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 2), p16_2);
p16_3 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 3), p16_3);
p16_4 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 4), p16_4);
p16_5 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 5), p16_5);
p16_6 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 6), p16_6);
p16_7 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 7), p16_7);
sumi_0 = __lsx_vadd_w(sumi_0, __lsx_vadd_w(p16_0, p16_2));
sumi_1 = __lsx_vadd_w(sumi_1, __lsx_vadd_w(p16_1, p16_3));
sumi_0 = __lsx_vadd_w(sumi_0, __lsx_vadd_w(p16_4, p16_6));
sumi_1 = __lsx_vadd_w(sumi_1, __lsx_vadd_w(p16_5, p16_7));
}
__m128 p_0 = __lsx_vfmul_s(__lsx_vreplfr2vr_s(d), __lsx_vffint_s_w(sumi_0));
__m128 p_1 = __lsx_vfmul_s(__lsx_vreplfr2vr_s(d), __lsx_vffint_s_w(sumi_1));
acc_0 = __lsx_vfadd_s(p_0, acc_0);
acc_1 = __lsx_vfadd_s(p_1, acc_1);
}
*s = hsum_float_4x4(acc_0, acc_1, (__m128)__lsx_vldi(0), (__m128)__lsx_vldi(0));
#else
UNUSED(x);
UNUSED(y);
@@ -2149,6 +2271,35 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
*s = hsum_float_8(accum);
#elif defined(__loongarch_sx)
const __m128i values128 = __lsx_vld((const __m128i*)kvalues_iq4nl, 0);
__m128 accum = (__m128)__lsx_vldi(0);
for (int ibl = 0; ibl < nb; ++ibl) {
const uint8_t * qs = x[ibl].qs;
const int8_t * q8 = y[ibl].qs;
uint16_t sh = x[ibl].scales_h;
__m128i sumi = __lsx_vldi(0);
for (int ib = 0; ib < QK_K/32; ++ib) {
const __m128i q4bits = __lsx_vld((const __m128i*)qs, 0); qs += 16;
const __m128i q8b_0 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8b_1 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q4b_0 = __lsx_vshuf_b(values128, values128, __lsx_vandi_b(q4bits, 0xf));
const __m128i q4b_1 = __lsx_vshuf_b(values128, values128, __lsx_vsrli_b(q4bits, 4));
const __m128i p16_0 = lsx_maddubs_h(q4b_0, q8b_0);
const __m128i p16_1 = lsx_maddubs_h(q4b_1, q8b_1);
const int16_t ls = (((x[ibl].scales_l[ib/2] >> ((ib & 1) * 4)) & 0xf) | ((sh & 0x3) << 4)) - 32;
sh >>= 2;
sumi = __lsx_vadd_w(lsx_madd_h(p16_0, __lsx_vreplgr2vr_h(ls)), sumi);
sumi = __lsx_vadd_w(lsx_madd_h(p16_1, __lsx_vreplgr2vr_h(ls)), sumi);
}
const float ds = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
accum = __lsx_vfadd_s(__lsx_vfmul_s(__lsx_vreplfr2vr_s(ds), __lsx_vffint_s_w(sumi)), accum);
}
*s = ((v4f32)lsx_hadd_s(lsx_hadd_s(accum, accum), lsx_hadd_s(accum, accum)))[0];
#else
UNUSED(x);
UNUSED(y);
+35 -1
View File
@@ -2235,8 +2235,42 @@ static void ggml_compute_forward_fill_f32(const ggml_compute_params * params, gg
}
}
static void ggml_compute_forward_fill_f16(const ggml_compute_params * params, ggml_tensor * dst) {
const ggml_fp16_t c = GGML_CPU_FP32_TO_FP16(ggml_get_op_params_f32(dst, 0));
GGML_TENSOR_LOCALS(int64_t, ne, dst, ne);
GGML_TENSOR_LOCALS(size_t, nb, dst, nb);
const auto [ir0, ir1] = get_thread_range(params, dst);
for (int64_t ir = ir0; ir < ir1; ++ir) {
const int64_t i03 = ir/(ne2*ne1);
const int64_t i02 = (ir - i03*ne2*ne1)/ne1;
const int64_t i01 = (ir - i03*ne2*ne1 - i02*ne1);
ggml_fp16_t * dst_ptr = (ggml_fp16_t *) ((char *) dst->data + i03*nb3 + i02*nb2 + i01*nb1);
ggml_vec_set_f16(ne0, dst_ptr, c);
}
}
void ggml_compute_forward_fill(const ggml_compute_params * params, ggml_tensor * dst) {
ggml_compute_forward_fill_f32(params, dst);
const ggml_tensor * src0 = dst->src[0];
switch (src0->type) {
case GGML_TYPE_F32:
{
ggml_compute_forward_fill_f32(params, dst);
} break;
case GGML_TYPE_F16:
{
ggml_compute_forward_fill_f16(params, dst);
} break;
default:
{
GGML_ABORT("unsupported type for ggml_compute_forward_fill: %s", ggml_type_name(src0->type));
}
}
}
// ggml_compute_tri
+3 -16
View File
@@ -1125,25 +1125,12 @@ static inline void __lasx_f32cx8_store(ggml_fp16_t * x, __m256 y) {
#define GGML_F16_EPR 4
static inline __m128 __lsx_f16x4_load(const ggml_fp16_t * x) {
float tmp[4];
tmp[0] = GGML_CPU_FP16_TO_FP32(x[0]);
tmp[1] = GGML_CPU_FP16_TO_FP32(x[1]);
tmp[2] = GGML_CPU_FP16_TO_FP32(x[2]);
tmp[3] = GGML_CPU_FP16_TO_FP32(x[3]);
return (__m128)__lsx_vld(tmp, 0);
return __lsx_vfcvtl_s_h(__lsx_vld((const void *)x, 0));
}
static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
float arr[4];
__lsx_vst(y, arr, 0);
x[0] = GGML_CPU_FP32_TO_FP16(arr[0]);
x[1] = GGML_CPU_FP32_TO_FP16(arr[1]);
x[2] = GGML_CPU_FP32_TO_FP16(arr[2]);
x[3] = GGML_CPU_FP32_TO_FP16(arr[3]);
__m128i a = __lsx_vfcvt_h_s(y, y);
memcpy(x, &a, sizeof(ggml_fp16_t) * 4);
}
#define GGML_F32Cx4 __m128
+58 -2
View File
@@ -7,6 +7,7 @@
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <mutex>
#if defined(GGML_USE_HIP)
#define GGML_COMMON_DECL_HIP
@@ -1552,6 +1553,62 @@ struct ggml_cuda_pdl_config {
ggml_cuda_pdl_config& operator=(ggml_cuda_pdl_config&&) = delete;
};
static bool ggml_cuda_kernel_can_use_pdl(const void * kernel) {
const int device = ggml_cuda_get_device();
struct cache_key {
int device;
const void * kernel;
bool operator==(const cache_key & other) const { return device == other.device && kernel == other.kernel; }
};
struct cache_key_hash {
// MurmurHash3 mixing function for better hash distribution (vs. just std::hash which in some implementations simply returns the identity)
static size_t hash_mix(size_t x) {
std::uint64_t y = x;
const std::uint64_t m = 0xe9846af9b1a615d;
y ^= y >> 32;
y *= m;
y ^= y >> 32;
y *= m;
y ^= y >> 28;
return static_cast<size_t>(y);
}
size_t operator()(const cache_key & key) const {
// Use a nonzero seed to avoid mapping all-zero keys to zero
size_t h = 42;
h = hash_mix(h + key.device);
h = hash_mix(h + reinterpret_cast<size_t>(key.kernel));
return h;
}
};
static std::mutex cache_mutex;
static std::unordered_map<cache_key, bool, cache_key_hash> cache;
const cache_key key = { device, kernel };
std::lock_guard<std::mutex> lock(cache_mutex);
const auto it = cache.find(key);
if (it != cache.end()) {
return it->second;
}
cudaFuncAttributes attr = {};
CUDA_CHECK(cudaFuncGetAttributes(&attr, kernel));
// PDL device-side primitives are emitted only for PTX versions >= 90.
// We have to guard on a loaded kernel's PTX version so a kernel forward-JIT'ed
// from pre-Hopper PTX to a Hopper-or-newer GPU does not opt into PDL.
const bool can_use_pdl = attr.ptxVersion >= 90;
cache.emplace(key, can_use_pdl);
return can_use_pdl;
}
#endif //defined(GGML_CUDA_USE_PDL)
@@ -1564,8 +1621,7 @@ static __inline__ void ggml_cuda_kernel_launch(Kernel kernel, const ggml_cuda_ke
return env == nullptr || std::atoi(env) != 0;
}();
const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
if (env_pdl_enabled && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_HOPPER) {
if (env_pdl_enabled && ggml_cuda_kernel_can_use_pdl(reinterpret_cast<const void *>(kernel))) {
auto pdl_cfg = ggml_cuda_pdl_config(launch_params);
CUDA_CHECK(cudaLaunchKernelEx(&pdl_cfg.cfg, kernel, std::forward<Args>(args)... ));
+2 -2
View File
@@ -1153,8 +1153,8 @@ void launch_fattn(
GGML_ASSERT(block_dim.x % warp_size == 0);
const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(blocks_num, block_dim, nbytes_shared, main_stream);
ggml_cuda_kernel_launch(fattn_kernel, launch_params,
// disabled PDL enrollment for now due to a compiler bug.
fattn_kernel<<<blocks_num, block_dim, nbytes_shared, main_stream>>>(
(const char *) Q->data,
K_data,
V_data,
+10 -2
View File
@@ -2570,6 +2570,7 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
use_mul_mat_q = use_mul_mat_q && ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[1], /*n_experts=*/0);
use_mul_mat_f = use_mul_mat_f && ggml_cuda_should_use_mmf(src0->type, cc, warp_size, src0->ne, src0->nb, src1->ne[1], /*mul_mat_id=*/false);
use_mul_mat_vec_f = use_mul_mat_vec_f && ggml_cuda_should_use_mmvf(src0->type, cc, src0->ne, src0->nb, src1->ne[1]);
use_mul_mat_vec_q = use_mul_mat_vec_q && ggml_cuda_should_use_mmvq(src0->type, cc, src1->ne[1]);
any_gpus_with_slow_fp16 = any_gpus_with_slow_fp16 || !fast_fp16_hardware_available(cc);
}
} else {
@@ -2578,6 +2579,7 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
use_mul_mat_q = use_mul_mat_q && ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[1], /*n_experts=*/0);
use_mul_mat_f = use_mul_mat_f && ggml_cuda_should_use_mmf(src0->type, cc, warp_size, src0->ne, src0->nb, src1->ne[1], /*mul_mat_id=*/false);
use_mul_mat_vec_f = use_mul_mat_vec_f && ggml_cuda_should_use_mmvf(src0->type, cc, src0->ne, src0->nb, src1->ne[1]);
use_mul_mat_vec_q = use_mul_mat_vec_q && ggml_cuda_should_use_mmvq(src0->type, cc, src1->ne[1]);
any_gpus_with_slow_fp16 = any_gpus_with_slow_fp16 || !fast_fp16_hardware_available(cc);
}
@@ -4992,8 +4994,14 @@ static void ggml_backend_cuda_device_get_memory(ggml_backend_dev_t dev, size_t *
}
static enum ggml_backend_dev_type ggml_backend_cuda_device_get_type(ggml_backend_dev_t dev) {
GGML_UNUSED(dev);
return GGML_BACKEND_DEVICE_TYPE_GPU;
ggml_backend_cuda_device_context * ctx = (ggml_backend_cuda_device_context *) dev->context;
cudaDeviceProp prop;
CUDA_CHECK(cudaGetDeviceProperties(&prop, ctx->device));
return prop.integrated
? GGML_BACKEND_DEVICE_TYPE_IGPU
: GGML_BACKEND_DEVICE_TYPE_GPU;
}
static void ggml_backend_cuda_device_get_props(ggml_backend_dev_t dev, ggml_backend_dev_props * props) {
+81 -1
View File
@@ -63,6 +63,7 @@ static constexpr __host__ __device__ int get_vdr_mmvq(ggml_type type) {
enum mmvq_parameter_table_id {
MMVQ_PARAMETERS_GENERIC = 0,
MMVQ_PARAMETERS_TURING,
MMVQ_PARAMETERS_GCN,
MMVQ_PARAMETERS_RDNA2,
MMVQ_PARAMETERS_RDNA3_0,
@@ -78,6 +79,8 @@ static constexpr __device__ mmvq_parameter_table_id get_device_table_id() {
return MMVQ_PARAMETERS_RDNA2;
#elif defined(GCN) || defined(CDNA)
return MMVQ_PARAMETERS_GCN;
#elif defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= GGML_CUDA_CC_TURING && __CUDA_ARCH__ < GGML_CUDA_CC_AMPERE
return MMVQ_PARAMETERS_TURING;
#else
return MMVQ_PARAMETERS_GENERIC;
#endif
@@ -96,6 +99,9 @@ static __host__ mmvq_parameter_table_id get_device_table_id(int cc) {
if (GGML_CUDA_CC_IS_GCN(cc) || GGML_CUDA_CC_IS_CDNA(cc)) {
return MMVQ_PARAMETERS_GCN;
}
if (GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_TURING && ggml_cuda_highest_compiled_arch(cc) < GGML_CUDA_CC_AMPERE) {
return MMVQ_PARAMETERS_TURING;
}
return MMVQ_PARAMETERS_GENERIC;
}
@@ -271,6 +277,53 @@ int get_mmvq_mmid_max_batch(ggml_type type, int cc) {
return MMVQ_MAX_BATCH_SIZE;
}
bool ggml_cuda_should_use_mmvq(enum ggml_type type, int cc, int64_t ne11) {
if (GGML_CUDA_CC_IS_CDNA(cc)) {
if (GGML_CUDA_CC_IS_CDNA1(cc)) {
switch (type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
return ne11 <= 7;
case GGML_TYPE_Q5_1:
return ne11 <= 7;
case GGML_TYPE_Q8_0:
return ne11 <= 6;
case GGML_TYPE_Q2_K:
return ne11 <= 4;
case GGML_TYPE_Q3_K:
return ne11 <= 3;
case GGML_TYPE_Q4_K:
return ne11 <= 2;
case GGML_TYPE_Q5_K:
return ne11 <= 3;
case GGML_TYPE_Q6_K:
return ne11 <= 4;
case GGML_TYPE_IQ1_S:
return ne11 <= 5;
case GGML_TYPE_IQ2_XXS:
case GGML_TYPE_IQ3_S:
case GGML_TYPE_IQ4_XS:
return ne11 <= 6;
default:
return ne11 <= MMVQ_MAX_BATCH_SIZE;
}
}
switch (type) { // tuned for CDNA2
case GGML_TYPE_Q2_K:
return ne11 <= 5;
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
return ne11 <= 3;
case GGML_TYPE_Q6_K:
return ne11 <= 5;
default:
return ne11 <= MMVQ_MAX_BATCH_SIZE;
}
}
return ne11 <= MMVQ_MAX_BATCH_SIZE;
}
// Device constexpr: returns the max batch size for the current arch+type at compile time.
template <ggml_type type>
static constexpr __device__ int get_mmvq_mmid_max_batch_for_device() {
@@ -370,11 +423,38 @@ static constexpr __host__ __device__ int calc_nwarps(ggml_type type, int ncols_d
}
return 1;
}
if (table_id == MMVQ_PARAMETERS_TURING) {
if (ncols_dst == 1) {
switch (type) {
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
case GGML_TYPE_Q6_K:
return 2;
default:
return 4;
}
}
switch (ncols_dst) {
case 2:
case 3:
case 4:
return 4;
case 5:
case 6:
case 7:
case 8:
return 2;
default:
return 1;
}
}
return 1;
}
static constexpr __host__ __device__ int calc_rows_per_block(int ncols_dst, int table_id, bool small_k = false, int nwarps = 1) {
if (table_id == MMVQ_PARAMETERS_GENERIC || table_id == MMVQ_PARAMETERS_GCN) {
if (table_id == MMVQ_PARAMETERS_GENERIC || table_id == MMVQ_PARAMETERS_GCN || table_id == MMVQ_PARAMETERS_TURING) {
switch (ncols_dst) {
case 1:
return small_k ? nwarps : 1;
+2
View File
@@ -2,6 +2,8 @@
#define MMVQ_MAX_BATCH_SIZE 8 // Max. batch size for which to use MMVQ kernels.
bool ggml_cuda_should_use_mmvq(enum ggml_type type, int cc, int64_t ne11);
// Returns the maximum batch size for which MMVQ should be used for MUL_MAT_ID,
// based on the quantization type and GPU architecture (compute capability).
int get_mmvq_mmid_max_batch(ggml_type type, int cc);
+66 -77
View File
@@ -39,7 +39,7 @@
#include "ggml-hexagon.h"
#include "ggml-impl.h"
#include "ggml-quants.h"
#include "op-desc.h"
#include "htp-opnode.h"
#include "htp-ops.h"
#include "htp_iface.h"
#include "htp-drv.h"
@@ -102,23 +102,23 @@ static const char * status_to_str(uint32_t status) {
// ** debug helpers
static void ggml_hexagon_dump_op_exec(const std::string &sess_name, const ggml_tensor * op, const uint32_t req_flags) {
static void ggml_hexagon_dump_op_exec(const std::string &sess_name, const htp_opnode & node, const uint32_t req_flags) {
if (!opt_verbose) return;
op_desc desc(op);
htp_opformat fmt(node);
GGML_LOG_DEBUG("ggml-hex: %s execute-op %s: %s : %s : %s : %s : %s : flags 0x%x\n", sess_name.c_str(),
ggml_op_desc(op), desc.names, desc.dims, desc.types, desc.strides, desc.buffs, req_flags);
node.op_name().c_str(), fmt.names, fmt.dims, fmt.types, fmt.strides, fmt.buffs, req_flags);
}
static void ggml_hexagon_dump_op_supp(const std::string &sess_name, const struct ggml_tensor * op, bool supp) {
if (!opt_verbose) return;
op_desc desc(op);
htp_opformat fmt(htp_opformat(htp_opnode{const_cast<ggml_tensor*>(op), {}, HTP_OP_INVALID}));
GGML_LOG_DEBUG("ggml-hex: %s supports-op %s: %s : %s : %s : %s : %s : %s\n", sess_name.c_str(),
ggml_op_desc(op), desc.names, desc.dims, desc.types, desc.strides, desc.buffs, supp ? "yes" : "no");
ggml_op_desc(op), fmt.names, fmt.dims, fmt.types, fmt.strides, fmt.buffs, supp ? "yes" : "no");
}
static void ggml_hexagon_dump_op_prof(const std::string &sess_name, const ggml_tensor * op,
static void ggml_hexagon_dump_op_prof(const std::string &sess_name, const htp_opnode & node,
uint32_t op_usec, uint32_t op_cycles, const uint32_t pmu[]) {
if (!opt_profile) return;
@@ -129,15 +129,16 @@ static void ggml_hexagon_dump_op_prof(const std::string &sess_name, const ggml_t
pmu[0], pmu[1], pmu[2], pmu[3], pmu[4], pmu[5], pmu[6], pmu[7]);
}
op_desc desc(op);
htp_opformat fmt(node);
GGML_LOG_DEBUG("ggml-hex: %s profile-op %s: %s : %s : %s : %s : usec %u cycles %u%s\n", sess_name.c_str(),
ggml_op_desc(op), desc.names, desc.dims, desc.types, desc.strides, op_usec, op_cycles, pmu_str);
node.op_name().c_str(), fmt.names, fmt.dims, fmt.types, fmt.strides, op_usec, op_cycles, pmu_str);
}
// ** backend sessions
struct ggml_hexagon_opbatch;
struct ggml_hexagon_opqueue;
struct htp_opnode;
struct ggml_hexagon_session {
std::string name;
@@ -167,7 +168,7 @@ struct ggml_hexagon_session {
void allocate(int dev_id) noexcept(false);
void release() noexcept(true);
void enqueue_op(htp_op_code opcode, const ggml_tensor *op);
void enqueue_op(const htp_opnode & node);
void flush(bool all = true);
void flush_pending(bool all = false);
@@ -1782,12 +1783,10 @@ static ggml_backend_buffer_type_i ggml_backend_hexagon_repack_buffer_type_interf
/* .is_host = */ ggml_backend_hexagon_repack_buffer_type_is_host,
};
// Backend session implementation
struct ggml_hexagon_opbatch {
ggml_hexagon_session* sess;
std::vector<const ggml_tensor*> ops; // pointers to original ops
std::vector<htp_opnode> ops; // htp_opnode of ops
std::vector<htp_buf_desc> h_bufs; // htp buffer descriptors
std::vector<htp_tensor> h_tens; // htp tensor descriptors
@@ -1919,7 +1918,7 @@ struct ggml_hexagon_opbatch {
return ti;
}
bool fit_op(const struct ggml_tensor *t) const {
bool fit_op(const htp_opnode & node) const {
if (n_ops >= n_ops_max ) return false;
// check how much extras we will need
@@ -1939,10 +1938,10 @@ struct ggml_hexagon_opbatch {
}
};
for (unsigned int i=0; i < HTP_OP_MAX_INPUTS && t->src[i]; i++) {
fit_tensor(t->src[i]);
for (const auto * src : node.get_inputs()) {
fit_tensor(src);
}
fit_tensor(t);
fit_tensor(node.dst());
if ((extra_bufs + n_bufs) > n_bufs_max) return false;
if ((extra_tens + n_tens) > n_tens_max) return false;
@@ -1952,29 +1951,30 @@ struct ggml_hexagon_opbatch {
}
// assumes that fit_op() was called first and returned true
void add_op(htp_op_code opcode, const struct ggml_tensor * t) {
void add_op(const htp_opnode & node) {
// Add new op
unsigned int n = n_ops++;
GGML_ASSERT(n_ops <= n_ops_max);
ops[n] = t;
ops[n] = node;
htp_op_desc &o = h_ops[n];
memcpy(&o.params, &t->op_params, sizeof(t->op_params));
o.opcode = opcode;
memcpy(&o.params, &node.node->op_params, sizeof(node.node->op_params));
o.opcode = node.opcode;
o.flags = 0;
if (!(opt_opstage & HTP_OPSTAGE_COMPUTE)) {
o.flags |= HTP_OPFLAGS_SKIP_COMPUTE;
}
ggml_hexagon_dump_op_exec(sess->c_name(), t, o.flags);
ggml_hexagon_dump_op_exec(sess->c_name(), node, o.flags);
auto inputs = node.get_inputs();
for (unsigned int i=0; i < HTP_OP_MAX_INPUTS; i++) {
o.src[i] = t->src[i] ? add_tensor(t->src[i]) : 0xffff;
o.src[i] = (i < inputs.size() && inputs[i]) ? add_tensor(inputs[i]) : 0xffff;
}
o.dst = add_tensor(t);
o.dst = add_tensor(node.dst());
}
};
@@ -1983,7 +1983,7 @@ struct ggml_hexagon_opqueue {
ggml_hexagon_shared_buffer *shm_buf;
size_t shm_blk_size;
using opvec = std::vector<const ggml_tensor*>;
using opvec = std::vector<htp_opnode>;
std::queue<unsigned int> done; // completed batch ids
std::vector<opvec> op_cache; // per batch op cache
@@ -2182,11 +2182,11 @@ void ggml_hexagon_session::flush_batch() {
}
}
void ggml_hexagon_session::enqueue_op(htp_op_code opcode, const ggml_tensor *op) {
if (!op_batch->fit_op(op)) {
void ggml_hexagon_session::enqueue_op(const htp_opnode & node) {
if (!op_batch->fit_op(node)) {
flush_batch();
}
op_batch->add_op(opcode, op);
op_batch->add_op(node);
}
// Flush HTP response queue i.e wait for all outstanding requests to complete
@@ -3179,10 +3179,43 @@ static ggml_status ggml_backend_hexagon_graph_compute(ggml_backend_t backend, gg
HEX_VERBOSE("ggml-hex: %s graph-compute n_nodes %d\n", sess->c_name(), graph->n_nodes);
std::vector<htp_opnode> nodes;
nodes.reserve(graph->n_nodes);
// Fusion
for (int i = 0; i < graph->n_nodes; ++i) {
ggml_tensor * n = graph->nodes[i];
if (op_is_compute(n) && (opt_opstage & HTP_OPSTAGE_QUEUE)) {
sess->enqueue_op(op_remap_to_htp(n), n);
if (!op_is_compute(n)) {
continue;
}
ggml_tensor * next_node = (i + 1 < graph->n_nodes) ? graph->nodes[i + 1] : nullptr;
htp_opnode node = {
/*.node =*/ n,
/*.fused =*/ {},
/*.opcode =*/ HTP_OP_INVALID
};
if (n->op == GGML_OP_RMS_NORM && next_node) {
if (next_node->op == GGML_OP_MUL && op_is_compute(next_node) && ggml_can_fuse(graph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL })) {
node.add_fused(next_node);
node.opcode = HTP_OP_RMS_NORM_MUL;
i++; // skip the fused MUL node
}
}
if (node.opcode == HTP_OP_INVALID) {
node.opcode = op_remap_to_htp(n);
}
nodes.push_back(std::move(node));
}
// Queue and execute
if (opt_opstage & HTP_OPSTAGE_QUEUE) {
for (const auto & node : nodes) {
sess->enqueue_op(node);
}
}
@@ -3201,51 +3234,7 @@ static void ggml_backend_hexagon_synchronize(ggml_backend_t backend) {
sess->flush();
}
struct node_info {
ggml_tensor * node;
std::vector<ggml_tensor *> fused;
ggml_op op() const {
return node->op;
}
const ggml_tensor * dst() const {
return fused.empty() ? node : fused.back();
}
const ggml_tensor * src0() const {
return node->src[0];
}
const ggml_tensor * src1() const {
return node->src[1];
}
bool is_empty() const {
return ggml_op_is_empty(node->op);
}
void add_fused(ggml_tensor * t) {
fused.push_back(t);
}
bool stackable() const {
switch (this->op()) {
case GGML_OP_MUL_MAT:
case GGML_OP_MUL_MAT_ID:
return ggml_is_quantized(this->src0()->type);
default:
return false;
}
}
bool same_input(const node_info& n) const {
return n.src1() == this->src1();
}
};
static std::vector<int> ggml_hexagon_graph_optimize_reorder(const std::vector<node_info> & nodes) {
static std::vector<int> ggml_hexagon_graph_optimize_reorder(const std::vector<htp_opnode> & nodes) {
const int n = nodes.size();
std::vector<int> res;
@@ -3299,14 +3288,14 @@ static void ggml_backend_hexagon_graph_optimize(ggml_backend_t backend, ggml_cgr
enum ggml_op ops[MAX_FUSE];
std::vector<node_info> nodes;
std::vector<htp_opnode> nodes;
nodes.reserve(gf->n_nodes);
// fuse nodes:
// we don't want to make reorders that break fusing, so we first pack all fusable tensors
// and perform the reorder over the fused nodes. after the reorder is done, we unfuse
for (int i = 0; i < n; i++) {
node_info node = {
htp_opnode node = {
/*.node =*/gf->nodes[i],
/*.fused =*/{},
};
+241
View File
@@ -0,0 +1,241 @@
#ifndef HTP_OPNODE_H
#define HTP_OPNODE_H
#define GGML_COMMON_IMPL_CPP
#include "ggml-backend-impl.h"
#include "ggml-common.h"
#include <string>
#include <vector>
#include <stdio.h>
#include "htp-ops.h"
struct htp_opnode {
ggml_tensor * node = nullptr;
std::vector<ggml_tensor *> fused;
htp_op_code opcode = HTP_OP_INVALID;
ggml_op op() const {
return node->op;
}
const ggml_tensor * dst() const {
return fused.empty() ? node : fused.back();
}
const ggml_tensor * src0() const {
return node->src[0];
}
const ggml_tensor * src1() const {
return node->src[1];
}
bool is_empty() const {
return ggml_op_is_empty(node->op);
}
void add_fused(ggml_tensor * t) {
fused.push_back(t);
}
bool stackable() const {
switch (this->op()) {
case GGML_OP_MUL_MAT:
case GGML_OP_MUL_MAT_ID:
return ggml_is_quantized(this->src0()->type);
default:
return false;
}
}
bool same_input(const htp_opnode& n) const {
return n.src1() == this->src1();
}
std::vector<const ggml_tensor *> get_inputs() const {
std::vector<const ggml_tensor *> inputs;
std::vector<const ggml_tensor *> outputs;
outputs.push_back(node);
for (const auto * f : fused) {
outputs.push_back(f);
}
auto contains = [&](const std::vector<const ggml_tensor *> & vec, const ggml_tensor * t) {
for (const auto * x : vec) {
if (x == t) return true;
}
return false;
};
auto add_input = [&](const ggml_tensor * t) {
if (t && !contains(outputs, t) && !contains(inputs, t)) {
inputs.push_back(t);
}
};
for (int i = 0; i < GGML_MAX_SRC && node->src[i]; i++) {
add_input(node->src[i]);
}
for (const auto * f : fused) {
for (int i = 0; i < GGML_MAX_SRC && f->src[i]; i++) {
add_input(f->src[i]);
}
}
return inputs;
}
std::string op_name() const {
if (fused.empty()) {
return ggml_op_desc(node);
}
std::string name = ggml_op_desc(node);
for (const auto * f : fused) {
name += "+";
name += ggml_op_desc(f);
}
return name;
}
};
struct htp_opformat {
char strides[64 * GGML_MAX_SRC];
char dims[64 * GGML_MAX_SRC];
char types[16 * GGML_MAX_SRC];
char buffs[64 * GGML_MAX_SRC];
char names[64 * GGML_MAX_SRC];
int format_tensor_dims(char * str, const struct ggml_tensor * t) {
if (t->ne[2] == 1 && t->ne[3] == 1) {
return sprintf(str, "%d:%d", (int) t->ne[0], (int) t->ne[1]);
} else {
return sprintf(str, "%d:%d:%d:%d", (int) t->ne[0], (int) t->ne[1], (int) t->ne[2], (int) t->ne[3]);
}
}
void format_op_dims(char * str, const htp_opnode & node) {
char * p = str;
auto inputs = node.get_inputs();
if (!inputs.empty()) {
p += format_tensor_dims(p, inputs[0]);
for (size_t i = 1; i < inputs.size(); i++) {
p += sprintf(p, " x ");
p += format_tensor_dims(p, inputs[i]);
}
p += sprintf(p, " -> ");
}
char self[64];
format_tensor_dims(self, node.dst());
p += sprintf(p, "%s", self);
}
int format_tensor_strides(char * str, const struct ggml_tensor * t) {
const char * c = ggml_is_contiguous(t) ? "" : "!";
if (t->ne[2] == 1 && t->ne[3] == 1) {
return sprintf(str, "%zu:%zu%s", (size_t) t->nb[0], (size_t) t->nb[1], c);
} else {
return sprintf(str, "%zu:%zu:%zu:%zu%s", (size_t) t->nb[0], (size_t) t->nb[1], (size_t) t->nb[2], (size_t) t->nb[3], c);
}
}
void format_op_strides(char * str, const htp_opnode & node) {
char * p = str;
auto inputs = node.get_inputs();
if (!inputs.empty()) {
p += format_tensor_strides(p, inputs[0]);
for (size_t i = 1; i < inputs.size(); i++) {
p += sprintf(p, " x ");
p += format_tensor_strides(p, inputs[i]);
}
p += sprintf(p, " -> ");
}
char self[64];
format_tensor_strides(self, node.dst());
p += sprintf(p, "%s", self);
}
void format_op_types(char * str, const htp_opnode & node) {
char * p = str;
auto inputs = node.get_inputs();
if (!inputs.empty()) {
p += sprintf(p, "%s", ggml_type_name(inputs[0]->type));
for (size_t i = 1; i < inputs.size(); i++) {
p += sprintf(p, " x ");
p += sprintf(p, "%s", ggml_type_name(inputs[i]->type));
}
p += sprintf(p, " -> ");
}
p += sprintf(p, "%s", ggml_type_name(node.dst()->type));
}
const char * tensor_buff_name(const struct ggml_tensor * t) {
if (t->buffer) {
return ggml_backend_buffer_name(t->buffer);
}
return "NONE";
}
void format_op_buffs(char * str, const htp_opnode & node) {
char * p = str;
auto inputs = node.get_inputs();
if (!inputs.empty()) {
p += sprintf(p, "%s", tensor_buff_name(inputs[0]));
for (size_t i = 1; i < inputs.size(); i++) {
p += sprintf(p, " x ");
p += sprintf(p, "%s", tensor_buff_name(inputs[i]));
}
p += sprintf(p, " -> ");
}
p += sprintf(p, "%s", tensor_buff_name(node.dst()));
}
void format_op_names(char * str, const htp_opnode & node) {
char * p = str;
auto inputs = node.get_inputs();
if (!inputs.empty()) {
p += sprintf(p, "%s", inputs[0]->name);
for (size_t i = 1; i < inputs.size(); i++) {
p += sprintf(p, " x ");
p += sprintf(p, "%s", inputs[i]->name);
}
p += sprintf(p, " -> ");
}
p += sprintf(p, "%s", node.dst()->name);
}
void format(const htp_opnode & node) {
format_op_dims(dims, node);
format_op_strides(strides, node);
format_op_types(types, node);
format_op_buffs(buffs, node);
format_op_names(names, node);
}
htp_opformat() {}
htp_opformat(const htp_opnode & node) { format(node); }
};
#endif // HTP_OPNODE_H
+2 -1
View File
@@ -58,15 +58,16 @@ list(FIND HTP_HMX_VERSIONS ${DSP_VERSION} _hmx_idx)
if (_hmx_idx GREATER_EQUAL 0)
target_sources(${HTP_LIB} PRIVATE
hmx-queue.c
hmx-flash-attn-ops.c
hmx-matmul-ops.c
hmx-queue.c
)
# -mhmx enables HMX instruction set (needed by files that include hmx-utils.h)
set_source_files_properties(
hmx-flash-attn-ops.c
hmx-matmul-ops.c
hmx-queue.c
PROPERTIES COMPILE_OPTIONS "-mhmx"
)
+81 -76
View File
@@ -22,6 +22,16 @@
// Must be multiple of 32
#define FLASH_ATTN_BLOCK_SIZE (32 * 2)
#if __HVX_ARCH__ < 79
#define HVX_OP_ADD_F32(a, b) Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(a, b))
#define HVX_OP_SUB_F32(a, b) Q6_Vsf_equals_Vqf32(Q6_Vqf32_vsub_VsfVsf(a, b))
#define HVX_OP_MUL_F32(a, b) Q6_Vsf_equals_Vqf32(Q6_Vqf32_vmpy_VsfVsf(a, b))
#else
#define HVX_OP_ADD_F32(a, b) Q6_Vsf_vadd_VsfVsf(a, b)
#define HVX_OP_SUB_F32(a, b) Q6_Vsf_vsub_VsfVsf(a, b)
#define HVX_OP_MUL_F32(a, b) Q6_Vsf_vmpy_VsfVsf(a, b)
#endif
// This is a bit of a hack because the compiler is strugling to properly inline
// the default hvx_vec_f32_to_f16 with output into the local array.
static __attribute__((noinline)) void hvx_vec_f32_to_f16_a(void *ptr, HVX_Vector v0, HVX_Vector v1)
@@ -54,8 +64,8 @@ static inline void hvx_dot_f16_f16_aa(float * restrict r, const void * restrict
rsum_p = hvx_vec_mpyacc_f32_f16(rsum_p, x_hf, y_hf);
}
HVX_Vector rsum = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_V_lo_W(rsum_p), Q6_V_hi_W(rsum_p)));
rsum = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vmpy_VsfVsf(hvx_vec_splat_f32(s), hvx_vec_reduce_sum_f32(rsum)));
HVX_Vector rsum = HVX_OP_ADD_F32(Q6_V_lo_W(rsum_p), Q6_V_hi_W(rsum_p));
rsum = HVX_OP_MUL_F32(hvx_vec_splat_f32(s), hvx_vec_reduce_sum_f32(rsum));
hvx_vec_store_u(r, 4, rsum);
}
@@ -105,10 +115,10 @@ static inline HVX_Vector hvx_dot_f16_f16_aa_rx4(const void * restrict y,
rsum3_p = hvx_vec_mpyacc_f32_f16(rsum3_p, x3_hf, y_hf);
}
HVX_Vector rsum0 = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_V_lo_W(rsum0_p), Q6_V_hi_W(rsum0_p)));
HVX_Vector rsum1 = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_V_lo_W(rsum1_p), Q6_V_hi_W(rsum1_p)));
HVX_Vector rsum2 = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_V_lo_W(rsum2_p), Q6_V_hi_W(rsum2_p)));
HVX_Vector rsum3 = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_V_lo_W(rsum3_p), Q6_V_hi_W(rsum3_p)));
HVX_Vector rsum0 = HVX_OP_ADD_F32(Q6_V_lo_W(rsum0_p), Q6_V_hi_W(rsum0_p));
HVX_Vector rsum1 = HVX_OP_ADD_F32(Q6_V_lo_W(rsum1_p), Q6_V_hi_W(rsum1_p));
HVX_Vector rsum2 = HVX_OP_ADD_F32(Q6_V_lo_W(rsum2_p), Q6_V_hi_W(rsum2_p));
HVX_Vector rsum3 = HVX_OP_ADD_F32(Q6_V_lo_W(rsum3_p), Q6_V_hi_W(rsum3_p));
HVX_Vector_x4 rsum0123 = { .v = { rsum0, rsum1, rsum2, rsum3 } };
return hvx_vec_reduce_sum_f32x4(rsum0123);
@@ -123,7 +133,7 @@ static inline HVX_Vector hvx_dot_f16_f16_aa_rx32(const void * restrict y,
const size_t nvec = n / VLEN_FP16; // num full fp16 hvx vectors
const size_t nloe = n % VLEN_FP16; // leftover elements
HVX_Vector sums; // initialize at j = 0
HVX_Vector sums = Q6_V_vzero();
const size_t stride_x_4 = stride_x * 4;
for (uint32_t j = 0; j < VLEN_FP32; j += 4) {
HVX_Vector sums_x4 = hvx_dot_f16_f16_aa_rx4(y, x, stride_x, nvec, nloe);
@@ -132,8 +142,7 @@ static inline HVX_Vector hvx_dot_f16_f16_aa_rx32(const void * restrict y,
x += stride_x_4;
}
sums = Q6_Vqf32_vmpy_VsfVsf(hvx_vec_splat_f32(s), sums);
return Q6_Vsf_equals_Vqf32(sums);
return HVX_OP_MUL_F32(hvx_vec_splat_f32(s), sums);
}
// MAD: y (F32) += x (F16) * s (F16)
@@ -268,11 +277,10 @@ static inline void hvx_scale_vec_f32_aa(uint8_t * restrict dst, const uint8_t *
uint32_t i = 0;
#pragma unroll(4)
for (; i < nvec; ++i) {
vdst[i] = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vmpy_VsfVsf(vsrc[i], vs));
vdst[i] = HVX_OP_MUL_F32(vsrc[i], vs);
}
if (nloe) {
HVX_Vector v = Q6_Vqf32_vmpy_VsfVsf(vsrc[i], vs);
hvx_vec_store_a(&vdst[i], nloe * sizeof(float), Q6_Vsf_equals_Vqf32(v));
hvx_vec_store_a(&vdst[i], nloe * sizeof(float), HVX_OP_MUL_F32(vsrc[i], vs));
}
}
@@ -438,25 +446,44 @@ static void flash_attn_ext_f16_thread(unsigned int nth, unsigned int ith, void *
// Process in sub-blocks of 32 (VLEN_FP32)
HVX_Vector sb_scores[FLASH_ATTN_BLOCK_SIZE / VLEN_FP32];
HVX_Vector v_max = hvx_vec_splat_f32(-INFINITY);
for (uint32_t iv = 0; ic + VLEN_FP32 <= current_block_size; ic += VLEN_FP32, ++iv) {
for (uint32_t iv = 0; ic < current_block_size; ic += VLEN_FP32, ++iv) {
// 1. Compute scores
HVX_Vector scores = hvx_dot_f16_f16_aa_rx32(q_ptr_vtcm, k_base + ic * factx->size_k_row_padded, factx->size_k_row_padded, DK, factx->scale);
// 2. Softcap
if (factx->logit_softcap != 0.0f) {
scores = hvx_vec_tanh_f32(scores);
scores = Q6_Vqf32_vmpy_VsfVsf(scores, logit_cap);
scores = Q6_Vsf_equals_Vqf32(scores);
scores = HVX_OP_MUL_F32(scores, logit_cap);
}
// 3. Mask
if (mask) {
const __fp16 * mp = m_base + ic;
HVX_Vector m_vals_f16 = *(const HVX_UVector *) mp;
HVX_VectorPair m_vals_f32_pair = Q6_Wqf32_vmpy_VhfVhf(Q6_Vh_vshuff_Vh(m_vals_f16), slope_vec);
HVX_Vector add_val = Q6_V_lo_W(m_vals_f32_pair);
scores = Q6_Vqf32_vadd_Vqf32Vsf(add_val, scores);
scores = Q6_Vsf_equals_Vqf32(scores);
// Multiplying -INFINITY (0xFC00) by a slope in VhfVhf instructions can incorrectly produce NaN on v79.
// Clamp -INFINITY to the max negative fp16 finite value (-65504.0f).
HVX_Vector vinf = Q6_Vh_vsplat_R(0xFC00);
HVX_Vector vmin = Q6_Vh_vsplat_R(0xFBFF);
HVX_VectorPred is_inf = Q6_Q_vcmp_eq_VhVh(m_vals_f16, vinf);
m_vals_f16 = Q6_V_vmux_QVV(is_inf, vmin, m_vals_f16);
#if __HVX_ARCH__ >= 79
HVX_VectorPair m_vals_f32_pair = Q6_Wsf_vmpy_VhfVhf(Q6_Vh_vshuff_Vh(m_vals_f16), slope_vec);
HVX_Vector add_val = Q6_V_lo_W(m_vals_f32_pair);
scores = Q6_Vsf_vadd_VsfVsf(add_val, scores);
#else
HVX_VectorPair m_vals_f32_pair = Q6_Wqf32_vmpy_VhfVhf(Q6_Vh_vshuff_Vh(m_vals_f16), slope_vec);
HVX_Vector add_val = Q6_V_lo_W(m_vals_f32_pair);
scores = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_Vqf32Vsf(add_val, scores));
#endif
}
// Mask out invalid lanes for leftover handling
uint32_t valid_lanes = current_block_size - ic;
if (valid_lanes < VLEN_FP32) {
HVX_VectorPred valid_pred = Q6_Q_vsetq_R(valid_lanes * 4); // 4 bytes per fp32 lane
scores = Q6_V_vmux_QVV(valid_pred, scores, hvx_vec_splat_f32(-INFINITY));
}
sb_scores[iv] = scores;
@@ -466,78 +493,55 @@ static void flash_attn_ext_f16_thread(unsigned int nth, unsigned int ith, void *
{
// 4. Online Softmax Update
HVX_Vector M_new_vec = Q6_Vsf_vmax_VsfVsf(v_max, M_vec);
HVX_Vector diff_vec = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vsub_VsfVsf(M_vec, M_new_vec));
HVX_Vector diff_vec = HVX_OP_SUB_F32(M_vec, M_new_vec);
HVX_Vector ms_vec = hvx_vec_exp_f32(diff_vec);
M_vec = M_new_vec;
hvx_scale_vec_f32_aa((uint8_t *) VKQ32, (const uint8_t *) VKQ32, DV, ms_vec);
HVX_Vector p_sum_vec = hvx_vec_splat_f32(0.0f);
for (uint32_t ic2 = 0, iv = 0; ic2 + VLEN_FP32 <= current_block_size; ic2 += VLEN_FP32, ++iv) {
for (uint32_t ic2 = 0, iv = 0; ic2 < current_block_size; ic2 += VLEN_FP32, ++iv) {
HVX_Vector scores = sb_scores[iv];
HVX_Vector scores_shifted = Q6_Vqf32_vsub_VsfVsf(scores, M_vec);
HVX_Vector P = hvx_vec_exp_f32(Q6_Vsf_equals_Vqf32(scores_shifted));
HVX_Vector scores_shifted = HVX_OP_SUB_F32(scores, M_vec);
HVX_Vector P = hvx_vec_exp_f32(scores_shifted);
p_sum_vec = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(p_sum_vec, P));
p_sum_vec = HVX_OP_ADD_F32(p_sum_vec, P);
// 5. Accumulate V
__fp16 __attribute__((aligned(VLEN))) p_arr[VLEN_FP16];
hvx_vec_f32_to_f16_a(p_arr, P, hvx_vec_splat_f32(0));
float __attribute__((aligned(128))) P_arr[VLEN_FP32];
hvx_vec_store_a(P_arr, 128, P);
for (uint32_t j = 0; j < VLEN_FP32; j += 2) {
const uint32_t cur_ic = ic2 + j;
const uint8_t * v_ptr = v_base + cur_ic * factx->size_v_row_padded;
const uint32_t cur_ic = ic2 + j;
if (cur_ic >= current_block_size) {
break;
}
if (cur_ic + 1 == current_block_size) {
// Odd leftover, process single row
if (P_arr[j] != 0.0f) {
const uint8_t * v_ptr = v_base + cur_ic * factx->size_v_row_padded;
hvx_mad_f32_f16_aa(VKQ32, v_ptr, (p_arr + j), DV);
}
break;
}
// Avoid NaN * 0.0 = NaN for uninitialized V cache rows.
// Check the f32 values to safely avoid strict aliasing violations.
if (P_arr[j] == 0.0f && P_arr[j + 1] == 0.0f) {
continue;
}
const uint8_t * v_ptr = v_base + cur_ic * factx->size_v_row_padded;
hvx_mad_f32_f16_aa_rx2(VKQ32, v_ptr, v_ptr + factx->size_v_row_padded, (p_arr + j), (p_arr + j + 1), DV);
}
}
p_sum_vec = hvx_vec_reduce_sum_f32(p_sum_vec);
S_vec = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_Vqf32_vmpy_VsfVsf(S_vec, ms_vec)), p_sum_vec));
}
if (ic < current_block_size) {
// Sync scalars for leftover/next block if needed
float M = hvx_vec_get_f32(M_vec);
float S = hvx_vec_get_f32(S_vec);
// Leftover
for (; ic < current_block_size; ++ic) {
float s_val;
const uint8_t * k_ptr = k_base + ic * factx->size_k_row_padded;
hvx_dot_f16_f16_aa(&s_val, q_ptr_vtcm, k_ptr, DK, factx->scale);
if (factx->logit_softcap != 0.0f) {
s_val = factx->logit_softcap * tanhf(s_val);
}
if (mask) {
const float m_val = m_base[ic];
s_val += slope * m_val;
}
const float Mold = M;
__fp16 vs = 1.0f;
if (s_val > M) {
M = s_val;
HVX_Vector diff_vec = hvx_vec_splat_f32(Mold - M);
HVX_Vector ms_vec = hvx_vec_exp_f32(diff_vec);
hvx_scale_vec_f32_aa((uint8_t *) VKQ32, (const uint8_t *) VKQ32, DV, ms_vec);
float ms = hvx_vec_get_f32(ms_vec);
S = S * ms + vs;
} else {
HVX_Vector diff_vec = hvx_vec_splat_f32(s_val - M);
vs = hvx_vec_get_f32(hvx_vec_exp_f32(diff_vec));
S += vs;
}
const uint8_t * v_ptr = v_base + ic * factx->size_v_row_padded;
hvx_mad_f32_f16_aa(VKQ32, v_ptr, &vs, DV);
}
M_vec = hvx_vec_splat_f32(M);
S_vec = hvx_vec_splat_f32(S);
S_vec = HVX_OP_ADD_F32(HVX_OP_MUL_F32(S_vec, ms_vec), p_sum_vec);
}
// Issue DMA for next+1 block (if exists)
@@ -599,8 +603,9 @@ static void flash_attn_ext_f16_thread(unsigned int nth, unsigned int ith, void *
const int i2 = iq2;
const int i3 = iq3;
// dst is permuted
uint8_t * dst_ptr = (uint8_t *) dst->data + (i3*ne2*ne1 + i2 + i1*ne1) * nb1;
// dst is permuted: [DV, n_heads, n_tokens, n_seq]
// head stride is nb[1], token stride is nb[2], batch stride is nb[3]
uint8_t * dst_ptr = (uint8_t *) dst->data + i2 * dst->nb[1] + i1 * dst->nb[2] + i3 * dst->nb[3];
if (dst->type == HTP_TYPE_F32) {
hvx_copy_f32_ua(dst_ptr, (uint8_t *) VKQ32, DV);
@@ -623,8 +628,8 @@ int op_flash_attn_ext(struct htp_ops_context * octx) {
}
#ifdef HTP_HAS_HMX
// HMX path: prefill (neq1 >= 32), head_dim multiple of 32, F16 KV
if (k->type == HTP_TYPE_F16 && v->type == HTP_TYPE_F16 && k->ne[0] % 32 == 0 && q->ne[1] >= 32) {
// HMX path: head_dim multiple of 32, F16 KV
if (k->type == HTP_TYPE_F16 && v->type == HTP_TYPE_F16 && k->ne[0] % 32 == 0) {
int ret = hmx_flash_attn_ext(octx);
if (ret == HTP_STATUS_OK) {
return ret;
@@ -1248,9 +1248,6 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
if (DK % 32 != 0 || DV % 32 != 0) {
return HTP_STATUS_NO_SUPPORT;
}
if (neq1 < 32) {
return HTP_STATUS_NO_SUPPORT;
}
// GQA factor
const uint32_t n_kv_heads = k->ne[2];
+288 -207
View File
@@ -16,6 +16,7 @@
#include "ggml-common.h"
#include "hex-dma.h"
#include "hex-fastdiv.h"
#include "worker-pool.h"
#include "hvx-utils.h"
@@ -187,45 +188,44 @@ next_nc:
// In x4x2, sub-blocks 0..3 use lower nibbles, sub-blocks 4..7 use upper nibbles
// of the same 32 packed bytes.
static inline HVX_Vector dequantize_x4x2_q4_0_group_hvx(const uint8_t *packed_32, bool upper_nibbles, const __fp16 *scale, const HVX_Vector vlut_cvt) {
(void)vlut_cvt;
HVX_Vector vq = hvx_vmemu(packed_32);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
const HVX_Vector i8 = Q6_Vb_vsplat_R(8);
HVX_Vector v_scales = hvx_vec_repl_f16(hvx_vmemu(scale));
// q4x4x2 stores two int4 values per byte. Keep only the selected nibble.
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
// Shuffle before LUT
v_quants = Q6_Vb_vshuff_Vb(v_quants);
// Use standard vlut16 (not _nomatch) to avoid stale-register NaN.
// _nomatch retains the previous destination-register value for colliding
// indices, but the C intrinsic doesn't model the implicit read so the
// compiler may allocate a register containing garbage/NaN.
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_hf = Q6_V_lo_W(vp);
HVX_Vector v_int8 = Q6_Vb_vsub_VbVb(v_quants, i8);
HVX_Vector v0 = Q6_V_lo_W(Q6_Wh_vunpack_Vb(v_int8));
HVX_Vector v_hf = Q6_Vhf_equals_Vh(v0);
return Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_hf, v_scales));
}
// Batch-dequantize 4 contiguous x4x2 Q4_0 groups (4x32 = 128 packed bytes) using
// full HVX vector width. One vmemu + one vlut16 replaces 4 separate calls.
// full HVX vector width.
// Output: vector_x2 each hold 32 FP16 values in the first 64 bytes.
static inline HVX_Vector_x2 dequantize_x4x2_q4_0_x4groups_hvx(
const uint8_t *packed_128, bool upper_nibbles,
const __fp16 *scales_4, const HVX_Vector vlut_cvt) {
// Load all 128 packed bytes (4 contiguous 32-byte groups)
(void)vlut_cvt;
HVX_Vector vq = hvx_vmemu(packed_128);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
const HVX_Vector i8 = Q6_Vb_vsplat_R(8);
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
// Shuffle before LUT
v_quants = Q6_Vb_vshuff_Vb(v_quants);
HVX_Vector v_int8 = Q6_Vb_vsub_VbVb(v_quants, i8);
// Full-width vlut16: 128 byte lookups -> 128 fp16 results in a VectorPair
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_lo = Q6_V_lo_W(vp); // [group0: 32 fp16 | group1: 32 fp16]
HVX_Vector v_hi = Q6_V_hi_W(vp); // [group2: 32 fp16 | group3: 32 fp16]
HVX_VectorPair vp_int16 = Q6_Wh_vunpack_Vb(v_int8);
HVX_Vector v_lo = Q6_V_lo_W(vp_int16);
HVX_Vector v_hi = Q6_V_hi_W(vp_int16);
v_lo = Q6_Vhf_equals_Vh(v_lo);
v_hi = Q6_Vhf_equals_Vh(v_hi);
// Build per-group scale vectors: first 64 bytes use scale_a, last 64 use scale_b
HVX_Vector vscale = hvx_vmemu(scales_4);
HVX_Vector v_sc01 = hvx_vec_repl_2x_f16(vscale);
HVX_Vector v_sc23 = hvx_vec_repl_2x_f16(Q6_V_vror_VR(vscale, 4));
@@ -233,13 +233,12 @@ static inline HVX_Vector_x2 dequantize_x4x2_q4_0_x4groups_hvx(
v_lo = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_lo, v_sc01));
v_hi = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_hi, v_sc23));
// Extract individual groups: scatter uses q_mask64 so only first 64 bytes matter
HVX_Vector_x2 r = { v_lo,/* group1 already in [0:63] */
v_hi /* group2 already in [0:63] */ };
HVX_Vector_x2 r = { v_lo, v_hi };
return r;
}
static inline HVX_Vector dequantize_x4x2_q4_1_group_hvx(const uint8_t *packed_32, bool upper_nibbles, const __fp16 *scale_offset, const HVX_Vector vlut_cvt) {
(void)vlut_cvt;
HVX_Vector vq = hvx_vmemu(packed_32);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_dm = hvx_vmemu(scale_offset);
@@ -248,9 +247,9 @@ static inline HVX_Vector dequantize_x4x2_q4_1_group_hvx(const uint8_t *packed_32
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
v_quants = Q6_Vb_vshuff_Vb(v_quants);
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_hf = Q6_V_lo_W(vp);
HVX_Vector v0 = Q6_V_lo_W(Q6_Wh_vunpack_Vb(v_quants));
HVX_Vector v_hf = Q6_Vhf_equals_Vh(v0);
return Q6_Vhf_equals_Vqf16(Q6_Vqf16_vadd_Vqf16Vhf(Q6_Vqf16_vmpy_VhfVhf(v_hf, v_scales), v_offsets));
}
@@ -258,16 +257,18 @@ static inline HVX_Vector dequantize_x4x2_q4_1_group_hvx(const uint8_t *packed_32
static inline HVX_Vector_x2 dequantize_x4x2_q4_1_x4groups_hvx(
const uint8_t *packed_128, bool upper_nibbles,
const __fp16 *scales_offsets_4, const HVX_Vector vlut_cvt) {
(void)vlut_cvt;
HVX_Vector vq = hvx_vmemu(packed_128);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
v_quants = Q6_Vb_vshuff_Vb(v_quants);
HVX_VectorPair vp_int16 = Q6_Wh_vunpack_Vb(v_quants);
HVX_Vector v_lo = Q6_V_lo_W(vp_int16);
HVX_Vector v_hi = Q6_V_hi_W(vp_int16);
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_lo = Q6_V_lo_W(vp);
HVX_Vector v_hi = Q6_V_hi_W(vp);
v_lo = Q6_Vhf_equals_Vh(v_lo);
v_hi = Q6_Vhf_equals_Vh(v_hi);
HVX_Vector vscale_offset = hvx_vmemu(scales_offsets_4);
HVX_VectorPair dm_deal = Q6_W_vdeal_VVR(vscale_offset, vscale_offset, -2);
@@ -287,6 +288,45 @@ static inline HVX_Vector_x2 dequantize_x4x2_q4_1_x4groups_hvx(
return r;
}
// LUT-based dequantizers for non-linear IQ4_NL format.
static inline HVX_Vector dequantize_x4x2_iq4_nl_group_hvx(const uint8_t *packed_32, bool upper_nibbles, const __fp16 *scale, const HVX_Vector vlut_cvt) {
HVX_Vector vq = hvx_vmemu(packed_32);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_scales = hvx_vec_repl_f16(hvx_vmemu(scale));
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
v_quants = Q6_Vb_vshuff_Vb(v_quants);
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_hf = Q6_V_lo_W(vp);
return Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_hf, v_scales));
}
static inline HVX_Vector_x2 dequantize_x4x2_iq4_nl_x4groups_hvx(
const uint8_t *packed_128, bool upper_nibbles,
const __fp16 *scales_4, const HVX_Vector vlut_cvt) {
HVX_Vector vq = hvx_vmemu(packed_128);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
v_quants = Q6_Vb_vshuff_Vb(v_quants);
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_lo = Q6_V_lo_W(vp);
HVX_Vector v_hi = Q6_V_hi_W(vp);
HVX_Vector vscale = hvx_vmemu(scales_4);
HVX_Vector v_sc01 = hvx_vec_repl_2x_f16(vscale);
HVX_Vector v_sc23 = hvx_vec_repl_2x_f16(Q6_V_vror_VR(vscale, 4));
v_lo = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_lo, v_sc01));
v_hi = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_hi, v_sc23));
HVX_Vector_x2 r = { v_lo, v_hi };
return r;
}
// Dequantize one x4x2 Q8_0 group (32 int8 quants) -> 32 FP16 in first 64 bytes.
static inline HVX_Vector dequantize_x4x2_q8_0_group_hvx(const int8_t *quants_32, const __fp16 *scale) {
HVX_Vector vq = hvx_vmemu(quants_32);
@@ -374,122 +414,176 @@ static inline HVX_Vector_x4 dequantize_x4x2_mxfp4_x4groups_hvx(const uint8_t *
return r;
}
typedef struct {
__fp16 *dst;
const uint8_t *src;
int n_cols;
int k_block;
size_t row_stride;
int weight_type;
int n_tot_tiles;
int n_tiles_per_task;
int n_tasks;
int n_k_tiles;
struct fastdiv_values n_k_tiles_div;
} x4x2_dequantize_state_t;
// Dequantize a tile range from x4x2 weight data (already in VTCM) to tile-major FP16.
// Input: vtcm_src has n_cols rows of x4x2 data, each row_stride bytes.
// Output: vtcm_dst in tile-major FP16 layout.
static void dequantize_x4x2_weight_to_fp16_tiles_task(
__fp16 *restrict vtcm_dst,
const uint8_t *restrict vtcm_src,
int n_cols, int k_block,
size_t row_stride, int weight_type,
#define DEFINE_DEQUANTIZE_Q4_TASK(suffix, lut_name, helper_prefix, dblk_size, scale_step) \
static void dequantize_x4x2_weight_to_fp16_tiles_task_##suffix( \
const x4x2_dequantize_state_t *state, \
int start_tile, int end_tile) { \
\
const int n_k_tiles = state->n_k_tiles; \
const int qrow_size = (unsigned)state->k_block / 2; \
const struct fastdiv_values n_k_tiles_div = state->n_k_tiles_div; \
const HVX_Vector vlut_cvt = hvx_vmem(lut_name); \
\
const HVX_Vector v_scat_base = hvx_vmem(hmx_transpose_scatter_offsets); \
const HVX_Vector v_scat_step = Q6_V_vsplat_R(4); \
const HVX_VectorPred q_mask64 = Q6_Q_vsetq_R(64); \
\
unsigned ct = fastdiv((unsigned)start_tile, &n_k_tiles_div); \
unsigned kt = fastmodulo((unsigned)start_tile, n_k_tiles, &n_k_tiles_div); \
\
for (unsigned t = start_tile; t < (unsigned)end_tile; ) { \
if (kt >= (unsigned)n_k_tiles) { kt = 0; ct++; } \
\
if ((kt % 4 == 0) && (t + 4 <= (unsigned)end_tile) && (fastdiv(t + 3, &n_k_tiles_div) == ct)) { \
unsigned blk_idx = ((kt * 32) / QK_Q4_0x4x2); \
unsigned sub_blk_base = ((kt * 32) % QK_Q4_0x4x2) / 32; \
bool upper = (sub_blk_base >= 4); \
unsigned packed_off = blk_idx * (QK_Q4_0x4x2 / 2); \
unsigned scale_off = qrow_size + blk_idx * (dblk_size) + sub_blk_base * (scale_step); \
\
__fp16 *tile_bases[4]; \
for (unsigned g = 0; g < 4; g++) { \
tile_bases[g] = state->dst + (t + g) * HMX_FP16_TILE_N_ELMS; \
} \
\
HVX_Vector v_off = v_scat_base; \
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * state->row_stride; \
\
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2) { \
const uint8_t *r0 = state->src + row_offset; row_offset += state->row_stride; \
const uint8_t *r1 = state->src + row_offset; row_offset += state->row_stride; \
\
HVX_Vector_x2 dv0 = dequantize_x4x2_##helper_prefix##_x4groups_hvx( \
r0 + packed_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt); \
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[0]); \
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[1]); \
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step); \
\
HVX_Vector_x2 dv1 = dequantize_x4x2_##helper_prefix##_x4groups_hvx( \
r1 + packed_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt); \
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[0]); \
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[1]); \
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step); \
} \
\
for (int g = 0; g < 4; g++) { (void) *(volatile HVX_Vector *)(tile_bases[g]); } \
t += 4; kt += 4; \
continue; \
} \
\
__fp16 *tile_base = state->dst + t * HMX_FP16_TILE_N_ELMS; \
{ \
unsigned blk_idx = (kt * 32) / QK_Q4_0x4x2; \
unsigned sub_blk = ((kt * 32) % QK_Q4_0x4x2) / 32; \
bool upper = (sub_blk >= 4); \
unsigned byte_off = blk_idx * (QK_Q4_0x4x2 / 2) + (upper ? (sub_blk - 4) : sub_blk) * 32; \
unsigned scale_off = qrow_size + blk_idx * (dblk_size) + sub_blk * (scale_step); \
\
HVX_Vector v_off = v_scat_base; \
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * state->row_stride; \
unsigned row1 = ct * HMX_FP16_TILE_N_COLS + 1; \
\
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) { \
const uint8_t *r0 = state->src + row_offset; row_offset += state->row_stride; \
const uint8_t *r1 = state->src + row_offset; row_offset += state->row_stride; \
\
HVX_Vector v0 = dequantize_x4x2_##helper_prefix##_group_hvx( \
r0 + byte_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt); \
HVX_Vector v1 = (row1 < (unsigned)state->n_cols) \
? dequantize_x4x2_##helper_prefix##_group_hvx( \
r1 + byte_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt) \
: Q6_V_vzero(); \
\
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v0); \
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step); \
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v1); \
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step); \
} \
(void) *(volatile HVX_Vector *)(tile_base); \
} \
++t; ++kt; \
} \
\
if (start_tile < end_tile) { \
(void) *(volatile HVX_Vector *)(state->dst + (end_tile - 1) * HMX_FP16_TILE_N_ELMS); \
} \
} \
\
static void dequantize_x4x2_worker_loop_##suffix(unsigned int n, unsigned int i, void *data) { \
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data; \
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) { \
int start = task_id * state->n_tiles_per_task; \
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles); \
dequantize_x4x2_weight_to_fp16_tiles_task_##suffix(state, start, end); \
} \
}
DEFINE_DEQUANTIZE_Q4_TASK(q4_0, q4_0_to_fp16_lut, q4_0, HMX_X4X2_DBLK_SIZE, (int)sizeof(__fp16))
DEFINE_DEQUANTIZE_Q4_TASK(q4_1, q4_1_to_fp16_lut, q4_1, 32, 4)
DEFINE_DEQUANTIZE_Q4_TASK(iq4_nl, iq4_nl_to_fp16_lut, iq4_nl, HMX_X4X2_DBLK_SIZE, (int)sizeof(__fp16))
static void dequantize_x4x2_weight_to_fp16_tiles_task_mxfp4(
const x4x2_dequantize_state_t *state,
int start_tile, int end_tile) {
const int n_k_tiles = (unsigned)k_block / HMX_FP16_TILE_N_COLS;
const bool is_q4 = (weight_type == HTP_TYPE_Q4_0 || weight_type == HTP_TYPE_Q4_1 || weight_type == HTP_TYPE_IQ4_NL);
const bool is_q4_1 = (weight_type == HTP_TYPE_Q4_1);
const int qrow_size = is_q4 ? ((unsigned)k_block / 2) : k_block;
const int n_k_tiles = state->n_k_tiles;
const int qrow_size = state->k_block;
const struct fastdiv_values n_k_tiles_div = state->n_k_tiles_div;
const HVX_Vector vlut_cvt = hvx_vmem(mxfp4_to_fp16_lut);
const HVX_Vector vlut_cvt = (weight_type == HTP_TYPE_IQ4_NL) ? hvx_vmem(iq4_nl_to_fp16_lut) :
(weight_type == HTP_TYPE_MXFP4) ? hvx_vmem(mxfp4_to_fp16_lut) :
(weight_type == HTP_TYPE_Q4_1) ? hvx_vmem(q4_1_to_fp16_lut) :
hvx_vmem(q4_0_to_fp16_lut);
// vscatter setup: write dequantized K-values directly to transposed [K][N] tile positions.
// Each int32 element holds a K-row-pair (2 adjacent fp16 values). word[i] at offset i*128
// maps to K-rows 2i and 2i+1. Column offset (n*4) added per row.
const HVX_Vector v_scat_base = hvx_vmem(hmx_transpose_scatter_offsets);
const HVX_Vector v_scat_step = Q6_V_vsplat_R(4); // 4 bytes = 1 column step
const HVX_VectorPred q_mask64 = Q6_Q_vsetq_R(64); // first 16 words (64 bytes)
const HVX_Vector v_scat_step = Q6_V_vsplat_R(4);
const HVX_VectorPred q_mask64 = Q6_Q_vsetq_R(64);
unsigned ct = (unsigned)start_tile / n_k_tiles; // column tile index
unsigned kt = (unsigned)start_tile % n_k_tiles; // K tile index
for (unsigned t = start_tile; t < end_tile; ) {
if (kt >= n_k_tiles) { kt = 0; ct++; }
unsigned ct = fastdiv((unsigned)start_tile, &n_k_tiles_div);
unsigned kt = fastmodulo((unsigned)start_tile, n_k_tiles, &n_k_tiles_div);
// --- Batch-4 fast path for Q4: process 4 contiguous K-tiles with one vlut16 per row ---
if (is_q4 && (kt % 4 == 0) && (t + 4 <= end_tile) && ((t + 3) / n_k_tiles == ct)) {
unsigned blk_idx = (kt * 32) / QK_Q4_0x4x2;
unsigned sub_blk_base = ((kt * 32) % QK_Q4_0x4x2) / 32; // 0 or 4
bool upper = (sub_blk_base >= 4);
unsigned packed_off = blk_idx * (QK_Q4_0x4x2 / 2); // 128 contiguous packed bytes
unsigned dblk_size = is_q4_1 ? 32 : HMX_X4X2_DBLK_SIZE;
unsigned scale_step = is_q4_1 ? 4 : (int)sizeof(__fp16);
unsigned scale_off = qrow_size + blk_idx * dblk_size
+ sub_blk_base * scale_step;
for (unsigned t = start_tile; t < (unsigned)end_tile; ) {
if (kt >= (unsigned)n_k_tiles) { kt = 0; ct++; }
__fp16 *tile_bases[4];
for (unsigned g = 0; g < 4; g++) { tile_bases[g] = vtcm_dst + (t + g) * HMX_FP16_TILE_N_ELMS; }
HVX_Vector v_off = v_scat_base;
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * row_stride;
unsigned row1 = ct * HMX_FP16_TILE_N_COLS + 1;
if (is_q4_1) {
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector_x2 dv0 = dequantize_x4x2_q4_1_x4groups_hvx(r0 + packed_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector_x2 dv1 = dequantize_x4x2_q4_1_x4groups_hvx(r1 + packed_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
} else {
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector_x2 dv0 = dequantize_x4x2_q4_0_x4groups_hvx(r0 + packed_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector_x2 dv1 = dequantize_x4x2_q4_0_x4groups_hvx(r1 + packed_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
}
for (int g = 0; g < 4; g++) { (void) *(volatile HVX_Vector *)(tile_bases[g]); }
t += 4; kt += 4;
continue;
}
// --- Batch-4 fast path for MXFP4: same nibble layout but E8M0 scales ---
if (weight_type == HTP_TYPE_MXFP4 && (kt % 4 == 0) && (t + 4 <= end_tile) && ((t + 3) / n_k_tiles == ct)) {
// Batch-4 fast path for MXFP4
if ((kt % 4 == 0) && (t + 4 <= (unsigned)end_tile) && (fastdiv(t + 3, &n_k_tiles_div) == ct)) {
int blk_idx = (kt * 32) / QK_MXFP4x4x2;
int sub_blk_base = ((kt * 32) % QK_MXFP4x4x2) / 32; // 0 or 4
int sub_blk_base = ((kt * 32) % QK_MXFP4x4x2) / 32;
bool upper = (sub_blk_base >= 4);
int packed_off = blk_idx * (QK_MXFP4x4x2 / 2); // 128 contiguous packed bytes
int e8m0_blk_off = qrow_size + blk_idx * HMX_X4X2_MXFP4_EBLK_SIZE; // all 8 E8M0 scales
int packed_off = blk_idx * (QK_MXFP4x4x2 / 2);
int e8m0_blk_off = qrow_size + blk_idx * HMX_X4X2_MXFP4_EBLK_SIZE;
__fp16 * tile_bases[4];
for (int g = 0; g < 4; g++) {
tile_bases[g] = vtcm_dst + (t + g) * HMX_FP16_TILE_N_ELMS;
tile_bases[g] = state->dst + (t + g) * HMX_FP16_TILE_N_ELMS;
}
HVX_Vector v_off = v_scat_base;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2) {
int row0 = ct * HMX_FP16_TILE_N_COLS + r;
int row1 = row0 + 1;
const uint8_t * r0 = vtcm_src + row0 * row_stride;
const uint8_t * r1 = vtcm_src + row1 * row_stride;
const uint8_t * r0 = state->src + row0 * state->row_stride;
const uint8_t * r1 = state->src + row1 * state->row_stride;
// Batch-convert all 8 E8M0 scales once per row (stays in HVX register)
mxfp4_scales_t r0_e8 = mxfp4_convert_scales(r0 + e8m0_blk_off);
HVX_Vector_x4 dv0, dv1;
dv0 = dequantize_x4x2_mxfp4_x4groups_hvx(r0 + packed_off, upper, sub_blk_base, vlut_cvt, r0_e8);
if (row1 < n_cols) {
if (row1 < state->n_cols) {
mxfp4_scales_t r1_e8 = mxfp4_convert_scales(r1 + e8m0_blk_off);
dv1 = dequantize_x4x2_mxfp4_x4groups_hvx(r1 + packed_off, upper, sub_blk_base, vlut_cvt, r1_e8);
} else {
@@ -510,58 +604,13 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
(void) *(volatile HVX_Vector *) (tile_bases[g]);
}
t += 4;
t += 4; kt += 4;
continue;
}
// --- Single-tile fallback ---
__fp16 *tile_base = vtcm_dst + t * HMX_FP16_TILE_N_ELMS;
if (is_q4) {
unsigned blk_idx = (kt * 32) / QK_Q4_0x4x2;
unsigned sub_blk = ((kt * 32) % QK_Q4_0x4x2) / 32;
bool upper = (sub_blk >= 4);
unsigned byte_off = blk_idx * (QK_Q4_0x4x2 / 2) + (upper ? (sub_blk - 4) : sub_blk) * 32;
unsigned dblk_size = is_q4_1 ? 32 : HMX_X4X2_DBLK_SIZE;
unsigned scale_step = is_q4_1 ? 4 : (int)sizeof(__fp16);
unsigned scale_off = qrow_size + blk_idx * dblk_size + sub_blk * scale_step;
HVX_Vector v_off = v_scat_base; // reset to column 0
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * row_stride;
unsigned row1 = ct * HMX_FP16_TILE_N_COLS + 1;
if (is_q4_1) {
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector v0 = dequantize_x4x2_q4_1_group_hvx(r0 + byte_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector v1 = (row1 < n_cols)
? dequantize_x4x2_q4_1_group_hvx(r1 + byte_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt)
: Q6_V_vzero();
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v0);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v1);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
} else {
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector v0 = dequantize_x4x2_q4_0_group_hvx(r0 + byte_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector v1 = (row1 < n_cols)
? dequantize_x4x2_q4_0_group_hvx(r1 + byte_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt)
: Q6_V_vzero();
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v0);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v1);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
}
(void) *(volatile HVX_Vector *)(tile_base);
} else if (weight_type == HTP_TYPE_MXFP4) {
// Single-tile fallback
__fp16 *tile_base = state->dst + t * HMX_FP16_TILE_N_ELMS;
{
int blk_idx = (kt * 32) / QK_MXFP4x4x2;
int sub_blk = ((kt * 32) % QK_MXFP4x4x2) / 32;
bool upper = (sub_blk >= 4);
@@ -573,15 +622,14 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
int row0 = ct * HMX_FP16_TILE_N_COLS + r;
int row1 = row0 + 1;
const uint8_t * r0 = vtcm_src + row0 * row_stride;
const uint8_t * r1 = vtcm_src + row1 * row_stride;
const uint8_t * r0 = state->src + row0 * state->row_stride;
const uint8_t * r1 = state->src + row1 * state->row_stride;
// Batch-convert all 8 E8M0 scales once per row (stays in HVX register)
mxfp4_scales_t r0_e8 = mxfp4_convert_scales(r0 + e8m0_blk_off);
HVX_Vector v0 = dequantize_x4x2_mxfp4_group_hvx(r0 + byte_off, upper, sub_blk, vlut_cvt, r0_e8);
HVX_Vector v1;
if (row1 < n_cols) {
if (row1 < state->n_cols) {
mxfp4_scales_t r1_e8 = mxfp4_convert_scales(r1 + e8m0_blk_off);
v1 = dequantize_x4x2_mxfp4_group_hvx(r1 + byte_off, upper, sub_blk, vlut_cvt, r1_e8);
} else {
@@ -594,23 +642,59 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
(void) *(volatile HVX_Vector *) (tile_base);
} else {
// Q8_0
}
++t; ++kt;
}
if (start_tile < end_tile) {
(void) *(volatile HVX_Vector *)(state->dst + (end_tile - 1) * HMX_FP16_TILE_N_ELMS);
}
}
static void dequantize_x4x2_worker_loop_mxfp4(unsigned int n, unsigned int i, void *data) {
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data;
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) {
int start = task_id * state->n_tiles_per_task;
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles);
dequantize_x4x2_weight_to_fp16_tiles_task_mxfp4(state, start, end);
}
}
static void dequantize_x4x2_weight_to_fp16_tiles_task_q8_0(
const x4x2_dequantize_state_t *state,
int start_tile, int end_tile) {
const int n_k_tiles = state->n_k_tiles;
const int qrow_size = state->k_block;
const struct fastdiv_values n_k_tiles_div = state->n_k_tiles_div;
const HVX_Vector v_scat_base = hvx_vmem(hmx_transpose_scatter_offsets);
const HVX_Vector v_scat_step = Q6_V_vsplat_R(4);
const HVX_VectorPred q_mask64 = Q6_Q_vsetq_R(64);
unsigned ct = fastdiv((unsigned)start_tile, &n_k_tiles_div);
unsigned kt = fastmodulo((unsigned)start_tile, n_k_tiles, &n_k_tiles_div);
for (unsigned t = start_tile; t < (unsigned)end_tile; ) {
if (kt >= (unsigned)n_k_tiles) { kt = 0; ct++; }
__fp16 *tile_base = state->dst + t * HMX_FP16_TILE_N_ELMS;
{
int blk_idx = (kt * 32) / QK_Q8_0x4x2;
int sub_blk = ((kt * 32) % QK_Q8_0x4x2) / 32;
int byte_off = blk_idx * QK_Q8_0x4x2 + sub_blk * 32;
int scale_off = qrow_size + blk_idx * HMX_X4X2_DBLK_SIZE + sub_blk * (int)sizeof(__fp16);
HVX_Vector v_off = v_scat_base; // reset to column 0
HVX_Vector v_off = v_scat_base;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2) {
int row0 = ct * HMX_FP16_TILE_N_COLS + r;
int row1 = row0 + 1;
const uint8_t *r0 = vtcm_src + row0 * row_stride;
const uint8_t *r1 = vtcm_src + row1 * row_stride;
const uint8_t *r0 = state->src + row0 * state->row_stride;
const uint8_t *r1 = state->src + row1 * state->row_stride;
HVX_Vector v0 = dequantize_x4x2_q8_0_group_hvx((const int8_t *)(r0 + byte_off), (const __fp16 *)(r0 + scale_off));
HVX_Vector v1 = (row1 < n_cols) ? dequantize_x4x2_q8_0_group_hvx((const int8_t *)(r1 + byte_off), (const __fp16 *)(r1 + scale_off)) : Q6_V_vzero();
HVX_Vector v1 = (row1 < state->n_cols) ? dequantize_x4x2_q8_0_group_hvx((const int8_t *)(r1 + byte_off), (const __fp16 *)(r1 + scale_off)) : Q6_V_vzero();
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v0);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
@@ -622,50 +706,31 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
++t; ++kt;
}
// Drain HVX scatter write buffer: a vmem load on the same HW thread retires
// all pending scatter entries to VTCM. Without this, the main thread's HMX
// reads may see stale data because atomic_fetch_sub (release) only orders
// regular stores, not the HVX scatter buffer.
if (start_tile < end_tile) {
(void) *(volatile HVX_Vector *)(vtcm_dst + (end_tile - 1) * HMX_FP16_TILE_N_ELMS);
(void) *(volatile HVX_Vector *)(state->dst + (end_tile - 1) * HMX_FP16_TILE_N_ELMS);
}
}
typedef struct {
__fp16 *dst;
const uint8_t *src;
int n_cols;
int k_block;
size_t row_stride;
int weight_type;
int n_tot_tiles;
int n_tiles_per_task;
int n_tasks;
} x4x2_dequantize_state_t;
static void dequantize_x4x2_worker_loop(unsigned int n, unsigned int i, void *data) {
static void dequantize_x4x2_worker_loop_q8_0(unsigned int n, unsigned int i, void *data) {
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data;
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) {
int start = task_id * state->n_tiles_per_task;
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles);
dequantize_x4x2_weight_to_fp16_tiles_task(
state->dst, state->src, state->n_cols, state->k_block,
state->row_stride, state->weight_type, start, end);
dequantize_x4x2_weight_to_fp16_tiles_task_q8_0(state, start, end);
}
}
static void dequantize_x4x2_weight_chunk_to_fp16_tiles(
struct htp_context *ctx, __fp16 *vtcm_dst,
const void *vtcm_src, int n_cols, int k_block,
size_t row_stride, int weight_type) {
size_t row_stride, int weight_type,
int n_k_tiles, struct fastdiv_values n_k_tiles_div,
worker_callback_t dequant_worker_fn) {
assert(n_cols % HMX_FP16_TILE_N_COLS == 0);
assert(k_block % HMX_FP16_TILE_N_COLS == 0);
size_t n_col_tiles = n_cols / HMX_FP16_TILE_N_COLS;
size_t n_k_tiles = k_block / HMX_FP16_TILE_N_COLS;
size_t n_tot_tiles = n_col_tiles * n_k_tiles;
size_t n_tiles_per_task = hmx_ceil_div(n_tot_tiles, ctx->n_threads);
@@ -680,8 +745,10 @@ static void dequantize_x4x2_weight_chunk_to_fp16_tiles(
state.k_block = k_block;
state.row_stride = row_stride;
state.weight_type = weight_type;
state.n_k_tiles = n_k_tiles;
state.n_k_tiles_div = n_k_tiles_div;
worker_pool_run_func(ctx->worker_pool, dequantize_x4x2_worker_loop, &state, ctx->n_threads);
worker_pool_run_func(ctx->worker_pool, dequant_worker_fn, &state, ctx->n_threads);
}
// --- End x4x2 dequantizers ---
@@ -978,6 +1045,20 @@ int hmx_matmul_q_f32(struct htp_context *ctx, float *restrict dst, const float *
return -1;
}
worker_callback_t dequant_worker_fn = NULL;
switch (weight_type) {
case HTP_TYPE_Q4_0: dequant_worker_fn = dequantize_x4x2_worker_loop_q4_0; break;
case HTP_TYPE_IQ4_NL: dequant_worker_fn = dequantize_x4x2_worker_loop_iq4_nl; break;
case HTP_TYPE_Q4_1: dequant_worker_fn = dequantize_x4x2_worker_loop_q4_1; break;
case HTP_TYPE_MXFP4: dequant_worker_fn = dequantize_x4x2_worker_loop_mxfp4; break;
case HTP_TYPE_Q8_0: dequant_worker_fn = dequantize_x4x2_worker_loop_q8_0; break;
default:
return -1;
}
const int n_k_tiles = k / HMX_FP16_TILE_N_COLS;
const struct fastdiv_values n_k_tiles_div = init_fastdiv_values(n_k_tiles);
// --- Dynamic VTCM layout ---
const size_t vec_dot_size = k * sizeof(__fp16);
const size_t vtcm_budget = ctx->vtcm_size;
@@ -1070,7 +1151,7 @@ int hmx_matmul_q_f32(struct htp_context *ctx, float *restrict dst, const float *
{
// B0: wait for DMA, dequant weight chunk 0
dma_queue_pop(ctx->dma[0]);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[0], vtcm_qweight, n_cols_A0, k, row_stride, weight_type);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[0], vtcm_qweight, n_cols_A0, k, row_stride, weight_type, n_k_tiles, n_k_tiles_div, dequant_worker_fn);
// A1: issue DMA for weight chunk 1
const size_t n_cols_A1 = hex_smin(n - 1 * n_chunk_n_cols, n_chunk_n_cols);
@@ -1089,7 +1170,7 @@ int hmx_matmul_q_f32(struct htp_context *ctx, float *restrict dst, const float *
// B1: DMA pop + dequant (runs in parallel with C0 on HMX worker)
if (1 < n_chunk_cnt) {
dma_queue_pop(ctx->dma[0]);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[1], vtcm_qweight, n_cols_A1, k, row_stride, weight_type);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[1], vtcm_qweight, n_cols_A1, k, row_stride, weight_type, n_k_tiles, n_k_tiles_div, dequant_worker_fn);
}
}
@@ -1131,7 +1212,7 @@ int hmx_matmul_q_f32(struct htp_context *ctx, float *restrict dst, const float *
// B_{i+2}: DMA pop + dequant (multi-thread HVX, parallel with C_{i+1})
if (i + 2 < n_chunk_cnt) {
dma_queue_pop(ctx->dma[0]);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[(i + 2) % 2], vtcm_qweight, n_cols_p2, k, row_stride, weight_type);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[(i + 2) % 2], vtcm_qweight, n_cols_p2, k, row_stride, weight_type, n_k_tiles, n_k_tiles_div, dequant_worker_fn);
}
}
}
+1
View File
@@ -58,6 +58,7 @@ enum htp_op_code {
HTP_OP_MUL_MAT,
HTP_OP_MUL_MAT_ID,
HTP_OP_RMS_NORM,
HTP_OP_RMS_NORM_MUL,
HTP_OP_UNARY_SILU,
HTP_OP_UNARY_GELU,
HTP_OP_UNARY_SIGMOID,
+1
View File
@@ -537,6 +537,7 @@ static int execute_op(struct htp_ops_context * octx) {
case HTP_OP_NORM:
case HTP_OP_RMS_NORM:
case HTP_OP_RMS_NORM_MUL:
case HTP_OP_SCALE:
case HTP_OP_SQR:
case HTP_OP_SQRT:
+188 -6
View File
@@ -23,21 +23,26 @@ struct htp_unary_context {
// Precomputed values
const uint8_t * data_src0;
const uint8_t * data_src1; // weight/scale tensor for RMS_NORM_MUL
uint8_t * data_dst;
size_t src0_data_row_size; // actual data bytes per row
size_t src1_data_row_size;
size_t dst_data_row_size; // actual data bytes per row
size_t src0_row_size_aligned;
size_t src1_row_size_aligned;
size_t dst_row_size_aligned;
size_t src0_spad_half_size;
size_t src1_spad_half_size;
size_t dst_spad_half_size;
uint32_t block;
uint32_t src0_nrows;
uint32_t src0_nrows_per_thread;
uint32_t nc;
bool broadcast_weight;
};
// Convert flat row index to DDR byte offset using the tensor's actual strides.
@@ -158,6 +163,71 @@ static void hvx_fast_rms_norm_f32(const uint8_t * restrict src,
}
}
static void hvx_fast_rms_norm_mul_f32(const uint8_t * restrict src,
const uint8_t * restrict weight,
uint8_t * restrict dst,
const int num_elems,
float epsilon) {
const HVX_Vector * restrict v_src = (const HVX_Vector *) src;
const HVX_Vector * restrict v_weight = (const HVX_Vector *) weight;
HVX_Vector * restrict v_dst = (HVX_Vector *) dst;
const int nvec = num_elems / VLEN_FP32; // number of full vectors
const int nloe = num_elems % VLEN_FP32; // leftover elements
// Compute sum of squares for full vectors
HVX_Vector sum_v = Q6_V_vsplat_R(0x00000000);
HVX_Vector epsilon_v = hvx_vec_splat_f32(epsilon);
#pragma unroll(4)
for (int i = 0; i < nvec; i++) {
HVX_Vector v1 = v_src[i];
HVX_Vector v2 = Q6_Vqf32_vmpy_VsfVsf(v1, v1);
sum_v = Q6_Vqf32_vadd_Vqf32Vqf32(sum_v, v2);
}
// Handle tail elements using vectorized ops with masking
if (nloe > 0) {
HVX_VectorPred bmask = Q6_Q_vsetq_R(nloe * 4);
HVX_Vector v1 = Q6_V_vand_QV(bmask, v_src[nvec]);
HVX_Vector v2 = Q6_Vqf32_vmpy_VsfVsf(v1, v1);
sum_v = Q6_Vqf32_vadd_Vqf32Vqf32(sum_v, v2);
}
// Reduce HVX sum
sum_v = hvx_vec_reduce_sum_f32(Q6_Vsf_equals_Vqf32(sum_v));
HVX_Vector t_v = hvx_vec_splat_f32((float) num_elems);
HVX_Vector denom_v = hvx_vec_inverse_f32(t_v);
HVX_Vector mean_v = Q6_Vqf32_vmpy_VsfVsf(sum_v, denom_v);
HVX_Vector mean_epsilon_v = Q6_Vqf32_vadd_Vqf32Vsf(mean_v, epsilon_v);
// Scale and multiply
HVX_Vector scale_v = hvx_vec_rsqrt_f32(Q6_Vsf_equals_Vqf32(mean_epsilon_v));
#pragma unroll(4)
for (int i = 0; i < nvec; i++) {
HVX_Vector v1 = v_src[i];
HVX_Vector v2 = Q6_Vqf32_vmpy_VsfVsf(v1, scale_v);
HVX_Vector v3 = Q6_Vsf_equals_Vqf32(v2);
HVX_Vector result = Q6_Vqf32_vmpy_VsfVsf(v3, v_weight[i]);
v_dst[i] = Q6_Vsf_equals_Vqf32(result);
}
// Handle tail elements using vectorized ops with masking
if (nloe > 0) {
HVX_VectorPred bmask = Q6_Q_vsetq_R(nloe * 4);
HVX_Vector v1 = Q6_V_vand_QV(bmask, v_src[nvec]);
HVX_Vector v2 = Q6_Vqf32_vmpy_VsfVsf(v1, scale_v);
HVX_Vector v3 = Q6_Vsf_equals_Vqf32(v2);
HVX_Vector result = Q6_Vqf32_vmpy_VsfVsf(v3, v_weight[nvec]);
HVX_Vector res_v = Q6_Vsf_equals_Vqf32(result);
// Store with masking to avoid overwriting memory beyond the tensor
hvx_vec_store_a(&v_dst[nvec], nloe * 4, res_v);
}
}
static void hvx_fast_norm_f32(const uint8_t * restrict src,
uint8_t * restrict dst,
uint8_t * restrict pad,
@@ -269,6 +339,27 @@ static void rms_norm_f32(const float * restrict src,
}
}
static void rms_norm_mul_f32(const float * restrict src,
const float * restrict weight,
float * restrict dst,
const uint32_t num_rows,
const uint32_t row_elems,
const size_t row_size,
const size_t weight_row_size,
int32_t * op_params,
bool broadcast_weight) {
float epsilon = 0.f;
memcpy(&epsilon, op_params, sizeof(float));
for (uint32_t ir = 0; ir < num_rows; ir++) {
const uint8_t * restrict src_local = (const uint8_t *)src + (ir * row_size);
const uint8_t * restrict w_local = (const uint8_t *)weight + (broadcast_weight ? 0 : ir * weight_row_size);
uint8_t * restrict dst_local = (uint8_t *)dst + (ir * row_size);
hvx_fast_rms_norm_mul_f32(src_local, w_local, dst_local, row_elems, epsilon);
}
}
static void norm_f32(const float * restrict src,
float * restrict dst,
uint8_t * restrict spad,
@@ -598,12 +689,15 @@ static void unary_job_f32_per_thread(unsigned int nth, unsigned int ith, void *
t1 = HAP_perf_get_qtimer_count();
const uint8_t * restrict data_src = uctx->data_src0;
const uint8_t * restrict data_src1 = uctx->data_src1;
uint8_t * restrict data_dst = uctx->data_dst;
uint8_t * src0_spad_data = octx->src0_spad.data + (ith * octx->src0_spad.size_per_thread);
uint8_t * src1_spad_data = octx->src1_spad.data + (ith * octx->src1_spad.size_per_thread);
uint8_t * dst_spad_data = octx->dst_spad.data + (ith * octx->dst_spad.size_per_thread);
size_t src0_spad_half_size = uctx->src0_spad_half_size;
size_t src1_spad_half_size = uctx->src1_spad_half_size;
size_t dst_spad_half_size = uctx->dst_spad_half_size;
// Non-contiguous tensors have gaps at dim-2/3 boundaries that a single-stride
@@ -624,6 +718,12 @@ static void unary_job_f32_per_thread(unsigned int nth, unsigned int ith, void *
dma_queue * dma_queue = octx->ctx->dma[ith];
// If weight is broadcasted, load it once per thread at the beginning of execution
if (htp_op == HTP_OP_RMS_NORM_MUL && uctx->broadcast_weight) {
dma_queue_push(dma_queue, dma_make_ptr(src1_spad_data, data_src1), uctx->src1_row_size_aligned, 0, uctx->src1_data_row_size, 1);
dma_queue_flush(dma_queue);
}
for (uint32_t ir = src0_start_row, spad_idx = 0; ir < src0_end_row && spad_idx < 2; spad_idx++) {
const uint32_t block_size = unary_block_size(ir, src0_end_row, BLOCK, src0_contig, dst_contig, ne01, ne1);
@@ -636,6 +736,14 @@ static void unary_job_f32_per_thread(unsigned int nth, unsigned int ith, void *
dma_queue_push(dma_queue,
dma_make_ptr(src0_spad_data + (spad_idx * src0_spad_half_size), data_src + src0_off),
src0_row_size_aligned, nb01, src0_data_row_size, block_size);
if (htp_op == HTP_OP_RMS_NORM_MUL && !uctx->broadcast_weight) {
const size_t src1_off = unary_row_offset(ir, ne01, ne02, nb01, nb02, nb03);
dma_queue_push(dma_queue,
dma_make_ptr(src1_spad_data + (spad_idx * src1_spad_half_size), data_src1 + src1_off),
uctx->src1_row_size_aligned, nb01, uctx->src1_data_row_size, block_size);
}
ir += block_size;
}
@@ -644,6 +752,10 @@ static void unary_job_f32_per_thread(unsigned int nth, unsigned int ith, void *
float * dst_spad = (float *) dma_queue_pop(dma_queue).src;
float * src0_spad = (float *) dma_queue_pop(dma_queue).dst;
float * src1_spad = NULL;
if (htp_op == HTP_OP_RMS_NORM_MUL && !uctx->broadcast_weight) {
src1_spad = (float *) dma_queue_pop(dma_queue).dst;
}
// Process block in VTCM
switch (htp_op) {
@@ -653,6 +765,12 @@ static void unary_job_f32_per_thread(unsigned int nth, unsigned int ith, void *
case HTP_OP_RMS_NORM:
rms_norm_f32(src0_spad, dst_spad, NULL, block_size, ne0, src0_row_size_aligned, op_params);
break;
case HTP_OP_RMS_NORM_MUL:
{
const float * w_ptr = uctx->broadcast_weight ? (const float *) src1_spad_data : src1_spad;
rms_norm_mul_f32(src0_spad, w_ptr, dst_spad, block_size, ne0, src0_row_size_aligned, uctx->src1_row_size_aligned, op_params, uctx->broadcast_weight);
}
break;
case HTP_OP_SCALE:
scale_f32(src0_spad, dst_spad, NULL, block_size, ne0, src0_row_size_aligned, op_params);
break;
@@ -700,9 +818,16 @@ static void unary_job_f32_per_thread(unsigned int nth, unsigned int ith, void *
if (pref_ir < src0_end_row) {
const uint32_t pref_block_size = unary_block_size(pref_ir, src0_end_row, BLOCK, src0_contig, dst_contig, ne01, ne1);
const size_t src0_pref_off = unary_row_offset(pref_ir, ne01, ne02, nb01, nb02, nb03);
dma_queue_push(dma_queue,
dma_make_ptr(src0_spad, data_src + src0_pref_off),
src0_row_size_aligned, nb01, src0_data_row_size, pref_block_size);
dma_queue_push(dma_queue,
dma_make_ptr(src0_spad, data_src + src0_pref_off),
src0_row_size_aligned, nb01, src0_data_row_size, pref_block_size);
if (htp_op == HTP_OP_RMS_NORM_MUL && !uctx->broadcast_weight) {
const size_t src1_pref_off = unary_row_offset(pref_ir, ne01, ne02, nb01, nb02, nb03);
dma_queue_push(dma_queue,
dma_make_ptr(src1_spad, data_src1 + src1_pref_off),
uctx->src1_row_size_aligned, nb01, uctx->src1_data_row_size, pref_block_size);
}
}
}
ir += block_size;
@@ -732,6 +857,9 @@ static int execute_op_unary_f32(struct htp_ops_context * octx) {
case HTP_OP_RMS_NORM:
op_type = "rmsnorm-f32";
break;
case HTP_OP_RMS_NORM_MUL:
op_type = "rmsnorm-mul-f32";
break;
case HTP_OP_SCALE:
op_type = "scale-f32";
break;
@@ -777,12 +905,44 @@ static int execute_op_unary_f32(struct htp_ops_context * octx) {
const size_t src0_row_size_aligned = hex_round_up(src0_data_row_size, VLEN);
const size_t dst_row_size_aligned = hex_round_up(dst_data_row_size, VLEN);
size_t src1_data_row_size = 0;
size_t src1_row_size_aligned = 0;
bool broadcast_weight = false;
const struct htp_tensor * src1 = NULL;
if (octx->op == HTP_OP_RMS_NORM_MUL) {
src1 = octx->src[1];
src1_data_row_size = src1->ne[0] * sizeof(float);
src1_row_size_aligned = hex_round_up(src1_data_row_size, VLEN);
broadcast_weight = (src1->ne[1] * src1->ne[2] * src1->ne[3] == 1);
}
// VTCM scratchpads for all tensors
// N rows per thread, padded to HVX vector size
// Double buffering requires 2x size per buffer
size_t spad_size_per_row = 2 * (src0_row_size_aligned + dst_row_size_aligned);
size_t vtcm_row_per_thread = (octx->ctx->vtcm_size)/ (n_threads * spad_size_per_row);
size_t spad_size_per_row = 0;
size_t vtcm_row_per_thread = 0;
if (octx->op == HTP_OP_RMS_NORM_MUL) {
if (broadcast_weight) {
size_t available_vtcm = octx->ctx->vtcm_size;
size_t src1_spad_total = n_threads * src1_row_size_aligned;
if (available_vtcm > src1_spad_total) {
available_vtcm -= src1_spad_total;
} else {
available_vtcm = 0;
}
spad_size_per_row = 2 * (src0_row_size_aligned + dst_row_size_aligned);
vtcm_row_per_thread = available_vtcm / (n_threads * spad_size_per_row);
} else {
spad_size_per_row = 2 * (src0_row_size_aligned + dst_row_size_aligned + src1_row_size_aligned);
vtcm_row_per_thread = (octx->ctx->vtcm_size) / (n_threads * spad_size_per_row);
}
} else {
spad_size_per_row = 2 * (src0_row_size_aligned + dst_row_size_aligned);
vtcm_row_per_thread = (octx->ctx->vtcm_size)/ (n_threads * spad_size_per_row);
}
// Make sure the reserved vtcm size is sufficient
if (vtcm_row_per_thread == 0) {
@@ -797,8 +957,25 @@ static int execute_op_unary_f32(struct htp_ops_context * octx) {
octx->src0_spad.size = n_threads * octx->src0_spad.size_per_thread;
octx->dst_spad.size = n_threads * octx->dst_spad.size_per_thread;
if (octx->op == HTP_OP_RMS_NORM_MUL) {
if (broadcast_weight) {
octx->src1_spad.size_per_thread = src1_row_size_aligned;
} else {
octx->src1_spad.size_per_thread = src1_row_size_aligned * vtcm_row_per_thread * 2;
}
octx->src1_spad.size = n_threads * octx->src1_spad.size_per_thread;
} else {
octx->src1_spad.size = 0;
octx->src1_spad.size_per_thread = 0;
}
octx->src0_spad.data = octx->ctx->vtcm_base;
octx->dst_spad.data = octx->src0_spad.data + octx->src0_spad.size;
if (octx->op == HTP_OP_RMS_NORM_MUL) {
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->dst_spad.data = octx->src1_spad.data + octx->src1_spad.size;
} else {
octx->dst_spad.data = octx->src0_spad.data + octx->src0_spad.size;
}
FARF(HIGH, "%s: (%ux%ux%ux%u) -> (%ux%ux%ux%u) : src0-spad-size %u src1-spad-size %u dst-spad-size %u\n", op_type,
src0->ne[0], src0->ne[1], src0->ne[2], src0->ne[3], dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3],
@@ -811,19 +988,24 @@ static int execute_op_unary_f32(struct htp_ops_context * octx) {
.src0_nrows = src0_nrows,
.data_src0 = (const uint8_t *)src0->data,
.data_src1 = (octx->op == HTP_OP_RMS_NORM_MUL) ? (const uint8_t *)src1->data : NULL,
.data_dst = (uint8_t *)dst->data,
.src0_data_row_size = src0_data_row_size,
.src1_data_row_size = src1_data_row_size,
.dst_data_row_size = dst_data_row_size,
.src0_row_size_aligned = src0_row_size_aligned,
.src1_row_size_aligned = src1_row_size_aligned,
.dst_row_size_aligned = dst_row_size_aligned,
.src0_spad_half_size = octx->src0_spad.size_per_thread / 2,
.src1_spad_half_size = (octx->op == HTP_OP_RMS_NORM_MUL) ? (octx->src1_spad.size_per_thread / (broadcast_weight ? 1 : 2)) : 0,
.dst_spad_half_size = octx->dst_spad.size_per_thread / 2,
.block = (octx->src0_spad.size_per_thread / 2) / src0_row_size_aligned,
.nc = src0->ne[0],
.broadcast_weight = broadcast_weight,
};
worker_pool_run_func(octx->ctx->worker_pool, unary_job_f32_per_thread, &uctx, n_threads);
-153
View File
@@ -1,153 +0,0 @@
#ifndef OP_DESC_H
#define OP_DESC_H
#define GGML_COMMON_IMPL_CPP
#include "ggml-backend-impl.h"
#include "ggml-common.h"
#include <string>
#include <stdio.h>
struct op_desc {
char strides[64 * GGML_MAX_SRC];
char dims[64 * GGML_MAX_SRC];
char types[16 * GGML_MAX_SRC];
char buffs[64 * GGML_MAX_SRC];
char names[64 * GGML_MAX_SRC];
int format_tensor_dims(char * str, const struct ggml_tensor * t) {
if (t->ne[2] == 1 && t->ne[3] == 1) {
return sprintf(str, "%d:%d", (int) t->ne[0], (int) t->ne[1]);
} else {
return sprintf(str, "%d:%d:%d:%d", (int) t->ne[0], (int) t->ne[1], (int) t->ne[2], (int) t->ne[3]);
}
}
void format_op_dims(char * str, const struct ggml_tensor * t) {
char * p = str;
// append src0 and src1 (if any)
if (t->src[0]) {
p += format_tensor_dims(p, t->src[0]);
for (int i = 1; i < GGML_MAX_SRC && t->src[i]; i++) {
p += sprintf(p, " x ");
p += format_tensor_dims(p, t->src[i]);
}
p += sprintf(p, " -> ");
}
// format self dims separately for better visual alignment
char self[64];
format_tensor_dims(self, t);
p += sprintf(p, "%s", self);
}
int format_tensor_strides(char * str, const struct ggml_tensor * t) {
const char * c = ggml_is_contiguous(t) ? "" : "!";
if (t->ne[2] == 1 && t->ne[3] == 1) {
return sprintf(str, "%zu:%zu%s", (size_t) t->nb[0], (size_t) t->nb[1], c);
} else {
return sprintf(str, "%zu:%zu:%zu:%zu%s", (size_t) t->nb[0], (size_t) t->nb[1], (size_t) t->nb[2], (size_t) t->nb[3], c);
}
}
void format_op_strides(char * str, const struct ggml_tensor * t) {
char * p = str;
// append src0 and src1 (if any)
if (t->src[0]) {
p += format_tensor_strides(p, t->src[0]);
for (int i = 1; i < GGML_MAX_SRC && t->src[i]; i++) {
p += sprintf(p, " x ");
p += format_tensor_strides(p, t->src[i]);
}
p += sprintf(p, " -> ");
}
// format self dims separately for better visual alignment
char self[64];
format_tensor_strides(self, t);
p += sprintf(p, "%s", self);
}
void format_op_types(char * str, const struct ggml_tensor * t) {
char * p = str;
// append src0 and src1 (if any)
if (t->src[0]) {
p += sprintf(p, "%s", ggml_type_name(t->src[0]->type));
for (int i = 1; i < GGML_MAX_SRC && t->src[i]; i++) {
p += sprintf(p, " x ");
p += sprintf(p, "%s", ggml_type_name(t->src[i]->type));
}
p += sprintf(p, " -> ");
}
p += sprintf(p, "%s", ggml_type_name(t->type));
}
const char * tensor_buff_name(const struct ggml_tensor * t) {
if (t->buffer) {
return ggml_backend_buffer_name(t->buffer);
}
return "NONE";
}
void format_op_buffs(char * str, const struct ggml_tensor * t) {
char * p = str;
// append src0 and src1 (if any)
if (t->src[0]) {
p += sprintf(p, "%s", tensor_buff_name(t->src[0]));
for (int i = 1; i < GGML_MAX_SRC && t->src[i]; i++) {
p += sprintf(p, " x ");
p += sprintf(p, "%s", tensor_buff_name(t->src[i]));
}
p += sprintf(p, " -> ");
}
p += sprintf(p, "%s", tensor_buff_name(t));
}
void format_op_names(char * str, const struct ggml_tensor * t) {
char * p = str;
// append src0 and src1 (if any)
if (t->src[0]) {
p += sprintf(p, "%s", t->src[0]->name);
for (int i = 1; i < GGML_MAX_SRC && t->src[i]; i++) {
p += sprintf(p, " x ");
p += sprintf(p, "%s", t->src[i]->name);
}
p += sprintf(p, " -> ");
}
p += sprintf(p, "%s", t->name);
}
void format(const ggml_tensor * op) {
format_op_dims(dims, op);
format_op_strides(strides, op);
format_op_types(types, op);
format_op_buffs(buffs, op);
format_op_names(names, op);
}
op_desc() {}
op_desc(const ggml_tensor * op) { format(op); }
};
#endif // OP_DESC_H
+7 -1
View File
@@ -1732,6 +1732,8 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_rope(ggml_metal_
ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_im2col(ggml_metal_library_t lib, const ggml_tensor * op) {
assert(op->op == GGML_OP_IM2COL);
GGML_TENSOR_LOCALS(int64_t, ne0, op->src[0], ne);
GGML_ASSERT(ggml_is_contiguous(op->src[1]));
GGML_ASSERT(op->src[1]->type == GGML_TYPE_F32);
GGML_ASSERT(op->type == GGML_TYPE_F16 || op->type == GGML_TYPE_F32);
@@ -1739,7 +1741,11 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_im2col(ggml_meta
char base[256];
char name[256];
snprintf(base, 256, "kernel_im2col_%s", ggml_type_name(op->type));
if (ne00*ne01 <= 1024) {
snprintf(base, 256, "kernel_im2col_%s", ggml_type_name(op->type));
} else {
snprintf(base, 256, "kernel_im2col_ext_%s", ggml_type_name(op->type));
}
snprintf(name, 256, "%s", base);
ggml_metal_pipeline_with_params res = ggml_metal_library_get_pipeline(lib, name);
+1 -1
View File
@@ -1107,7 +1107,7 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
case GGML_GLU_OP_SWIGLU_OAI:
case GGML_GLU_OP_GEGLU_ERF:
case GGML_GLU_OP_GEGLU_QUICK:
return ggml_is_contiguous_1(op->src[0]) && op->src[0]->type == GGML_TYPE_F32;
return ggml_is_contiguous_1(op->src[0]) && (op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16);
default:
return false;
}
+17 -7
View File
@@ -3635,16 +3635,26 @@ int ggml_metal_op_im2col(ggml_metal_op_t ctx, int idx) {
auto pipeline = ggml_metal_library_get_pipeline_im2col(lib, op);
GGML_ASSERT(KH*KW <= ggml_metal_pipeline_max_theads_per_threadgroup(pipeline));
if (KH*KW <= ggml_metal_pipeline_max_theads_per_threadgroup(pipeline)) {
const uint64_t ntptg0 = std::min(ggml_metal_pipeline_max_theads_per_threadgroup(pipeline)/(KH*KW), N);
const uint64_t ntptg0 = std::min(ggml_metal_pipeline_max_theads_per_threadgroup(pipeline)/(KH*KW), N);
ggml_metal_encoder_set_pipeline(enc, pipeline);
ggml_metal_encoder_set_bytes (enc, &args, sizeof(args), 0);
ggml_metal_encoder_set_buffer (enc, ggml_metal_get_buffer_id(op->src[1]), 1);
ggml_metal_encoder_set_buffer (enc, ggml_metal_get_buffer_id(op), 2);
ggml_metal_encoder_set_pipeline(enc, pipeline);
ggml_metal_encoder_set_bytes (enc, &args, sizeof(args), 0);
ggml_metal_encoder_set_buffer (enc, ggml_metal_get_buffer_id(op->src[1]), 1);
ggml_metal_encoder_set_buffer (enc, ggml_metal_get_buffer_id(op), 2);
ggml_metal_encoder_dispatch_threadgroups(enc, IC, OH, OW, ntptg0, KH, KW);
} else {
const uint64_t n_threads = std::min(ggml_metal_pipeline_max_theads_per_threadgroup(pipeline), N);
const int64_t quotient = N / n_threads + (N % n_threads > 0 ? 1 : 0);
ggml_metal_encoder_dispatch_threadgroups(enc, IC, OH, OW, ntptg0, KH, KW);
ggml_metal_encoder_set_pipeline(enc, pipeline);
ggml_metal_encoder_set_bytes (enc, &args, sizeof(args), 0);
ggml_metal_encoder_set_buffer (enc, ggml_metal_get_buffer_id(op->src[1]), 1);
ggml_metal_encoder_set_buffer (enc, ggml_metal_get_buffer_id(op), 2);
ggml_metal_encoder_dispatch_threadgroups(enc, quotient * CHW, OH, OW, n_threads, 1, 1);
}
return 1;
}
+119 -83
View File
@@ -1421,7 +1421,8 @@ template [[host_name("kernel_repeat_f16")]] kernel kernel_repeat_t kernel_repeat
template [[host_name("kernel_repeat_i32")]] kernel kernel_repeat_t kernel_repeat<int>;
template [[host_name("kernel_repeat_i16")]] kernel kernel_repeat_t kernel_repeat<short>;
kernel void kernel_reglu_f32(
template<typename T>
kernel void kernel_reglu(
constant ggml_metal_kargs_glu & args,
device const char * src0,
device const char * src1,
@@ -1429,19 +1430,25 @@ kernel void kernel_reglu_f32(
uint tgpig[[threadgroup_position_in_grid]],
uint tpitg[[thread_position_in_threadgroup]],
uint ntg[[threads_per_threadgroup]]) {
device const float * src0_row = (device const float *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const float * src1_row = (device const float *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device float * dst_row = (device float *) ((device char *) dst + tgpig*args.nb1);
device const T * src0_row = (device const T *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const T * src1_row = (device const T *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device T * dst_row = (device T *) ((device char *) dst + tgpig*args.nb1);
for (int i0 = tpitg; i0 < args.ne0; i0 += ntg) {
const float x0 = src0_row[i0];
const float x1 = src1_row[i0];
dst_row[i0] = x0*x1*(x0 > 0.0f);
dst_row[i0] = (T)(x0*x1*(x0 > 0.0f));
}
}
kernel void kernel_geglu_f32(
typedef decltype(kernel_reglu<float>) kernel_reglu_t;
template [[host_name("kernel_reglu_f32")]] kernel kernel_reglu_t kernel_reglu<float>;
template [[host_name("kernel_reglu_f16")]] kernel kernel_reglu_t kernel_reglu<half>;
template<typename T>
kernel void kernel_geglu(
constant ggml_metal_kargs_glu & args,
device const char * src0,
device const char * src1,
@@ -1449,9 +1456,9 @@ kernel void kernel_geglu_f32(
uint tgpig[[threadgroup_position_in_grid]],
uint tpitg[[thread_position_in_threadgroup]],
uint ntg[[threads_per_threadgroup]]) {
device const float * src0_row = (device const float *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const float * src1_row = (device const float *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device float * dst_row = (device float *) ((device char *) dst + tgpig*args.nb1);
device const T * src0_row = (device const T *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const T * src1_row = (device const T *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device T * dst_row = (device T *) ((device char *) dst + tgpig*args.nb1);
for (int i0 = tpitg; i0 < args.ne0; i0 += ntg) {
const float x0 = src0_row[i0];
@@ -1459,11 +1466,17 @@ kernel void kernel_geglu_f32(
const float gelu = 0.5f*x0*(1.0f + precise::tanh(SQRT_2_OVER_PI*x0*(1.0f + GELU_COEF_A*x0*x0)));
dst_row[i0] = gelu*x1;
dst_row[i0] = (T)(gelu*x1);
}
}
kernel void kernel_swiglu_f32(
typedef decltype(kernel_geglu<float>) kernel_geglu_t;
template [[host_name("kernel_geglu_f32")]] kernel kernel_geglu_t kernel_geglu<float>;
template [[host_name("kernel_geglu_f16")]] kernel kernel_geglu_t kernel_geglu<half>;
template<typename T>
kernel void kernel_swiglu(
constant ggml_metal_kargs_glu & args,
device const char * src0,
device const char * src1,
@@ -1471,9 +1484,9 @@ kernel void kernel_swiglu_f32(
uint tgpig[[threadgroup_position_in_grid]],
uint tpitg[[thread_position_in_threadgroup]],
uint ntg[[threads_per_threadgroup]]) {
device const float * src0_row = (device const float *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const float * src1_row = (device const float *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device float * dst_row = (device float *) ((device char *) dst + tgpig*args.nb1);
device const T * src0_row = (device const T *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const T * src1_row = (device const T *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device T * dst_row = (device T *) ((device char *) dst + tgpig*args.nb1);
for (int i0 = tpitg; i0 < args.ne0; i0 += ntg) {
const float x0 = src0_row[i0];
@@ -1481,11 +1494,17 @@ kernel void kernel_swiglu_f32(
const float silu = x0 / (1.0f + exp(-x0));
dst_row[i0] = silu*x1;
dst_row[i0] = (T)(silu*x1);
}
}
kernel void kernel_swiglu_oai_f32(
typedef decltype(kernel_swiglu<float>) kernel_swiglu_t;
template [[host_name("kernel_swiglu_f32")]] kernel kernel_swiglu_t kernel_swiglu<float>;
template [[host_name("kernel_swiglu_f16")]] kernel kernel_swiglu_t kernel_swiglu<half>;
template<typename T>
kernel void kernel_swiglu_oai(
constant ggml_metal_kargs_glu & args,
device const char * src0,
device const char * src1,
@@ -1493,9 +1512,9 @@ kernel void kernel_swiglu_oai_f32(
uint tgpig[[threadgroup_position_in_grid]],
uint tpitg[[thread_position_in_threadgroup]],
uint ntg[[threads_per_threadgroup]]) {
device const float * src0_row = (device const float *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const float * src1_row = (device const float *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device float * dst_row = (device float *) ((device char *) dst + tgpig*args.nb1);
device const T * src0_row = (device const T *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const T * src1_row = (device const T *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device T * dst_row = (device T *) ((device char *) dst + tgpig*args.nb1);
for (int i0 = tpitg; i0 < args.ne0; i0 += ntg) {
float x0 = src0_row[i0];
@@ -1507,11 +1526,17 @@ kernel void kernel_swiglu_oai_f32(
float out_glu = x0 / (1.0f + exp(-x0 * args.alpha));
out_glu = out_glu * (1.0f + x1);
dst_row[i0] = out_glu;
dst_row[i0] = (T)out_glu;
}
}
kernel void kernel_geglu_erf_f32(
typedef decltype(kernel_swiglu_oai<float>) kernel_swiglu_oai_t;
template [[host_name("kernel_swiglu_oai_f32")]] kernel kernel_swiglu_oai_t kernel_swiglu_oai<float>;
template [[host_name("kernel_swiglu_oai_f16")]] kernel kernel_swiglu_oai_t kernel_swiglu_oai<half>;
template<typename T>
kernel void kernel_geglu_erf(
constant ggml_metal_kargs_glu & args,
device const char * src0,
device const char * src1,
@@ -1519,9 +1544,9 @@ kernel void kernel_geglu_erf_f32(
uint tgpig[[threadgroup_position_in_grid]],
uint tpitg[[thread_position_in_threadgroup]],
uint ntg[[threads_per_threadgroup]]) {
device const float * src0_row = (device const float *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const float * src1_row = (device const float *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device float * dst_row = (device float *) ((device char *) dst + tgpig*args.nb1);
device const T * src0_row = (device const T *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const T * src1_row = (device const T *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device T * dst_row = (device T *) ((device char *) dst + tgpig*args.nb1);
for (int i0 = tpitg; i0 < args.ne0; i0 += ntg) {
const float x0 = src0_row[i0];
@@ -1529,11 +1554,17 @@ kernel void kernel_geglu_erf_f32(
const float gelu_erf = 0.5f*x0*(1.0f+erf_approx<float>(x0*SQRT_2_INV));
dst_row[i0] = gelu_erf*x1;
dst_row[i0] = (T)(gelu_erf*x1);
}
}
kernel void kernel_geglu_quick_f32(
typedef decltype(kernel_geglu_erf<float>) kernel_geglu_erf_t;
template [[host_name("kernel_geglu_erf_f32")]] kernel kernel_geglu_erf_t kernel_geglu_erf<float>;
template [[host_name("kernel_geglu_erf_f16")]] kernel kernel_geglu_erf_t kernel_geglu_erf<half>;
template<typename T>
kernel void kernel_geglu_quick(
constant ggml_metal_kargs_glu & args,
device const char * src0,
device const char * src1,
@@ -1541,9 +1572,9 @@ kernel void kernel_geglu_quick_f32(
uint tgpig[[threadgroup_position_in_grid]],
uint tpitg[[thread_position_in_threadgroup]],
uint ntg[[threads_per_threadgroup]]) {
device const float * src0_row = (device const float *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const float * src1_row = (device const float *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device float * dst_row = (device float *) ((device char *) dst + tgpig*args.nb1);
device const T * src0_row = (device const T *) ((device const char *) src0 + tgpig*args.nb01) + args.i00;
device const T * src1_row = (device const T *) ((device const char *) src1 + tgpig*args.nb11) + args.i10;
device T * dst_row = (device T *) ((device char *) dst + tgpig*args.nb1);
for (int i0 = tpitg; i0 < args.ne0; i0 += ntg) {
const float x0 = src0_row[i0];
@@ -1551,10 +1582,15 @@ kernel void kernel_geglu_quick_f32(
const float gelu_quick = x0*(1.0f/(1.0f+exp(GELU_QUICK_COEF*x0)));
dst_row[i0] = gelu_quick*x1;
dst_row[i0] = (T)(gelu_quick*x1);
}
}
typedef decltype(kernel_geglu_quick<float>) kernel_geglu_quick_t;
template [[host_name("kernel_geglu_quick_f32")]] kernel kernel_geglu_quick_t kernel_geglu_quick<float>;
template [[host_name("kernel_geglu_quick_f16")]] kernel kernel_geglu_quick_t kernel_geglu_quick<half>;
kernel void kernel_op_sum_f32(
constant ggml_metal_kargs_sum & args,
device const float * src0,
@@ -4696,59 +4732,59 @@ kernel void kernel_im2col(
template [[host_name("kernel_im2col_f32")]] kernel im2col_t kernel_im2col<float>;
template [[host_name("kernel_im2col_f16")]] kernel im2col_t kernel_im2col<half>;
// TODO: obsolete -- remove
//typedef void (im2col_ext_t)(
// constant ggml_metal_kargs_im2col & args,
// device const float * x,
// device char * dst,
// uint3 tgpig[[threadgroup_position_in_grid]],
// uint3 tgpg[[threadgroups_per_grid]],
// uint3 tpitg[[thread_position_in_threadgroup]],
// uint3 ntg[[threads_per_threadgroup]]);
//
//template <typename T>
//kernel void kernel_im2col_ext(
// constant ggml_metal_kargs_im2col & args,
// device const float * x,
// device char * dst,
// uint3 tgpig[[threadgroup_position_in_grid]],
// uint3 tgpg[[threadgroups_per_grid]], // tgpg[0] = D x IC x KH x KW, CHW = IC x KH x KW
// uint3 tpitg[[thread_position_in_threadgroup]],
// uint3 ntg[[threads_per_threadgroup]]) { // [M, 1, 1]
// const int64_t KHW = (int64_t)args.KHW;
//
// const int64_t d = tgpig[0] / args.CHW;
// const int64_t chw = tgpig[0] % args.CHW;
// const int64_t tgpig_0 = chw / KHW; // 0 ~ (IC - 1)
// const int64_t HW = tgpig[0] % KHW;
//
// const int64_t tpitg_0 = (d * ntg[0]) + tpitg[0];
// if (tpitg_0 >= args.N) {
// return;
// }
//
// const int64_t tpitg_1 = HW / args.KW;
// const int64_t tpitg_2 = HW % args.KW;
//
// const int64_t iiw = tgpig[2] * args.s0 + tpitg_2 * args.d0 - args.p0;
// const int64_t iih = tgpig[1] * args.s1 + tpitg_1 * args.d1 - args.p1;
//
// const int64_t offset_dst =
// (tpitg_0 * tgpg[1] * tgpg[2] + tgpig[1] * tgpg[2] + tgpig[2]) * args.CHW +
// (tgpig_0 * KHW + tpitg_1 * args.KW + tpitg_2);
//
// device T * pdst = (device T *) (dst);
//
// if (iih < 0 || iih >= args.IH || iiw < 0 || iiw >= args.IW) {
// pdst[offset_dst] = 0.0f;
// } else {
// const int64_t offset_src = tpitg_0 * args.ofs0 + tgpig_0 * args.ofs1;
// pdst[offset_dst] = x[offset_src + iih * args.IW + iiw];
// }
//}
//
//template [[host_name("kernel_im2col_ext_f32")]] kernel im2col_ext_t kernel_im2col_ext<float>;
//template [[host_name("kernel_im2col_ext_f16")]] kernel im2col_ext_t kernel_im2col_ext<half>;
// TODO: optimize
typedef void (im2col_ext_t)(
constant ggml_metal_kargs_im2col & args,
device const float * x,
device char * dst,
uint3 tgpig[[threadgroup_position_in_grid]],
uint3 tgpg[[threadgroups_per_grid]],
uint3 tpitg[[thread_position_in_threadgroup]],
uint3 ntg[[threads_per_threadgroup]]);
template <typename T>
kernel void kernel_im2col_ext(
constant ggml_metal_kargs_im2col & args,
device const float * x,
device char * dst,
uint3 tgpig[[threadgroup_position_in_grid]],
uint3 tgpg[[threadgroups_per_grid]], // tgpg[0] = D x IC x KH x KW, CHW = IC x KH x KW
uint3 tpitg[[thread_position_in_threadgroup]],
uint3 ntg[[threads_per_threadgroup]]) { // [M, 1, 1]
const int64_t KHW = (int64_t)args.KHW;
const int64_t d = tgpig[0] / args.CHW;
const int64_t chw = tgpig[0] % args.CHW;
const int64_t tgpig_0 = chw / KHW; // 0 ~ (IC - 1)
const int64_t HW = tgpig[0] % KHW;
const int64_t tpitg_0 = (d * ntg[0]) + tpitg[0];
if (tpitg_0 >= args.N) {
return;
}
const int64_t tpitg_1 = HW / args.KW;
const int64_t tpitg_2 = HW % args.KW;
const int64_t iiw = tgpig[2] * args.s0 + tpitg_2 * args.d0 - args.p0;
const int64_t iih = tgpig[1] * args.s1 + tpitg_1 * args.d1 - args.p1;
const int64_t offset_dst =
(tpitg_0 * tgpg[1] * tgpg[2] + tgpig[1] * tgpg[2] + tgpig[2]) * args.CHW +
(tgpig_0 * KHW + tpitg_1 * args.KW + tpitg_2);
device T * pdst = (device T *) (dst);
if (iih < 0 || iih >= args.IH || iiw < 0 || iiw >= args.IW) {
pdst[offset_dst] = 0.0f;
} else {
const int64_t offset_src = tpitg_0 * args.ofs0 + tgpig_0 * args.ofs1;
pdst[offset_dst] = x[offset_src + iih * args.IW + iiw];
}
}
template [[host_name("kernel_im2col_ext_f32")]] kernel im2col_ext_t kernel_im2col_ext<float>;
template [[host_name("kernel_im2col_ext_f16")]] kernel im2col_ext_t kernel_im2col_ext<half>;
template <typename TK>
kernel void kernel_conv_2d(
+6
View File
@@ -87,6 +87,10 @@ set(GGML_OPENCL_KERNELS
mul_mv_q4_1_f32_flat
mul_mv_q4_k_f32
mul_mv_q4_k_f32_flat
mul_mv_q5_0_f32
mul_mv_q5_0_f32_flat
mul_mv_q5_1_f32
mul_mv_q5_1_f32_flat
mul_mv_q5_k_f32
mul_mv_q5_k_f32_flat
mul_mv_q6_k_f32
@@ -126,6 +130,8 @@ set(GGML_OPENCL_KERNELS
mul_mm_f16_f32_l4_lm
mul_mm_q4_0_f32_l4_lm
mul_mm_q4_1_f32_l4_lm
mul_mm_q5_0_f32_l4_lm
mul_mm_q5_1_f32_l4_lm
mul_mm_q8_0_f32_l4_lm
mul_mm_iq4_nl_f32_l4_lm
mul_mm_q4_k_f32_l4_lm
+582 -76
View File
@@ -379,6 +379,8 @@ struct ggml_backend_opencl_device_context {
GPU_FAMILY gpu_family = GPU_FAMILY::UNKNOWN;
ADRENO_GPU_GEN adreno_gen = ADRENO_GPU_GEN::ADRENO_UNKNOWN;
std::regex *opfilter = nullptr; // regex of ops to not claim
std::string opfilter_str; // regex string for opfilter
size_t global_mem_size = 0;
};
@@ -415,8 +417,6 @@ struct ggml_backend_opencl_context {
bool has_qcom_subgroup_shuffle = false; // cl_qcom_subgroup_shuffle
bool disable_fusion;
std::regex *opfilter = nullptr; // regex of ops to not claim
bool adreno_has_large_buffer;
bool adreno_use_large_buffer;
ggml_cl_compiler_version adreno_cl_compiler_version;
@@ -428,6 +428,8 @@ struct ggml_backend_opencl_context {
size_t image2d_max_width;
size_t image2d_max_height;
cl_device_svm_capabilities svm_caps;
cl_context context;
cl_command_queue queue;
@@ -574,7 +576,9 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_convert_block_q4_0_trans4_ns, kernel_restore_block_q4_0_trans4_ns;
cl_kernel kernel_convert_block_q4_1, kernel_restore_block_q4_1;
cl_kernel kernel_convert_block_q4_1_trans4_ns, kernel_restore_block_q4_1_trans4_ns;
cl_kernel kernel_convert_block_q5_0, kernel_restore_block_q5_0;
cl_kernel kernel_convert_block_q5_0_trans4_ns, kernel_restore_block_q5_0_trans4_ns;
cl_kernel kernel_convert_block_q5_1, kernel_restore_block_q5_1;
cl_kernel kernel_convert_block_q5_1_trans4_ns, kernel_restore_block_q5_1_trans4_ns;
cl_kernel kernel_convert_block_q4_k_trans4_ns, kernel_restore_block_q4_k_trans4_ns;
cl_kernel kernel_convert_block_q5_k_trans4_ns, kernel_restore_block_q5_k_trans4_ns;
@@ -583,6 +587,7 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_convert_block_mxfp4_trans4_ns, kernel_restore_block_mxfp4_trans4_ns;
cl_kernel kernel_convert_block_q8_0, kernel_restore_block_q8_0, kernel_restore_block_q8_0_trans;
cl_kernel kernel_convert_block_q6_K_noshuffle, kernel_restore_block_q6_K_noshuffle;
cl_kernel kernel_convert_bf16_to_f16, kernel_convert_f16_to_bf16;
cl_kernel kernel_mul_mat_q4_0_f32_8x_flat;
cl_kernel kernel_convert_block_q4_0_noshuffle;
cl_kernel kernel_restore_block_q4_0_noshuffle;
@@ -601,6 +606,10 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_mul_mat_q4_0_f32_1d_8x_flat, kernel_mul_mat_q4_0_f32_1d_16x_flat;
cl_kernel kernel_mul_mv_q4_1_f32;
cl_kernel kernel_mul_mv_q4_1_f32_flat;
cl_kernel kernel_mul_mv_q5_0_f32;
cl_kernel kernel_mul_mv_q5_0_f32_flat;
cl_kernel kernel_mul_mv_q5_1_f32;
cl_kernel kernel_mul_mv_q5_1_f32_flat;
cl_kernel kernel_mul_mv_q4_K_f32;
cl_kernel kernel_mul_mv_q4_K_f32_flat;
cl_kernel kernel_mul_mv_q5_K_f32;
@@ -659,6 +668,8 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_mul_mm_f16_f32_l4_lm;
cl_kernel kernel_mul_mm_q4_0_f32_l4_lm;
cl_kernel kernel_mul_mm_q4_1_f32_l4_lm;
cl_kernel kernel_mul_mm_q5_0_f32_l4_lm;
cl_kernel kernel_mul_mm_q5_1_f32_l4_lm;
cl_kernel kernel_mul_mm_q8_0_f32_l4_lm;
cl_kernel kernel_mul_mm_q4_k_f32_l4_lm;
cl_kernel kernel_mul_mm_q5_k_f32_l4_lm;
@@ -1138,8 +1149,12 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
CL_CHECK((backend_ctx->kernel_restore_block_q4_1 = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_q4_1", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_q4_1_trans4_ns = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_q4_1_trans4_ns", &err), err));
CL_CHECK((backend_ctx->kernel_restore_block_q4_1_trans4_ns = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_q4_1_trans4_ns", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_q5_0 = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_q5_0", &err), err));
CL_CHECK((backend_ctx->kernel_restore_block_q5_0 = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_q5_0", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_q5_0_trans4_ns = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_q5_0_trans4_ns", &err), err));
CL_CHECK((backend_ctx->kernel_restore_block_q5_0_trans4_ns = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_q5_0_trans4_ns", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_q5_1 = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_q5_1", &err), err));
CL_CHECK((backend_ctx->kernel_restore_block_q5_1 = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_q5_1", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_q5_1_trans4_ns = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_q5_1_trans4_ns", &err), err));
CL_CHECK((backend_ctx->kernel_restore_block_q5_1_trans4_ns = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_q5_1_trans4_ns", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_q4_k_trans4_ns = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_q4_k_trans4_ns", &err), err));
@@ -1173,6 +1188,8 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
CL_CHECK((backend_ctx->kernel_restore_block_iq4_nl = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_iq4_nl", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_iq4_nl_noshuffle = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_iq4_nl_noshuffle", &err), err));
CL_CHECK((backend_ctx->kernel_restore_block_iq4_nl_noshuffle = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_iq4_nl_noshuffle", &err), err));
CL_CHECK((backend_ctx->kernel_convert_bf16_to_f16 = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_bf16_to_f16", &err), err));
CL_CHECK((backend_ctx->kernel_convert_f16_to_bf16 = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_f16_to_bf16", &err), err));
GGML_LOG_CONT(".");
}
@@ -1480,6 +1497,74 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
GGML_LOG_CONT(".");
}
// mul_mv_q5_0_f32
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "mul_mv_q5_0_f32.cl.h"
};
#else
const std::string kernel_src = read_file("mul_mv_q5_0_f32.cl");
#endif
cl_program prog =
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_mul_mv_q5_0_f32 = clCreateKernel(prog, "kernel_mul_mv_q5_0_f32", &err), err));
CL_CHECK(clReleaseProgram(prog));
GGML_LOG_CONT(".");
}
// mul_mv_q5_0_f32_flat
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "mul_mv_q5_0_f32_flat.cl.h"
};
#else
const std::string kernel_src = read_file("mul_mv_q5_0_f32_flat.cl");
#endif
cl_program prog =
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_mul_mv_q5_0_f32_flat = clCreateKernel(prog, "kernel_mul_mv_q5_0_f32_flat", &err), err));
CL_CHECK(clReleaseProgram(prog));
GGML_LOG_CONT(".");
}
// mul_mv_q5_1_f32
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "mul_mv_q5_1_f32.cl.h"
};
#else
const std::string kernel_src = read_file("mul_mv_q5_1_f32.cl");
#endif
cl_program prog =
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_mul_mv_q5_1_f32 = clCreateKernel(prog, "kernel_mul_mv_q5_1_f32", &err), err));
CL_CHECK(clReleaseProgram(prog));
GGML_LOG_CONT(".");
}
// mul_mv_q5_1_f32_flat
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "mul_mv_q5_1_f32_flat.cl.h"
};
#else
const std::string kernel_src = read_file("mul_mv_q5_1_f32_flat.cl");
#endif
cl_program prog =
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_mul_mv_q5_1_f32_flat = clCreateKernel(prog, "kernel_mul_mv_q5_1_f32_flat", &err), err));
CL_CHECK(clReleaseProgram(prog));
GGML_LOG_CONT(".");
}
// mul_mv_q5_k_f32
{
#ifdef GGML_OPENCL_EMBED_KERNELS
@@ -1830,6 +1915,38 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
GGML_LOG_CONT(".");
}
// mul_mm_q5_0_f32_l4_lm
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "mul_mm_q5_0_f32_l4_lm.cl.h"
};
#else
const std::string kernel_src = read_file("mul_mm_q5_0_f32_l4_lm.cl");
#endif
cl_program prog =
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_mul_mm_q5_0_f32_l4_lm = clCreateKernel(prog, "kernel_mul_mm_q5_0_f32_l4_lm", &err), err));
GGML_LOG_CONT(".");
}
// mul_mm_q5_1_f32_l4_lm
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "mul_mm_q5_1_f32_l4_lm.cl.h"
};
#else
const std::string kernel_src = read_file("mul_mm_q5_1_f32_l4_lm.cl");
#endif
cl_program prog =
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_mul_mm_q5_1_f32_l4_lm = clCreateKernel(prog, "kernel_mul_mm_q5_1_f32_l4_lm", &err), err));
GGML_LOG_CONT(".");
}
// mul_mm_q8_0_f32_l4_lm
{
#ifdef GGML_OPENCL_EMBED_KERNELS
@@ -3731,6 +3848,68 @@ static std::vector<ggml_backend_device> ggml_opencl_probe_devices(ggml_backend_r
return found_devices;
}
static void ggml_opencl_print_backend_info(ggml_backend_opencl_device_context * dev_ctx) {
GGML_ASSERT(dev_ctx);
GGML_ASSERT(dev_ctx->backend_ctx);
auto * backend_ctx = dev_ctx->backend_ctx;
GGML_LOG_INFO("ggml_opencl: OpenCL driver: %s\n",
backend_ctx->driver_version.c_str());
GGML_LOG_INFO("ggml_opencl: vector subgroup broadcast support: %s\n",
backend_ctx->has_vector_subgroup_broadcast ? "true" : "false");
GGML_LOG_INFO("ggml_opencl: device FP16 support: %s\n",
backend_ctx->fp16_support ? "true" : "false");
GGML_LOG_INFO("ggml_opencl: mem base addr align: %u\n",
backend_ctx->alignment);
GGML_LOG_INFO("ggml_opencl: global mem size: %zu MB\n",
backend_ctx->global_mem_size/1024/1024);
GGML_LOG_INFO("ggml_opencl: max mem alloc size: %zu MB\n",
backend_ctx->max_alloc_size/1024/1024);
GGML_LOG_INFO("ggml_opencl: device max image buffer size (pixels): %lu\n",
backend_ctx->image_max_buffer_size);
GGML_LOG_INFO("ggml_opencl: device max image2d size: %lu x %lu\n",
backend_ctx->image2d_max_width, backend_ctx->image2d_max_height);
GGML_LOG_INFO("ggml_opencl: device max workgroup size: %lu\n",
backend_ctx->max_workgroup_size);
GGML_LOG_INFO("ggml_opencl: SVM coarse grain buffer support: %s\n",
backend_ctx->svm_caps & CL_DEVICE_SVM_COARSE_GRAIN_BUFFER ? "true" : "false");
GGML_LOG_INFO("ggml_opencl: SVM fine grain buffer support: %s\n",
backend_ctx->svm_caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER ? "true" : "false");
GGML_LOG_INFO("ggml_opencl: SVM fine grain system support: %s\n",
backend_ctx->svm_caps & CL_DEVICE_SVM_FINE_GRAIN_SYSTEM ? "true" : "false");
GGML_LOG_INFO("ggml_opencl: SVM atomics support: %s\n",
backend_ctx->svm_caps & CL_DEVICE_SVM_ATOMICS ? "true" : "false");
GGML_LOG_INFO("ggml_opencl: cl_qcom_subgroup_shuffle support: %s\n",
backend_ctx->has_qcom_subgroup_shuffle ? "true" : "false");
// Print out configurations
#ifdef GGML_OPENCL_SOA_Q
GGML_LOG_INFO("ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)\n");
#endif // GGML_OPENCL_SOA_Q
#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
GGML_LOG_INFO("ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)\n");
if (backend_ctx->adreno_xmem_gemm_enabled) {
GGML_LOG_INFO("ggml_opencl: Adreno xmem F16xF32 GEMM enabled (temporary weight prepack)\n");
}
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
if (backend_ctx->adreno_use_large_buffer) {
if (!backend_ctx->adreno_has_large_buffer) {
GGML_LOG_INFO("ggml_opencl: Adreno large buffer requested but not supported by driver, will use regular buffer\n");
backend_ctx->adreno_use_large_buffer = false;
} else {
GGML_LOG_INFO("ggml_opencl: Adreno large buffer enabled\n");
}
}
if (dev_ctx->opfilter) {
// for information only, the actual regex object is created in ggml_opencl_is_device_supported
GGML_LOG_INFO("ggml_opencl: opfilter regex = \"%s\"\n", dev_ctx->opfilter_str.c_str());
}
}
// check if device should be accepted
static bool ggml_opencl_is_device_supported(ggml_backend_dev_t dev) {
GGML_ASSERT(dev);
@@ -3799,6 +3978,13 @@ static bool ggml_opencl_is_device_supported(ggml_backend_dev_t dev) {
}
clGetDeviceInfo(dev_ctx->device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(size_t), &dev_ctx->global_mem_size, NULL);
const char * str_opfilter = getenv("GGML_OPENCL_OPFILTER");
if (str_opfilter) {
dev_ctx->opfilter_str = str_opfilter;
dev_ctx->opfilter = new std::regex(str_opfilter, std::regex_constants::icase);
}
return true;
}
@@ -3850,15 +4036,12 @@ static ggml_backend_opencl_context * ggml_cl_init(ggml_backend_dev_t dev) {
char *driver_version = (char *)alloca(driver_version_str_size + 1);
clGetDeviceInfo(device, CL_DRIVER_VERSION, driver_version_str_size, driver_version, NULL);
driver_version[driver_version_str_size] = '\0';
GGML_LOG_INFO("ggml_opencl: OpenCL driver: %s\n", driver_version);
backend_ctx->driver_version = driver_version;
backend_ctx->adreno_cl_compiler_version = get_adreno_cl_compiler_version(driver_version);
backend_ctx->has_vector_subgroup_broadcast =
(backend_ctx->adreno_cl_compiler_version.type == E031 && backend_ctx->adreno_cl_compiler_version.major >= 47) ||
(backend_ctx->adreno_cl_compiler_version.type == DX && backend_ctx->adreno_cl_compiler_version.major >= 17);
GGML_LOG_INFO("ggml_opencl: vector subgroup broadcast support: %s\n",
backend_ctx->has_vector_subgroup_broadcast ? "true" : "false");
size_t ext_str_size;
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, NULL, &ext_str_size);
@@ -3867,18 +4050,12 @@ static ggml_backend_opencl_context * ggml_cl_init(ggml_backend_dev_t dev) {
ext_buffer[ext_str_size] = '\0'; // ensure it is null terminated
// check support for qcom_subgroup_shuffle
if (opencl_c_version.major == 3 && strstr(ext_buffer, "cl_khr_subgroups") != NULL) {
GGML_LOG_INFO("ggml_opencl: cl_khr_subgroups support: true\n");
if (strstr(ext_buffer, "cl_qcom_subgroup_shuffle") != NULL) {
backend_ctx->has_qcom_subgroup_shuffle = true;
}
if (strstr(ext_buffer, "cl_qcom_subgroup_shuffle") != NULL) {
backend_ctx->has_qcom_subgroup_shuffle = true;
}
GGML_LOG_INFO("ggml_opencl: cl_qcom_subgroup_shuffle support: %s\n",
backend_ctx->has_qcom_subgroup_shuffle ? "true" : "false");
// Check if ext_buffer contains cl_khr_fp16
backend_ctx->fp16_support = strstr(ext_buffer, "cl_khr_fp16") != NULL;
GGML_LOG_INFO("ggml_opencl: device FP16 support: %s\n", backend_ctx->fp16_support ? "true" : "false");
// check Adreno large buffer support
backend_ctx->adreno_has_large_buffer = strstr(ext_buffer, "cl_qcom_large_buffer") != NULL;
@@ -3887,35 +4064,15 @@ static ggml_backend_opencl_context * ggml_cl_init(ggml_backend_dev_t dev) {
CL_CHECK(clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN, sizeof(cl_uint), &base_align_in_bits, NULL));
GGML_ASSERT(base_align_in_bits % 8u == 0);
backend_ctx->alignment = base_align_in_bits / 8u;
GGML_LOG_INFO("ggml_opencl: mem base addr align: %u\n", backend_ctx->alignment);
backend_ctx->global_mem_size = dev_ctx->global_mem_size;
GGML_LOG_INFO("ggml_opencl: global mem size: %zu MB\n", backend_ctx->global_mem_size/1024/1024);
clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(size_t), &backend_ctx->max_alloc_size, NULL);
GGML_LOG_INFO("ggml_opencl: max mem alloc size: %zu MB\n", backend_ctx->max_alloc_size/1024/1024);
clGetDeviceInfo(device, CL_DEVICE_IMAGE_MAX_BUFFER_SIZE, sizeof(size_t), &backend_ctx->image_max_buffer_size, NULL);
GGML_LOG_INFO("ggml_opencl: device max image buffer size (pixels): %lu\n", backend_ctx->image_max_buffer_size);
clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_WIDTH, sizeof(size_t), &backend_ctx->image2d_max_width, NULL);
clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_HEIGHT, sizeof(size_t), &backend_ctx->image2d_max_height, NULL);
GGML_LOG_INFO("ggml_opencl: device max image2d size: %lu x %lu\n", backend_ctx->image2d_max_width, backend_ctx->image2d_max_height);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &backend_ctx->max_workgroup_size, NULL);
GGML_LOG_INFO("ggml_opencl: device max workgroup size: %lu\n", backend_ctx->max_workgroup_size);
// Check SVM.
cl_device_svm_capabilities svm_caps;
CL_CHECK(clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES, sizeof(cl_device_svm_capabilities), &svm_caps, 0));
GGML_LOG_INFO("ggml_opencl: SVM coarse grain buffer support: %s\n",
svm_caps & CL_DEVICE_SVM_COARSE_GRAIN_BUFFER ? "true" : "false");
GGML_LOG_INFO("ggml_opencl: SVM fine grain buffer support: %s\n",
svm_caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER ? "true" : "false");
GGML_LOG_INFO("ggml_opencl: SVM fine grain system support: %s\n",
svm_caps & CL_DEVICE_SVM_FINE_GRAIN_SYSTEM ? "true" : "false");
GGML_LOG_INFO("ggml_opencl: SVM atomics support: %s\n",
svm_caps & CL_DEVICE_SVM_ATOMICS ? "true" : "false");
CL_CHECK(clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(size_t), &backend_ctx->max_alloc_size, NULL));
CL_CHECK(clGetDeviceInfo(device, CL_DEVICE_IMAGE_MAX_BUFFER_SIZE, sizeof(size_t), &backend_ctx->image_max_buffer_size, NULL));
CL_CHECK(clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_WIDTH, sizeof(size_t), &backend_ctx->image2d_max_width, NULL));
CL_CHECK(clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_HEIGHT, sizeof(size_t), &backend_ctx->image2d_max_height, NULL));
CL_CHECK(clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &backend_ctx->max_workgroup_size, NULL));
CL_CHECK(clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES, sizeof(cl_device_svm_capabilities), &backend_ctx->svm_caps, 0));
if (opencl_c_version.major >= 3) {
// Assume it is not available for 3.0, since it is optional in 3.0.
@@ -3931,36 +4088,15 @@ static ggml_backend_opencl_context * ggml_cl_init(ggml_backend_dev_t dev) {
backend_ctx->non_uniform_workgroups = true;
}
// Print out configurations
#ifdef GGML_OPENCL_SOA_Q
GGML_LOG_INFO("ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)\n");
#endif // GGML_OPENCL_SOA_Q
#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
GGML_LOG_INFO("ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)\n");
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
// determine whether to use Adreno xmem GEMM
backend_ctx->adreno_xmem_gemm_enabled = getenv("GGML_OPENCL_ADRENO_XMEM_GEMM") != nullptr &&
backend_ctx->gpu_family == GPU_FAMILY::ADRENO;
if (getenv("GGML_OPENCL_ADRENO_XMEM_GEMM") != nullptr) {
GGML_LOG_INFO("ggml_opencl: Adreno xmem F16xF32 GEMM %s\n",
backend_ctx->adreno_xmem_gemm_enabled ?
"enabled (temporary weight prepack)" : "requested but unsupported by this driver");
}
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
#endif
// determine whether to use large buffer for Adreno
backend_ctx->adreno_use_large_buffer = getenv("GGML_OPENCL_ADRENO_USE_LARGE_BUFFER") != nullptr &&
backend_ctx->gpu_family == GPU_FAMILY::ADRENO;
if (backend_ctx->adreno_use_large_buffer) {
if (!backend_ctx->adreno_has_large_buffer) {
GGML_LOG_INFO("ggml_opencl: Adreno large buffer requested but not supported by driver, will use regular buffer\n");
backend_ctx->adreno_use_large_buffer = false;
} else {
GGML_LOG_INFO("ggml_opencl: Adreno large buffer enabled\n");
}
}
cl_int err;
@@ -4010,12 +4146,6 @@ static ggml_backend_opencl_context * ggml_cl_init(ggml_backend_dev_t dev) {
backend_ctx->disable_fusion = getenv("GGML_OPENCL_DISABLE_FUSION") != nullptr;
const char * str_opfilter = getenv("GGML_OPENCL_OPFILTER");
if (str_opfilter) {
backend_ctx->opfilter = new std::regex(str_opfilter, std::regex_constants::icase);
GGML_LOG_INFO("ggml_opencl: opfilter regex = \"%s\"\n", str_opfilter);
}
dev_ctx->backend_ctx = backend_ctx.release();
return dev_ctx->backend_ctx;
}
@@ -4825,7 +4955,7 @@ static bool ggml_opencl_supports_op(ggml_backend_dev_t dev, const struct ggml_te
ggml_backend_opencl_context * backend_ctx = dev_ctx->backend_ctx;
// reject ops that match the opfilter regex
if (backend_ctx->opfilter && std::regex_match(std::string(ggml_op_desc(op)), *backend_ctx->opfilter)) {
if (dev_ctx->opfilter && std::regex_match(std::string(ggml_op_desc(op)), *dev_ctx->opfilter)) {
return false;
}
@@ -5004,9 +5134,12 @@ static bool ggml_opencl_supports_op(ggml_backend_dev_t dev, const struct ggml_te
case GGML_OP_MUL_MAT:
if (op->src[0]->type == GGML_TYPE_F16) {
return true;
} else if (op->src[0]->type == GGML_TYPE_BF16) {
return true;
} else if (op->src[0]->type == GGML_TYPE_F32) {
return op->src[1]->type == GGML_TYPE_F32;
} else if (op->src[0]->type == GGML_TYPE_Q4_0 || op->src[0]->type == GGML_TYPE_Q4_1 ||
op->src[0]->type == GGML_TYPE_Q5_0 || op->src[0]->type == GGML_TYPE_Q5_1 ||
op->src[0]->type == GGML_TYPE_MXFP4 ||
op->src[0]->type == GGML_TYPE_IQ4_NL ||
op->src[0]->type == GGML_TYPE_Q4_K ||
@@ -5957,7 +6090,24 @@ static void ggml_backend_opencl_buffer_set_tensor(ggml_backend_buffer_t buffer,
return;
}
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
return;
cl_kernel kernel = backend_ctx->kernel_convert_block_q5_0;
cl_ulong n_blk = ggml_nelements(tensor)/ggml_blck_size(tensor->type);
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &data_device));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra->qs));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra->qh));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra->d));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_ulong), &n_blk));
size_t global_work_size[] = {(size_t)CEIL_DIV(n_blk, 64) * 64, 1, 1};
size_t local_work_size[] = {64, 1, 1};
cl_event evt;
CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, &evt));
CL_CHECK(clWaitForEvents(1, &evt));
CL_CHECK(clReleaseMemObject(data_device));
tensor->extra = extra;
return;
}
if (tensor->type == GGML_TYPE_Q5_1) {
ggml_tensor_extra_cl * extra_orig = (ggml_tensor_extra_cl *)tensor->extra;
@@ -6058,6 +6208,24 @@ static void ggml_backend_opencl_buffer_set_tensor(ggml_backend_buffer_t buffer,
return;
}
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
cl_kernel kernel = backend_ctx->kernel_convert_block_q5_1;
cl_ulong n_blk = ggml_nelements(tensor)/ggml_blck_size(tensor->type);
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &data_device));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra->qs));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra->qh));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra->d));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extra->m));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_ulong), &n_blk));
size_t global_work_size[] = {(size_t)CEIL_DIV(n_blk, 64) * 64, 1, 1};
size_t local_work_size[] = {64, 1, 1};
cl_event evt;
CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, &evt));
CL_CHECK(clWaitForEvents(1, &evt));
CL_CHECK(clReleaseMemObject(data_device));
tensor->extra = extra;
return;
}
if (tensor->type == GGML_TYPE_MXFP4) {
@@ -6813,6 +6981,40 @@ static void ggml_backend_opencl_buffer_set_tensor(ggml_backend_buffer_t buffer,
}
#endif // GGML_OPENCL_SOA_Q
// convert bf16 to f16 and store as f16 in device buffer
if (tensor->type == GGML_TYPE_BF16) {
GGML_ASSERT(offset % sizeof(ggml_fp16_t) == 0 && size % sizeof(ggml_fp16_t) == 0
&& "Offset and size must be multiples of 2 for bf16 tensors");
ggml_tensor_extra_cl * extra = (ggml_tensor_extra_cl *) tensor->extra;
GGML_ASSERT(extra);
cl_ulong n_elements = size / sizeof(ggml_fp16_t);
cl_ulong off_dst = (extra->offset + offset) / sizeof(ggml_fp16_t);
cl_int err;
cl_mem data_device = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
size, (void *) data, &err);
CL_CHECK(err);
cl_kernel kernel = backend_ctx->kernel_convert_bf16_to_f16;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &data_device));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra->data_device));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_ulong), &off_dst));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_ulong), &n_elements));
size_t global_work_size[] = { (size_t)CEIL_DIV(n_elements, 64)*64, 1, 1 };
size_t local_work_size[] = { 64, 1, 1 };
cl_event evt;
CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, &evt));
CL_CHECK(clWaitForEvents(1, &evt));
CL_CHECK(clReleaseMemObject(data_device));
CL_CHECK(clReleaseEvent(evt));
return;
}
ggml_tensor_extra_cl * extra = (ggml_tensor_extra_cl *) tensor->extra;
GGML_ASSERT(extra);
@@ -7081,8 +7283,29 @@ static void ggml_backend_opencl_buffer_get_tensor(ggml_backend_buffer_t buffer,
return;
}
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
// TODO: normal q5_0
(void) extra;
cl_int err;
cl_mem data_device = clCreateBuffer(context, CL_MEM_READ_WRITE,
ggml_nbytes(tensor), NULL, &err);
CL_CHECK(err);
cl_kernel kernel = backend_ctx->kernel_restore_block_q5_0;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra->qs));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra->qh));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra->d));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &data_device));
size_t global_work_size[] = {(size_t)ggml_nelements(tensor)/ggml_blck_size(tensor->type), 1, 1};
size_t local_work_size[] = {1, 1, 1};
cl_event evt;
CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
global_work_size, local_work_size, 0, NULL, &evt));
CL_CHECK(clWaitForEvents(1, &evt));
CL_CHECK(clEnqueueReadBuffer(
queue, data_device, CL_TRUE, offset,
size, data, 0, NULL, NULL));
CL_CHECK(clReleaseMemObject(data_device));
return;
}
if (tensor->type == GGML_TYPE_Q5_1) {
@@ -7123,8 +7346,29 @@ static void ggml_backend_opencl_buffer_get_tensor(ggml_backend_buffer_t buffer,
return;
}
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
// TODO: normal q5_1
(void) extra;
cl_int err;
cl_mem data_device = clCreateBuffer(context, CL_MEM_READ_WRITE,
ggml_nbytes(tensor), NULL, &err);
CL_CHECK(err);
cl_kernel kernel = backend_ctx->kernel_restore_block_q5_1;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra->qs));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra->qh));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra->d));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra->m));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &data_device));
size_t global_work_size[] = {(size_t)ggml_nelements(tensor)/ggml_blck_size(tensor->type), 1, 1};
size_t local_work_size[] = {1, 1, 1};
cl_event evt;
CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
global_work_size, local_work_size, 0, NULL, &evt));
CL_CHECK(clWaitForEvents(1, &evt));
CL_CHECK(clEnqueueReadBuffer(
queue, data_device, CL_TRUE, offset,
size, data, 0, NULL, NULL));
CL_CHECK(clReleaseMemObject(data_device));
return;
}
if (tensor->type == GGML_TYPE_MXFP4) {
@@ -7661,6 +7905,41 @@ static void ggml_backend_opencl_buffer_get_tensor(ggml_backend_buffer_t buffer,
}
#endif // GGML_OPENCL_SOA_Q
if (tensor->type == GGML_TYPE_BF16) {
GGML_ASSERT(offset % sizeof(ggml_fp16_t) == 0 && size % sizeof(ggml_fp16_t) == 0
&& "Offset and size must be multiples of 2 for bf16 tensors");
ggml_tensor_extra_cl * extra = (ggml_tensor_extra_cl *) tensor->extra;
GGML_ASSERT(extra);
cl_ulong n_elements = size / sizeof(ggml_fp16_t);
cl_ulong off_src = (extra->offset + tensor->view_offs + offset) / sizeof(ggml_fp16_t);
cl_int err;
cl_mem data_device = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
CL_CHECK(err);
cl_kernel kernel = backend_ctx->kernel_convert_f16_to_bf16;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra->data_device));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_ulong), &off_src));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &data_device));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_ulong), &n_elements));
size_t global_work_size[] = { (size_t)CEIL_DIV(n_elements, 64)*64, 1, 1 };
size_t local_work_size[] = { 64, 1, 1 };
cl_event evt;
CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, &evt));
CL_CHECK(clWaitForEvents(1, &evt));
CL_CHECK(clReleaseEvent(evt));
CL_CHECK(clEnqueueReadBuffer(
queue, data_device, CL_TRUE, 0, size, data, 0, NULL, NULL));
CL_CHECK(clReleaseMemObject(data_device));
return;
}
ggml_tensor_extra_cl * extra = (ggml_tensor_extra_cl *) tensor->extra;
CL_CHECK(clEnqueueReadBuffer(
@@ -7823,6 +8102,8 @@ static ggml_backend_t ggml_backend_opencl_device_init(ggml_backend_dev_t dev, co
/* .context = */ backend_ctx,
};
ggml_backend_opencl_device_context * dev_ctx = (ggml_backend_opencl_device_context *) dev->context;
ggml_opencl_print_backend_info(dev_ctx);
return backend;
GGML_UNUSED(params);
@@ -8148,6 +8429,7 @@ static void ggml_cl_copy_to_contiguous(ggml_backend_t backend, const ggml_tensor
kernel = backend_ctx->kernel_cpy_f32_f32;
break;
case GGML_TYPE_F16:
case GGML_TYPE_BF16: // stored as f16 on device
kernel = backend_ctx->kernel_cpy_f16_f16;
break;
default:
@@ -11108,7 +11390,8 @@ static bool ggml_cl_can_use_adreno_xmem_gemm_f16_f32(
if (backend_ctx->gpu_family != GPU_FAMILY::ADRENO) {
return false;
}
if (src0->type != GGML_TYPE_F16 || src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
if ((src0->type != GGML_TYPE_F16 && src0->type != GGML_TYPE_BF16) ||
src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
return false;
}
if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(src1) || !ggml_is_contiguous(dst)) {
@@ -12826,7 +13109,8 @@ static void ggml_cl_mul_mat(ggml_backend_t backend, const ggml_tensor * src0, co
GGML_ASSERT(dst);
GGML_ASSERT(dst->extra);
const enum ggml_type src0t = src0->type;
// bf16 is stored as f16 on device
const enum ggml_type src0t = (src0->type == GGML_TYPE_BF16) ? GGML_TYPE_F16 : src0->type;
const enum ggml_type src1t = src1->type;
ggml_backend_opencl_context *backend_ctx = (ggml_backend_opencl_context *)backend->context;
@@ -12842,6 +13126,8 @@ static void ggml_cl_mul_mat(ggml_backend_t backend, const ggml_tensor * src0, co
#ifdef GGML_OPENCL_SOA_Q
ggml_tensor_extra_cl_q4_0 * extra0_q4_0 = (ggml_tensor_extra_cl_q4_0 *)src0->extra;
ggml_tensor_extra_cl_q4_1 * extra0_q4_1 = (ggml_tensor_extra_cl_q4_1 *)src0->extra;
ggml_tensor_extra_cl_q5_0 * extra0_q5_0 = (ggml_tensor_extra_cl_q5_0 *)src0->extra;
ggml_tensor_extra_cl_q5_1 * extra0_q5_1 = (ggml_tensor_extra_cl_q5_1 *)src0->extra;
ggml_tensor_extra_cl_mxfp4 * extra0_mxfp4 = (ggml_tensor_extra_cl_mxfp4 *)src0->extra;
ggml_tensor_extra_cl_q8_0 * extra0_q8_0 = (ggml_tensor_extra_cl_q8_0 *)src0->extra;
ggml_tensor_extra_cl_iq4_nl * extra0_iq4_nl = (ggml_tensor_extra_cl_iq4_nl *)src0->extra;
@@ -13177,6 +13463,93 @@ static void ggml_cl_mul_mat(ggml_backend_t backend, const ggml_tensor * src0, co
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
return;
}
case GGML_TYPE_Q5_0: {
if (ne11 < 32) {
break;
}
if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(src1)) {
break;
}
kernel = backend_ctx->kernel_mul_mm_q5_0_f32_l4_lm;
nth0 = 128; // calculated as (BM*BN)/(TM*TN)
int batch_stride_a = ne00*ne01;
int batch_stride_b = ne10*ne11;
int batch_stride_d = ne0*ne1;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra0_q5_0->qs));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra0_q5_0->qh));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra0_q5_0->d));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra1->data_device));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_ulong), &offset1));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(cl_ulong), &offsetd));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(int), &ne02));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(int), &ne11));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(int), &ne12));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(int), &ne10)); // stride_a
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(int), &ne10)); // stride_b
CL_CHECK(clSetKernelArg(kernel, 14, sizeof(int), &ne01)); // stride_d
CL_CHECK(clSetKernelArg(kernel, 15, sizeof(int), &batch_stride_a));
CL_CHECK(clSetKernelArg(kernel, 16, sizeof(int), &batch_stride_b));
CL_CHECK(clSetKernelArg(kernel, 17, sizeof(int), &batch_stride_d));
CL_CHECK(clSetKernelArg(kernel, 18, sizeof(int), &r2));
CL_CHECK(clSetKernelArg(kernel, 19, sizeof(int), &r3));
// 64 is block tile size BM and BN - change here when BM and BN in the kernel are changed.
size_t global_work_size[] = {(size_t)(CEIL_DIV(ne01, 64)*nth0), (size_t)(CEIL_DIV(ne11, 64)), (size_t)ne12*ne13};
size_t local_work_size[] = {(size_t)nth0, 1, 1};
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
return;
}
case GGML_TYPE_Q5_1: {
if (ne11 < 32) {
break;
}
if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(src1)) {
break;
}
kernel = backend_ctx->kernel_mul_mm_q5_1_f32_l4_lm;
nth0 = 128; // calculated as (BM*BN)/(TM*TN)
int batch_stride_a = ne00*ne01;
int batch_stride_b = ne10*ne11;
int batch_stride_d = ne0*ne1;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra0_q5_1->qs));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra0_q5_1->qh));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra0_q5_1->d));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra0_q5_1->m));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extra1->data_device));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_ulong), &offset1));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(cl_ulong), &offsetd));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(int), &ne02));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(int), &ne11));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(int), &ne12));
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(int), &ne10)); // stride_a
CL_CHECK(clSetKernelArg(kernel, 14, sizeof(int), &ne10)); // stride_b
CL_CHECK(clSetKernelArg(kernel, 15, sizeof(int), &ne01)); // stride_d
CL_CHECK(clSetKernelArg(kernel, 16, sizeof(int), &batch_stride_a));
CL_CHECK(clSetKernelArg(kernel, 17, sizeof(int), &batch_stride_b));
CL_CHECK(clSetKernelArg(kernel, 18, sizeof(int), &batch_stride_d));
CL_CHECK(clSetKernelArg(kernel, 19, sizeof(int), &r2));
CL_CHECK(clSetKernelArg(kernel, 20, sizeof(int), &r3));
// 64 is block tile size BM and BN - change here when BM and BN in the kernel are changed.
size_t global_work_size[] = {(size_t)(CEIL_DIV(ne01, 64)*nth0), (size_t)(CEIL_DIV(ne11, 64)), (size_t)ne12*ne13};
size_t local_work_size[] = {(size_t)nth0, 1, 1};
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
return;
}
case GGML_TYPE_Q8_0: {
if (ne11 < 32) {
break;
@@ -13713,6 +14086,137 @@ static void ggml_cl_mul_mat(ggml_backend_t backend, const ggml_tensor * src0, co
#endif // GGML_OPENCL_SOA_Q
break;
}
case GGML_TYPE_Q5_0: {
#ifdef GGML_OPENCL_SOA_Q
if (backend_ctx->gpu_family == INTEL) {
nth0 = 16;
nth1 = 1;
ndst = 4;
} else if (backend_ctx->gpu_family == ADRENO) {
nth0 = 64;
nth1 = 1;
ndst = 4;
} else {
GGML_ASSERT(false && "TODO: Unknown GPU");
}
kernel = backend_ctx->kernel_mul_mv_q5_0_f32_flat;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra0_q5_0->qs));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra0_q5_0->qh));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra0_q5_0->d));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra1->data_device));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_ulong), &offset1));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(cl_ulong), &offsetd));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(int), &ne02));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(int), &ne10));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(int), &ne12));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(int), &ne0));
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(int), &ne1));
CL_CHECK(clSetKernelArg(kernel, 14, sizeof(int), &r2));
CL_CHECK(clSetKernelArg(kernel, 15, sizeof(int), &r3));
#else
if (backend_ctx->gpu_family == INTEL) {
nth0 = 16;
nth1 = 1;
ndst = 4;
} else if (backend_ctx->gpu_family == ADRENO) {
nth0 = 64;
nth1 = 1;
ndst = 4;
} else {
GGML_ASSERT(false && "TODO: Unknown GPU");
}
kernel = backend_ctx->kernel_mul_mv_q5_0_f32;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra0->data_device));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_ulong), &offset0));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra1->data_device));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_ulong), &offset1));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_ulong), &offsetd));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(int), &ne02));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(int), &ne10));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(int), &ne12));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(int), &ne0));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(int), &ne1));
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(int), &r2));
CL_CHECK(clSetKernelArg(kernel, 14, sizeof(int), &r3));
#endif // GGML_OPENCL_SOA_Q
break;
}
case GGML_TYPE_Q5_1: {
#ifdef GGML_OPENCL_SOA_Q
if (backend_ctx->gpu_family == INTEL) {
nth0 = 16;
nth1 = 1;
ndst = 4;
} else if (backend_ctx->gpu_family == ADRENO) {
nth0 = 64;
nth1 = 1;
ndst = 4;
} else {
GGML_ASSERT(false && "TODO: Unknown GPU");
}
kernel = backend_ctx->kernel_mul_mv_q5_1_f32_flat;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra0_q5_1->qs));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra0_q5_1->qh));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra0_q5_1->d));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra0_q5_1->m));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extra1->data_device));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_ulong), &offset1));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(cl_ulong), &offsetd));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(int), &ne02));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(int), &ne10));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(int), &ne12));
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(int), &ne0));
CL_CHECK(clSetKernelArg(kernel, 14, sizeof(int), &ne1));
CL_CHECK(clSetKernelArg(kernel, 15, sizeof(int), &r2));
CL_CHECK(clSetKernelArg(kernel, 16, sizeof(int), &r3));
#else
if (backend_ctx->gpu_family == INTEL) {
nth0 = 16;
nth1 = 1;
ndst = 4;
} else if (backend_ctx->gpu_family == ADRENO) {
nth0 = 64;
nth1 = 1;
ndst = 4;
} else {
GGML_ASSERT(false && "TODO: Unknown GPU");
}
kernel = backend_ctx->kernel_mul_mv_q5_1_f32;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra0->data_device));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_ulong), &offset0));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra1->data_device));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_ulong), &offset1));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_ulong), &offsetd));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(int), &ne02));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(int), &ne10));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(int), &ne12));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(int), &ne0));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(int), &ne1));
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(int), &r2));
CL_CHECK(clSetKernelArg(kernel, 14, sizeof(int), &r3));
#endif // GGML_OPENCL_SOA_Q
break;
}
case GGML_TYPE_Q8_0: {
#ifdef GGML_OPENCL_SOA_Q
kernel = backend_ctx->kernel_mul_mv_q8_0_f32_flat;
@@ -14153,6 +14657,8 @@ static void ggml_cl_mul_mat(ggml_backend_t backend, const ggml_tensor * src0, co
if (src0t == GGML_TYPE_Q4_0 || src0t == GGML_TYPE_MXFP4 ||
src0t == GGML_TYPE_Q4_1 ||
src0t == GGML_TYPE_Q5_0 ||
src0t == GGML_TYPE_Q5_1 ||
src0t == GGML_TYPE_Q8_0 ||
src0t == GGML_TYPE_IQ4_NL ||
src0t == GGML_TYPE_Q2_K) {
+142
View File
@@ -117,6 +117,48 @@ struct block_iq4_nl
uint8_t qs[QK4_NL / 2];
};
//------------------------------------------------------------------------------
// bf16 to f16
//------------------------------------------------------------------------------
kernel void kernel_convert_bf16_to_f16(
global const ushort * src,
global half * dst,
ulong off_dst,
ulong n
) {
uint i = get_global_id(0);
if (i >= n) {
return;
}
dst[i + off_dst] = (half) as_float((uint) src[i] << 16);
}
//------------------------------------------------------------------------------
// f16 to bf16
//------------------------------------------------------------------------------
kernel void kernel_convert_f16_to_bf16(
global const half * src,
ulong off_src,
global ushort * dst,
ulong n
) {
uint i = get_global_id(0);
if (i >= n) {
return;
}
float f = (float) src[i + off_src];
uint bits = as_uint(f);
if ((bits & 0x7fffffffu) > 0x7f800000u) {
// nan to quiet nan
dst[i] = (ushort)((bits >> 16) | 0x40u);
} else {
uint rounded = bits + 0x7fffu + ((bits >> 16) & 1u);
dst[i] = (ushort)(rounded >> 16);
}
}
//------------------------------------------------------------------------------
// kernel_convert_block_q4_0
// Convert the block_q4_0 format to 2 separate arrays (AOS -> SOA).
@@ -495,6 +537,53 @@ kernel void kernel_restore_block_q4_1_trans4_ns(
((__global ushort8 *)(&(b->qs[0])))[0] = pre_block;
}
//------------------------------------------------------------------------------
// kernel_convert_block_q5_0
// Convert the block_q5_0 format to 3 separate arrays (AOS -> SOA).
// This kernel does not deshuffle the bits.
//------------------------------------------------------------------------------
kernel void kernel_convert_block_q5_0(
global struct block_q5_0 * src0,
global uchar * dst_qs,
global uint * dst_qh,
global half * dst_d,
ulong n_blk
) {
if (get_global_id(0) >= n_blk) {
return;
}
global struct block_q5_0 * b = (global struct block_q5_0 *) src0 + get_global_id(0);
global uchar * qs = (global uchar *) dst_qs + (QK5_0/2)*get_global_id(0);
global uint * qh = (global uint *) dst_qh + get_global_id(0);
global half * d = (global half *) dst_d + get_global_id(0);
*d = b->d;
*qh = *((global uint *)(b->qh));
for (int i = 0; i < QK5_0/2; ++i) {
qs[i] = b->qs[i];
}
}
kernel void kernel_restore_block_q5_0(
global uchar * src_qs,
global uint * src_qh,
global half * src_d,
global struct block_q5_0 * dst
) {
global struct block_q5_0 * b = (global struct block_q5_0 *) dst + get_global_id(0);
global uchar * qs = (global uchar *) src_qs + (QK5_0/2)*get_global_id(0);
global uint * qh = (global uint *) src_qh + get_global_id(0);
global half * d = (global half *) src_d + get_global_id(0);
b->d = *d;
*((global uint *)(b->qh)) = *qh;
for (int i = 0; i < QK5_0/2; ++i) {
b->qs[i] = qs[i];
}
}
kernel void kernel_convert_block_q5_0_trans4_ns(
__global struct block_q5_0 * src0,
__global uint * dst_qs,
@@ -594,6 +683,59 @@ kernel void kernel_restore_block_q5_0_trans4_ns(
((__global ushort8 *)(&(b->qs[0])))[0] = pre_block;
}
//------------------------------------------------------------------------------
// kernel_convert_block_q5_1
// Convert the block_q5_1 format to 4 separate arrays (AOS -> SOA).
// This kernel does not deshuffle the bits.
//------------------------------------------------------------------------------
kernel void kernel_convert_block_q5_1(
global struct block_q5_1 * src0,
global uchar * dst_qs,
global uint * dst_qh,
global half * dst_d,
global half * dst_m,
ulong n_blk
) {
if (get_global_id(0) >= n_blk) {
return;
}
global struct block_q5_1 * b = (global struct block_q5_1 *) src0 + get_global_id(0);
global uchar * qs = (global uchar *) dst_qs + (QK5_1/2)*get_global_id(0);
global uint * qh = (global uint *) dst_qh + get_global_id(0);
global half * d = (global half *) dst_d + get_global_id(0);
global half * m = (global half *) dst_m + get_global_id(0);
*d = b->d;
*m = b->m;
*qh = *((global uint *)(b->qh));
for (int i = 0; i < QK5_1/2; ++i) {
qs[i] = b->qs[i];
}
}
kernel void kernel_restore_block_q5_1(
global uchar * src_qs,
global uint * src_qh,
global half * src_d,
global half * src_m,
global struct block_q5_1 * dst
) {
global struct block_q5_1 * b = (global struct block_q5_1 *) dst + get_global_id(0);
global uchar * qs = (global uchar *) src_qs + (QK5_1/2)*get_global_id(0);
global uint * qh = (global uint *) src_qh + get_global_id(0);
global half * d = (global half *) src_d + get_global_id(0);
global half * m = (global half *) src_m + get_global_id(0);
b->d = *d;
b->m = *m;
*((global uint *)(b->qh)) = *qh;
for (int i = 0; i < QK5_1/2; ++i) {
b->qs[i] = qs[i];
}
}
kernel void kernel_convert_block_q5_1_trans4_ns(
__global struct block_q5_1 * src0,
__global uint * dst_qs,
@@ -0,0 +1,173 @@
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#define LOAD_VEC_A 8
#define LOAD_VEC_B 4
#define BM 64
#define BN 64
#define BK 32
#define TM 4
#define TN 8
kernel void kernel_mul_mm_q5_0_f32_l4_lm(
global uchar4 * src0_qs,
global uint * src0_qh,
global half * src0_d,
global float4 * src1,
ulong offset1,
global float * dst,
ulong offsetd,
int ne00,
int ne01,
int ne02,
int ne11,
int ne12,
int stride_a,
int stride_b,
int stride_d,
int batch_stride_a,
int batch_stride_b,
int batch_stride_d,
int r2,
int r3
) {
src1 = (global float4*)((global char*)src1 + offset1);
dst = (global float *)((global char*)dst + offsetd);
local float buf_a[BM * BK];
local float buf_b[BN * BK];
const int batch_idx = get_global_id(2);
const int i13 = batch_idx / ne12;
const int i12 = batch_idx % ne12;
const int i03 = i13 / r3;
const int i02 = i12 / r2;
const int batch_idx_a = i03 * ne02 + i02;
const int ir = get_group_id(0);
const int ic = get_group_id(1);
const int tid = get_local_id(0);
const int th_r = tid % (BM / TM);
const int th_c = tid / (BM / TM);
const int loadr_a = get_local_id(0) % (BK / LOAD_VEC_A);
const int loadc_a = get_local_id(0) / (BK / LOAD_VEC_A);
const int loadr_b = get_local_id(0) % (BK / LOAD_VEC_B);
const int loadc_b = get_local_id(0) / (BK / LOAD_VEC_B);
const int loadstride_a = get_local_size(0) * LOAD_VEC_A / BK;
const int loadstride_b = get_local_size(0) * LOAD_VEC_B / BK;
int pos_a = (batch_idx_a * batch_stride_a + ir * BM * stride_a) / LOAD_VEC_A;
int pos_b = (batch_idx * batch_stride_b + ic * BN * stride_b) / LOAD_VEC_B;
float sums[TM * TN];
float cache_a[TM];
float cache_b[TN];
for (int i = 0; i < TM * TN; i++) {
sums[i] = 0.0f;
}
for (int block = 0; block < ne00; block += BK) {
for (int l = 0; l < BM; l += loadstride_a) {
if (ir*BM + loadc_a + l < ne01) {
int idx = pos_a + (loadc_a + l) * stride_a / LOAD_VEC_A + loadr_a;
int ib = idx / 4;
int iqs = idx % 4;
float d = (float)src0_d[ib];
uint qh_val = src0_qh[ib];
global uchar4 * qs_ptr = src0_qs + ib*4 + iqs;
uchar4 q = *qs_ptr;
uint qh_lo = qh_val >> (iqs * 4);
uint qh_hi = qh_val >> (iqs * 4 + 16);
uchar4 b_lo = (uchar4)((uchar)qh_lo, (uchar)(qh_lo >> 1), (uchar)(qh_lo >> 2), (uchar)(qh_lo >> 3)) & (uchar)1;
uchar4 b_hi = (uchar4)((uchar)qh_hi, (uchar)(qh_hi >> 1), (uchar)(qh_hi >> 2), (uchar)(qh_hi >> 3)) & (uchar)1;
float4 v1 = (convert_float4((q & (uchar)0x0F) | (b_lo << (uchar)4)) - 16.0f) * d;
float4 v2 = (convert_float4((q >> (uchar)4) | (b_hi << (uchar)4)) - 16.0f) * d;
buf_a[(loadr_a * 4 + 0) * BM + loadc_a + l] = v1.s0;
buf_a[(loadr_a * 4 + 1) * BM + loadc_a + l] = v1.s1;
buf_a[(loadr_a * 4 + 2) * BM + loadc_a + l] = v1.s2;
buf_a[(loadr_a * 4 + 3) * BM + loadc_a + l] = v1.s3;
buf_a[(loadr_a * 4 + 16) * BM + loadc_a + l] = v2.s0;
buf_a[(loadr_a * 4 + 17) * BM + loadc_a + l] = v2.s1;
buf_a[(loadr_a * 4 + 18) * BM + loadc_a + l] = v2.s2;
buf_a[(loadr_a * 4 + 19) * BM + loadc_a + l] = v2.s3;
} else {
buf_a[(loadr_a * 4 + 0) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 1) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 2) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 3) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 16) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 17) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 18) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 19) * BM + loadc_a + l] = 0.0f;
}
}
for (int l = 0; l < BN; l += loadstride_b) {
if (ic*BN + loadc_b + l < ne11) {
int idx = pos_b + (loadc_b + l) * stride_b / LOAD_VEC_B + loadr_b;
buf_b[(loadr_b * LOAD_VEC_B + 0) * BN + loadc_b + l] = src1[idx].s0;
buf_b[(loadr_b * LOAD_VEC_B + 1) * BN + loadc_b + l] = src1[idx].s1;
buf_b[(loadr_b * LOAD_VEC_B + 2) * BN + loadc_b + l] = src1[idx].s2;
buf_b[(loadr_b * LOAD_VEC_B + 3) * BN + loadc_b + l] = src1[idx].s3;
} else {
buf_b[(loadr_b * LOAD_VEC_B + 0) * BN + loadc_b + l] = 0.0f;
buf_b[(loadr_b * LOAD_VEC_B + 1) * BN + loadc_b + l] = 0.0f;
buf_b[(loadr_b * LOAD_VEC_B + 2) * BN + loadc_b + l] = 0.0f;
buf_b[(loadr_b * LOAD_VEC_B + 3) * BN + loadc_b + l] = 0.0f;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
pos_a += BK / LOAD_VEC_A;
pos_b += BK / LOAD_VEC_B;
for (int i = 0; i < BK; i++) {
for (int j = 0; j < TM; j++) {
cache_a[j] = buf_a[(i) * BM + th_r * TM + j];
}
for (int j = 0; j < TN; j++) {
cache_b[j] = buf_b[(i) * BN + th_c * TN + j];
}
for (int cc = 0; cc < TN; cc++) {
for (int cr = 0; cr < TM; cr++) {
const int sums_idx = cc*TM + cr;
sums[sums_idx] = mad(cache_a[cr], cache_b[cc], sums[sums_idx]);
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
const int dr = ir * BM + th_r * TM;
const int dc = ic * BN + th_c * TN;
const int offsets = batch_idx * batch_stride_d;
for (int cc = 0; cc < TN; cc++) {
for (int cr = 0; cr < TM; cr++) {
if (dr + cr < ne01 && dc + cc < ne11) {
dst[offsets + (dc + cc) * stride_d + dr + cr] = sums[cc * TM + cr];
}
}
}
}
@@ -0,0 +1,175 @@
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#define LOAD_VEC_A 8
#define LOAD_VEC_B 4
#define BM 64
#define BN 64
#define BK 32
#define TM 4
#define TN 8
kernel void kernel_mul_mm_q5_1_f32_l4_lm(
global uchar4 * src0_qs,
global uint * src0_qh,
global half * src0_d,
global half * src0_m,
global float4 * src1,
ulong offset1,
global float * dst,
ulong offsetd,
int ne00,
int ne01,
int ne02,
int ne11,
int ne12,
int stride_a,
int stride_b,
int stride_d,
int batch_stride_a,
int batch_stride_b,
int batch_stride_d,
int r2,
int r3
) {
src1 = (global float4*)((global char*)src1 + offset1);
dst = (global float *)((global char*)dst + offsetd);
local float buf_a[BM * BK];
local float buf_b[BN * BK];
const int batch_idx = get_global_id(2);
const int i13 = batch_idx / ne12;
const int i12 = batch_idx % ne12;
const int i03 = i13 / r3;
const int i02 = i12 / r2;
const int batch_idx_a = i03 * ne02 + i02;
const int ir = get_group_id(0);
const int ic = get_group_id(1);
const int tid = get_local_id(0);
const int th_r = tid % (BM / TM);
const int th_c = tid / (BM / TM);
const int loadr_a = get_local_id(0) % (BK / LOAD_VEC_A);
const int loadc_a = get_local_id(0) / (BK / LOAD_VEC_A);
const int loadr_b = get_local_id(0) % (BK / LOAD_VEC_B);
const int loadc_b = get_local_id(0) / (BK / LOAD_VEC_B);
const int loadstride_a = get_local_size(0) * LOAD_VEC_A / BK;
const int loadstride_b = get_local_size(0) * LOAD_VEC_B / BK;
int pos_a = (batch_idx_a * batch_stride_a + ir * BM * stride_a) / LOAD_VEC_A;
int pos_b = (batch_idx * batch_stride_b + ic * BN * stride_b) / LOAD_VEC_B;
float sums[TM * TN];
float cache_a[TM];
float cache_b[TN];
for (int i = 0; i < TM * TN; i++) {
sums[i] = 0.0f;
}
for (int block = 0; block < ne00; block += BK) {
for (int l = 0; l < BM; l += loadstride_a) {
if (ir*BM + loadc_a + l < ne01) {
int idx = pos_a + (loadc_a + l) * stride_a / LOAD_VEC_A + loadr_a;
int ib = idx / 4;
int iqs = idx % 4;
float d = (float)src0_d[ib];
float m = (float)src0_m[ib];
uint qh_val = src0_qh[ib];
global uchar4 * qs = src0_qs + ib*4 + iqs;
uchar4 q = *qs;
uint qh_lo = qh_val >> (iqs * 4);
uint qh_hi = qh_val >> (iqs * 4 + 16);
uchar4 b_lo = (uchar4)((uchar)qh_lo, (uchar)(qh_lo >> 1), (uchar)(qh_lo >> 2), (uchar)(qh_lo >> 3)) & (uchar)1;
uchar4 b_hi = (uchar4)((uchar)qh_hi, (uchar)(qh_hi >> 1), (uchar)(qh_hi >> 2), (uchar)(qh_hi >> 3)) & (uchar)1;
float4 v1 = convert_float4((q & (uchar)0x0F) | (b_lo << (uchar)4)) * d + m;
float4 v2 = convert_float4((q >> (uchar)4) | (b_hi << (uchar)4)) * d + m;
buf_a[(loadr_a * 4 + 0) * BM + loadc_a + l] = v1.s0;
buf_a[(loadr_a * 4 + 1) * BM + loadc_a + l] = v1.s1;
buf_a[(loadr_a * 4 + 2) * BM + loadc_a + l] = v1.s2;
buf_a[(loadr_a * 4 + 3) * BM + loadc_a + l] = v1.s3;
buf_a[(loadr_a * 4 + 16) * BM + loadc_a + l] = v2.s0;
buf_a[(loadr_a * 4 + 17) * BM + loadc_a + l] = v2.s1;
buf_a[(loadr_a * 4 + 18) * BM + loadc_a + l] = v2.s2;
buf_a[(loadr_a * 4 + 19) * BM + loadc_a + l] = v2.s3;
} else {
buf_a[(loadr_a * 4 + 0) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 1) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 2) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 3) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 16) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 17) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 18) * BM + loadc_a + l] = 0.0f;
buf_a[(loadr_a * 4 + 19) * BM + loadc_a + l] = 0.0f;
}
}
for (int l = 0; l < BN; l += loadstride_b) {
if (ic*BN + loadc_b + l < ne11) {
int idx = pos_b + (loadc_b + l) * stride_b / LOAD_VEC_B + loadr_b;
buf_b[(loadr_b * LOAD_VEC_B + 0) * BN + loadc_b + l] = src1[idx].s0;
buf_b[(loadr_b * LOAD_VEC_B + 1) * BN + loadc_b + l] = src1[idx].s1;
buf_b[(loadr_b * LOAD_VEC_B + 2) * BN + loadc_b + l] = src1[idx].s2;
buf_b[(loadr_b * LOAD_VEC_B + 3) * BN + loadc_b + l] = src1[idx].s3;
} else {
buf_b[(loadr_b * LOAD_VEC_B + 0) * BN + loadc_b + l] = 0.0f;
buf_b[(loadr_b * LOAD_VEC_B + 1) * BN + loadc_b + l] = 0.0f;
buf_b[(loadr_b * LOAD_VEC_B + 2) * BN + loadc_b + l] = 0.0f;
buf_b[(loadr_b * LOAD_VEC_B + 3) * BN + loadc_b + l] = 0.0f;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
pos_a += BK / LOAD_VEC_A;
pos_b += BK / LOAD_VEC_B;
for (int i = 0; i < BK; i++) {
for (int j = 0; j < TM; j++) {
cache_a[j] = buf_a[(i) * BM + th_r * TM + j];
}
for (int j = 0; j < TN; j++) {
cache_b[j] = buf_b[(i) * BN + th_c * TN + j];
}
for (int cc = 0; cc < TN; cc++) {
for (int cr = 0; cr < TM; cr++) {
const int sums_idx = cc*TM + cr;
sums[sums_idx] = mad(cache_a[cr], cache_b[cc], sums[sums_idx]);
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
const int dr = ir * BM + th_r * TM;
const int dc = ic * BN + th_c * TN;
const int offsets = batch_idx * batch_stride_d;
for (int cc = 0; cc < TN; cc++) {
for (int cr = 0; cr < TM; cr++) {
if (dr + cr < ne01 && dc + cc < ne11) {
dst[offsets + (dc + cc) * stride_d + dr + cr] = sums[cc * TM + cr];
}
}
}
}
@@ -0,0 +1,241 @@
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#ifdef cl_intel_subgroups
#pragma OPENCL EXTENSION cl_intel_subgroups : enable
#else
#pragma OPENCL EXTENSION cl_khr_subgroups : enable
#endif
#ifdef cl_intel_required_subgroup_size
#pragma OPENCL EXTENSION cl_intel_required_subgroup_size : enable
#define INTEL_GPU 1
#define REQD_SUBGROUP_SIZE_16 __attribute__((intel_reqd_sub_group_size(16)))
#define REQD_SUBGROUP_SIZE_32 __attribute__((intel_reqd_sub_group_size(32)))
#elif defined(cl_qcom_reqd_sub_group_size)
#pragma OPENCL EXTENSION cl_qcom_reqd_sub_group_size : enable
#define ADRENO_GPU 1
#define REQD_SUBGROUP_SIZE_64 __attribute__((qcom_reqd_sub_group_size("half")))
#define REQD_SUBGROUP_SIZE_128 __attribute__((qcom_reqd_sub_group_size("full")))
#endif
#define QK5_0 32
struct block_q5_0 {
half d;
uchar qh[4];
uchar qs[QK5_0 / 2];
};
inline float block_q5_0_dot_y(
global const struct block_q5_0 * qb_curr,
float sumy,
float16 yl,
int il,
global const float * yb
) {
float d = qb_curr->d;
float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
global const ushort * qs = ((global const ushort *)((global const uchar *) qb_curr + 6 + il));
acc.s0 += yl.s0 * (qs[0] & 0x000F);
acc.s0 += yl.s1 * (qs[0] & 0x0F00);
acc.s0 += yl.s8 * (qs[0] & 0x00F0);
acc.s3 += yl.s9 * (qs[0] & 0xF000);
acc.s0 += yl.s2 * (qs[1] & 0x000F);
acc.s1 += yl.s3 * (qs[1] & 0x0F00);
acc.s2 += yl.sa * (qs[1] & 0x00F0);
acc.s3 += yl.sb * (qs[1] & 0xF000);
acc.s0 += yl.s4 * (qs[2] & 0x000F);
acc.s1 += yl.s5 * (qs[2] & 0x0F00);
acc.s2 += yl.sc * (qs[2] & 0x00F0);
acc.s3 += yl.sd * (qs[2] & 0xF000);
acc.s0 += yl.s6 * (qs[3] & 0x000F);
acc.s1 += yl.s7 * (qs[3] & 0x0F00);
acc.s2 += yl.se * (qs[3] & 0x00F0);
acc.s3 += yl.sf * (qs[3] & 0xF000);
uint qh_val = *((global const uint *)((global const uchar *) qb_curr + 2));
uchar qh_lo = (uchar)((qh_val >> il) & 0xFF);
uchar qh_hi = (uchar)((qh_val >> (il + 16)) & 0xFF);
float qh_sum = 0.0f;
qh_sum += yb[0] * (float)((qh_lo >> 0) & 1);
qh_sum += yb[1] * (float)((qh_lo >> 1) & 1);
qh_sum += yb[2] * (float)((qh_lo >> 2) & 1);
qh_sum += yb[3] * (float)((qh_lo >> 3) & 1);
qh_sum += yb[4] * (float)((qh_lo >> 4) & 1);
qh_sum += yb[5] * (float)((qh_lo >> 5) & 1);
qh_sum += yb[6] * (float)((qh_lo >> 6) & 1);
qh_sum += yb[7] * (float)((qh_lo >> 7) & 1);
qh_sum += yb[16] * (float)((qh_hi >> 0) & 1);
qh_sum += yb[17] * (float)((qh_hi >> 1) & 1);
qh_sum += yb[18] * (float)((qh_hi >> 2) & 1);
qh_sum += yb[19] * (float)((qh_hi >> 3) & 1);
qh_sum += yb[20] * (float)((qh_hi >> 4) & 1);
qh_sum += yb[21] * (float)((qh_hi >> 5) & 1);
qh_sum += yb[22] * (float)((qh_hi >> 6) & 1);
qh_sum += yb[23] * (float)((qh_hi >> 7) & 1);
return d * (acc.s0 + acc.s1 + acc.s2 + acc.s3 + 16.0f * qh_sum - 16.0f * sumy);
}
#undef N_DST
#undef N_SIMDGROUP
#undef N_SIMDWIDTH
#ifdef INTEL_GPU
#define N_DST 4 // each subgroup works on 4 rows
#define N_SIMDGROUP 1 // number of subgroups in a thread group
#define N_SIMDWIDTH 16 // assuming subgroup size is 16
#elif defined (ADRENO_GPU)
#define N_DST 4
#define N_SIMDGROUP 1
#define N_SIMDWIDTH 64
#endif
inline void mul_vec_q_n_f32(
global void * src0,
global float * src1,
global float * dst,
int ne00,
int ne01,
int ne02,
int ne10,
int ne12,
int ne0,
int ne1,
int r2,
int r3
) {
const ulong nb = ne00/QK5_0;
int r0 = get_group_id(0);
int r1 = get_group_id(1);
int im = get_group_id(2);
int first_row = (r0 * N_SIMDGROUP + get_sub_group_id()) * N_DST;
int i12 = im%ne12;
int i13 = im/ne12;
ulong offset0 = first_row * nb + (i12/r2)*(nb*ne01) + (i13/r3)*(nb*ne01*ne02);
global struct block_q5_0 * x = (global struct block_q5_0 *) src0 + offset0;
global float * y = (global float *) src1 + r1*ne10 + im*ne00*ne1;
float16 yl;
float4 sumf = (float4)(0.f, 0.f, 0.f, 0.f);
int ix = get_sub_group_local_id()/2;
int il = 8*(get_sub_group_local_id()%2);
global float * yb = y + ix * QK5_0 + il;
for (int ib = ix; ib < nb; ib += N_SIMDWIDTH/2) {
float sumy = 0;
sumy += yb[0];
sumy += yb[1];
sumy += yb[2];
sumy += yb[3];
sumy += yb[4];
sumy += yb[5];
sumy += yb[6];
sumy += yb[7];
sumy += yb[16];
sumy += yb[17];
sumy += yb[18];
sumy += yb[19];
sumy += yb[20];
sumy += yb[21];
sumy += yb[22];
sumy += yb[23];
yl.s0 = yb[0];
yl.s1 = yb[1]/256.f;
yl.s2 = yb[2];
yl.s3 = yb[3]/256.f;
yl.s4 = yb[4];
yl.s5 = yb[5]/256.f;
yl.s6 = yb[6];
yl.s7 = yb[7]/256.f;
yl.s8 = yb[16]/16.f;
yl.s9 = yb[17]/4096.f;
yl.sa = yb[18]/16.f;
yl.sb = yb[19]/4096.f;
yl.sc = yb[20]/16.f;
yl.sd = yb[21]/4096.f;
yl.se = yb[22]/16.f;
yl.sf = yb[23]/4096.f;
sumf.s0 += block_q5_0_dot_y(x+ib+0*nb, sumy, yl, il, yb);
sumf.s1 += block_q5_0_dot_y(x+ib+1*nb, sumy, yl, il, yb);
sumf.s2 += block_q5_0_dot_y(x+ib+2*nb, sumy, yl, il, yb);
sumf.s3 += block_q5_0_dot_y(x+ib+3*nb, sumy, yl, il, yb);
yb += QK5_0 * (N_SIMDWIDTH/2);
}
float4 tot = (float4)(
sub_group_reduce_add(sumf.s0), sub_group_reduce_add(sumf.s1),
sub_group_reduce_add(sumf.s2), sub_group_reduce_add(sumf.s3)
);
if (get_sub_group_local_id() == 0) {
if (first_row + 0 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 0] = tot.s0;
}
if (first_row + 1 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 1] = tot.s1;
}
if (first_row + 2 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 2] = tot.s2;
}
if (first_row + 3 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 3] = tot.s3;
}
}
}
#ifdef INTEL_GPU
REQD_SUBGROUP_SIZE_16
#elif defined (ADRENO_GPU)
REQD_SUBGROUP_SIZE_64
#endif
kernel void kernel_mul_mv_q5_0_f32(
global void * src0,
ulong offset0,
global float * src1,
ulong offset1,
global float * dst,
ulong offsetd,
int ne00,
int ne01,
int ne02,
int ne10,
int ne12,
int ne0,
int ne1,
int r2,
int r3
) {
src0 = (global void*)((global char*)src0 + offset0);
src1 = (global float*)((global char*)src1 + offset1);
dst = (global float*)((global char*)dst + offsetd);
mul_vec_q_n_f32(src0, src1, dst, ne00, ne01, ne02, ne10, ne12, ne0, ne1, r2, r3);
}
@@ -0,0 +1,243 @@
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#ifdef cl_intel_subgroups
#pragma OPENCL EXTENSION cl_intel_subgroups : enable
#else
#pragma OPENCL EXTENSION cl_khr_subgroups : enable
#endif
#ifdef cl_intel_required_subgroup_size
#pragma OPENCL EXTENSION cl_intel_required_subgroup_size : enable
#define INTEL_GPU 1
#define REQD_SUBGROUP_SIZE_16 __attribute__((intel_reqd_sub_group_size(16)))
#define REQD_SUBGROUP_SIZE_32 __attribute__((intel_reqd_sub_group_size(32)))
#elif defined(cl_qcom_reqd_sub_group_size)
#pragma OPENCL EXTENSION cl_qcom_reqd_sub_group_size : enable
#define ADRENO_GPU 1
#define REQD_SUBGROUP_SIZE_64 __attribute__((qcom_reqd_sub_group_size("half")))
#define REQD_SUBGROUP_SIZE_128 __attribute__((qcom_reqd_sub_group_size("full")))
#endif
#define QK5_0 32
inline float block_q5_0_dot_y_flat(
global const uchar * x,
global const uint * qh_ptr,
global const half * dh,
float sumy,
float16 yl,
int il,
global const float * yb
) {
float d = *dh;
global const ushort * qs = ((global const ushort *)(x + il));
float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
acc.s0 += yl.s0 * (qs[0] & 0x000F);
acc.s0 += yl.s1 * (qs[0] & 0x0F00);
acc.s0 += yl.s8 * (qs[0] & 0x00F0);
acc.s3 += yl.s9 * (qs[0] & 0xF000);
acc.s0 += yl.s2 * (qs[1] & 0x000F);
acc.s1 += yl.s3 * (qs[1] & 0x0F00);
acc.s2 += yl.sa * (qs[1] & 0x00F0);
acc.s3 += yl.sb * (qs[1] & 0xF000);
acc.s0 += yl.s4 * (qs[2] & 0x000F);
acc.s1 += yl.s5 * (qs[2] & 0x0F00);
acc.s2 += yl.sc * (qs[2] & 0x00F0);
acc.s3 += yl.sd * (qs[2] & 0xF000);
acc.s0 += yl.s6 * (qs[3] & 0x000F);
acc.s1 += yl.s7 * (qs[3] & 0x0F00);
acc.s2 += yl.se * (qs[3] & 0x00F0);
acc.s3 += yl.sf * (qs[3] & 0xF000);
uint qh_val = *qh_ptr;
uchar qh_lo = (uchar)((qh_val >> il) & 0xFF);
uchar qh_hi = (uchar)((qh_val >> (il + 16)) & 0xFF);
float qh_sum = 0.0f;
qh_sum += yb[0] * (float)((qh_lo >> 0) & 1);
qh_sum += yb[1] * (float)((qh_lo >> 1) & 1);
qh_sum += yb[2] * (float)((qh_lo >> 2) & 1);
qh_sum += yb[3] * (float)((qh_lo >> 3) & 1);
qh_sum += yb[4] * (float)((qh_lo >> 4) & 1);
qh_sum += yb[5] * (float)((qh_lo >> 5) & 1);
qh_sum += yb[6] * (float)((qh_lo >> 6) & 1);
qh_sum += yb[7] * (float)((qh_lo >> 7) & 1);
qh_sum += yb[16] * (float)((qh_hi >> 0) & 1);
qh_sum += yb[17] * (float)((qh_hi >> 1) & 1);
qh_sum += yb[18] * (float)((qh_hi >> 2) & 1);
qh_sum += yb[19] * (float)((qh_hi >> 3) & 1);
qh_sum += yb[20] * (float)((qh_hi >> 4) & 1);
qh_sum += yb[21] * (float)((qh_hi >> 5) & 1);
qh_sum += yb[22] * (float)((qh_hi >> 6) & 1);
qh_sum += yb[23] * (float)((qh_hi >> 7) & 1);
return d * (acc.s0 + acc.s1 + acc.s2 + acc.s3 + 16.0f * qh_sum - 16.0f * sumy);
}
#undef N_DST
#undef N_SIMDGROUP
#undef N_SIMDWIDTH
#ifdef INTEL_GPU
#define N_DST 4 // each subgroup works on 4 rows
#define N_SIMDGROUP 1 // number of subgroups in a thread group
#define N_SIMDWIDTH 16 // assuming subgroup size is 16
#elif defined (ADRENO_GPU)
#define N_DST 4
#define N_SIMDGROUP 1
#define N_SIMDWIDTH 64
#endif
inline void mul_vec_q_n_f32_flat(
global void * src0_qs,
global void * src0_qh,
global void * src0_d,
global float * src1,
global float * dst,
int ne00,
int ne01,
int ne02,
int ne10,
int ne12,
int ne0,
int ne1,
int r2,
int r3
) {
const ulong nb = ne00/QK5_0;
int r0 = get_group_id(0);
int r1 = get_group_id(1);
int im = get_group_id(2);
int first_row = (r0 * N_SIMDGROUP + get_sub_group_id()) * N_DST;
int i12 = im%ne12;
int i13 = im/ne12;
ulong offset0 = first_row * nb + (i12/r2)*(nb*ne01) + (i13/r3)*(nb*ne01*ne02);
ulong offset0_qs = offset0 * (QK5_0/2);
global uchar * x = (global uchar *) src0_qs + offset0_qs;
global uint * qh = (global uint *) src0_qh + offset0;
global half * d = (global half *) src0_d + offset0;
global float * y = (global float *) src1 + r1*ne10 + im*ne00*ne1;
float16 yl;
float4 sumf = (float4)(0.f, 0.f, 0.f, 0.f);
int ix = get_sub_group_local_id()/2;
int il = 8*(get_sub_group_local_id()%2);
global float * yb = y + ix * QK5_0 + il;
for (int ib = ix; ib < nb; ib += N_SIMDWIDTH/2) {
float sumy = 0;
sumy += yb[0];
sumy += yb[1];
sumy += yb[2];
sumy += yb[3];
sumy += yb[4];
sumy += yb[5];
sumy += yb[6];
sumy += yb[7];
sumy += yb[16];
sumy += yb[17];
sumy += yb[18];
sumy += yb[19];
sumy += yb[20];
sumy += yb[21];
sumy += yb[22];
sumy += yb[23];
yl.s0 = yb[0];
yl.s1 = yb[1]/256.f;
yl.s2 = yb[2];
yl.s3 = yb[3]/256.f;
yl.s4 = yb[4];
yl.s5 = yb[5]/256.f;
yl.s6 = yb[6];
yl.s7 = yb[7]/256.f;
yl.s8 = yb[16]/16.f;
yl.s9 = yb[17]/4096.f;
yl.sa = yb[18]/16.f;
yl.sb = yb[19]/4096.f;
yl.sc = yb[20]/16.f;
yl.sd = yb[21]/4096.f;
yl.se = yb[22]/16.f;
yl.sf = yb[23]/4096.f;
sumf.s0 += block_q5_0_dot_y_flat(x + ib*(QK5_0/2) + 0*nb*(QK5_0/2), qh + ib + 0*nb, d + ib + 0*nb, sumy, yl, il, yb);
sumf.s1 += block_q5_0_dot_y_flat(x + ib*(QK5_0/2) + 1*nb*(QK5_0/2), qh + ib + 1*nb, d + ib + 1*nb, sumy, yl, il, yb);
sumf.s2 += block_q5_0_dot_y_flat(x + ib*(QK5_0/2) + 2*nb*(QK5_0/2), qh + ib + 2*nb, d + ib + 2*nb, sumy, yl, il, yb);
sumf.s3 += block_q5_0_dot_y_flat(x + ib*(QK5_0/2) + 3*nb*(QK5_0/2), qh + ib + 3*nb, d + ib + 3*nb, sumy, yl, il, yb);
yb += QK5_0 * (N_SIMDWIDTH/2);
}
float4 tot = (float4)(
sub_group_reduce_add(sumf.s0), sub_group_reduce_add(sumf.s1),
sub_group_reduce_add(sumf.s2), sub_group_reduce_add(sumf.s3)
);
if (get_sub_group_local_id() == 0) {
if (first_row + 0 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 0] = tot.s0;
}
if (first_row + 1 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 1] = tot.s1;
}
if (first_row + 2 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 2] = tot.s2;
}
if (first_row + 3 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 3] = tot.s3;
}
}
}
#ifdef INTEL_GPU
REQD_SUBGROUP_SIZE_16
#elif defined (ADRENO_GPU)
REQD_SUBGROUP_SIZE_64
#endif
kernel void kernel_mul_mv_q5_0_f32_flat(
global void * src0_qs,
global void * src0_qh,
global void * src0_d,
global float * src1,
ulong offset1,
global float * dst,
ulong offsetd,
int ne00,
int ne01,
int ne02,
int ne10,
int ne12,
int ne0,
int ne1,
int r2,
int r3
) {
src1 = (global float*)((global char*)src1 + offset1);
dst = (global float*)((global char*)dst + offsetd);
mul_vec_q_n_f32_flat(src0_qs, src0_qh, src0_d, src1, dst, ne00, ne01, ne02, ne10, ne12, ne0, ne1, r2, r3);
}
@@ -0,0 +1,243 @@
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#ifdef cl_intel_subgroups
#pragma OPENCL EXTENSION cl_intel_subgroups : enable
#else
#pragma OPENCL EXTENSION cl_khr_subgroups : enable
#endif
#ifdef cl_intel_required_subgroup_size
#pragma OPENCL EXTENSION cl_intel_required_subgroup_size : enable
#define INTEL_GPU 1
#define REQD_SUBGROUP_SIZE_16 __attribute__((intel_reqd_sub_group_size(16)))
#define REQD_SUBGROUP_SIZE_32 __attribute__((intel_reqd_sub_group_size(32)))
#elif defined(cl_qcom_reqd_sub_group_size)
#pragma OPENCL EXTENSION cl_qcom_reqd_sub_group_size : enable
#define ADRENO_GPU 1
#define REQD_SUBGROUP_SIZE_64 __attribute__((qcom_reqd_sub_group_size("half")))
#define REQD_SUBGROUP_SIZE_128 __attribute__((qcom_reqd_sub_group_size("full")))
#endif
#define QK5_1 32
struct block_q5_1 {
half d;
half m;
uchar qh[4];
uchar qs[QK5_1 / 2];
};
inline float block_q5_1_dot_y(
global const struct block_q5_1 * qb_curr,
float sumy,
float16 yl,
int il,
global const float * yb
) {
float d = qb_curr->d;
float m = qb_curr->m;
float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
global const ushort * qs = ((global const ushort *)((global const uchar *) qb_curr + 8 + il));
acc.s0 += yl.s0 * (qs[0] & 0x000F);
acc.s0 += yl.s1 * (qs[0] & 0x0F00);
acc.s0 += yl.s8 * (qs[0] & 0x00F0);
acc.s3 += yl.s9 * (qs[0] & 0xF000);
acc.s0 += yl.s2 * (qs[1] & 0x000F);
acc.s1 += yl.s3 * (qs[1] & 0x0F00);
acc.s2 += yl.sa * (qs[1] & 0x00F0);
acc.s3 += yl.sb * (qs[1] & 0xF000);
acc.s0 += yl.s4 * (qs[2] & 0x000F);
acc.s1 += yl.s5 * (qs[2] & 0x0F00);
acc.s2 += yl.sc * (qs[2] & 0x00F0);
acc.s3 += yl.sd * (qs[2] & 0xF000);
acc.s0 += yl.s6 * (qs[3] & 0x000F);
acc.s1 += yl.s7 * (qs[3] & 0x0F00);
acc.s2 += yl.se * (qs[3] & 0x00F0);
acc.s3 += yl.sf * (qs[3] & 0xF000);
uint qh_val = *((global const uint *)((global const uchar *) qb_curr + 4));
uchar qh_lo = (uchar)((qh_val >> il) & 0xFF);
uchar qh_hi = (uchar)((qh_val >> (il + 16)) & 0xFF);
float qh_sum = 0.0f;
qh_sum += yb[0] * (float)((qh_lo >> 0) & 1);
qh_sum += yb[1] * (float)((qh_lo >> 1) & 1);
qh_sum += yb[2] * (float)((qh_lo >> 2) & 1);
qh_sum += yb[3] * (float)((qh_lo >> 3) & 1);
qh_sum += yb[4] * (float)((qh_lo >> 4) & 1);
qh_sum += yb[5] * (float)((qh_lo >> 5) & 1);
qh_sum += yb[6] * (float)((qh_lo >> 6) & 1);
qh_sum += yb[7] * (float)((qh_lo >> 7) & 1);
qh_sum += yb[16] * (float)((qh_hi >> 0) & 1);
qh_sum += yb[17] * (float)((qh_hi >> 1) & 1);
qh_sum += yb[18] * (float)((qh_hi >> 2) & 1);
qh_sum += yb[19] * (float)((qh_hi >> 3) & 1);
qh_sum += yb[20] * (float)((qh_hi >> 4) & 1);
qh_sum += yb[21] * (float)((qh_hi >> 5) & 1);
qh_sum += yb[22] * (float)((qh_hi >> 6) & 1);
qh_sum += yb[23] * (float)((qh_hi >> 7) & 1);
return d * (acc.s0 + acc.s1 + acc.s2 + acc.s3 + 16.0f * qh_sum) + sumy * m;
}
#undef N_DST
#undef N_SIMDGROUP
#undef N_SIMDWIDTH
#ifdef INTEL_GPU
#define N_DST 4 // each subgroup works on 4 rows
#define N_SIMDGROUP 1 // number of subgroups in a thread group
#define N_SIMDWIDTH 16 // assuming subgroup size is 16
#elif defined (ADRENO_GPU)
#define N_DST 4
#define N_SIMDGROUP 1
#define N_SIMDWIDTH 64
#endif
inline void mul_vec_q_n_f32(
global void * src0,
global float * src1,
global float * dst,
int ne00,
int ne01,
int ne02,
int ne10,
int ne12,
int ne0,
int ne1,
int r2,
int r3
) {
const ulong nb = ne00/QK5_1;
int r0 = get_group_id(0);
int r1 = get_group_id(1);
int im = get_group_id(2);
int first_row = (r0 * N_SIMDGROUP + get_sub_group_id()) * N_DST;
int i12 = im%ne12;
int i13 = im/ne12;
ulong offset0 = first_row * nb + (i12/r2)*(nb*ne01) + (i13/r3)*(nb*ne01*ne02);
global struct block_q5_1 * x = (global struct block_q5_1 *) src0 + offset0;
global float * y = (global float *) src1 + r1*ne10 + im*ne00*ne1;
float16 yl;
float4 sumf = (float4)(0.f, 0.f, 0.f, 0.f);
int ix = get_sub_group_local_id()/2;
int il = 8*(get_sub_group_local_id()%2);
global float * yb = y + ix * QK5_1 + il;
for (int ib = ix; ib < nb; ib += N_SIMDWIDTH/2) {
float sumy = 0;
sumy += yb[0];
sumy += yb[1];
sumy += yb[2];
sumy += yb[3];
sumy += yb[4];
sumy += yb[5];
sumy += yb[6];
sumy += yb[7];
sumy += yb[16];
sumy += yb[17];
sumy += yb[18];
sumy += yb[19];
sumy += yb[20];
sumy += yb[21];
sumy += yb[22];
sumy += yb[23];
yl.s0 = yb[0];
yl.s1 = yb[1]/256.f;
yl.s2 = yb[2];
yl.s3 = yb[3]/256.f;
yl.s4 = yb[4];
yl.s5 = yb[5]/256.f;
yl.s6 = yb[6];
yl.s7 = yb[7]/256.f;
yl.s8 = yb[16]/16.f;
yl.s9 = yb[17]/4096.f;
yl.sa = yb[18]/16.f;
yl.sb = yb[19]/4096.f;
yl.sc = yb[20]/16.f;
yl.sd = yb[21]/4096.f;
yl.se = yb[22]/16.f;
yl.sf = yb[23]/4096.f;
sumf.s0 += block_q5_1_dot_y(x+ib+0*nb, sumy, yl, il, yb);
sumf.s1 += block_q5_1_dot_y(x+ib+1*nb, sumy, yl, il, yb);
sumf.s2 += block_q5_1_dot_y(x+ib+2*nb, sumy, yl, il, yb);
sumf.s3 += block_q5_1_dot_y(x+ib+3*nb, sumy, yl, il, yb);
yb += QK5_1 * (N_SIMDWIDTH/2);
}
float4 tot = (float4)(
sub_group_reduce_add(sumf.s0), sub_group_reduce_add(sumf.s1),
sub_group_reduce_add(sumf.s2), sub_group_reduce_add(sumf.s3)
);
if (get_sub_group_local_id() == 0) {
if (first_row + 0 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 0] = tot.s0;
}
if (first_row + 1 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 1] = tot.s1;
}
if (first_row + 2 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 2] = tot.s2;
}
if (first_row + 3 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 3] = tot.s3;
}
}
}
#ifdef INTEL_GPU
REQD_SUBGROUP_SIZE_16
#elif defined (ADRENO_GPU)
REQD_SUBGROUP_SIZE_64
#endif
kernel void kernel_mul_mv_q5_1_f32(
global void * src0,
ulong offset0,
global float * src1,
ulong offset1,
global float * dst,
ulong offsetd,
int ne00,
int ne01,
int ne02,
int ne10,
int ne12,
int ne0,
int ne1,
int r2,
int r3
) {
src0 = (global void*)((global char*)src0 + offset0);
src1 = (global float*)((global char*)src1 + offset1);
dst = (global float*)((global char*)dst + offsetd);
mul_vec_q_n_f32(src0, src1, dst, ne00, ne01, ne02, ne10, ne12, ne0, ne1, r2, r3);
}
@@ -0,0 +1,247 @@
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#ifdef cl_intel_subgroups
#pragma OPENCL EXTENSION cl_intel_subgroups : enable
#else
#pragma OPENCL EXTENSION cl_khr_subgroups : enable
#endif
#ifdef cl_intel_required_subgroup_size
#pragma OPENCL EXTENSION cl_intel_required_subgroup_size : enable
#define INTEL_GPU 1
#define REQD_SUBGROUP_SIZE_16 __attribute__((intel_reqd_sub_group_size(16)))
#define REQD_SUBGROUP_SIZE_32 __attribute__((intel_reqd_sub_group_size(32)))
#elif defined(cl_qcom_reqd_sub_group_size)
#pragma OPENCL EXTENSION cl_qcom_reqd_sub_group_size : enable
#define ADRENO_GPU 1
#define REQD_SUBGROUP_SIZE_64 __attribute__((qcom_reqd_sub_group_size("half")))
#define REQD_SUBGROUP_SIZE_128 __attribute__((qcom_reqd_sub_group_size("full")))
#endif
#define QK5_1 32
inline float block_q5_1_dot_y_flat(
global const uchar * x,
global const uint * qh_ptr,
global const half * dh,
global const half * mh,
float sumy,
float16 yl,
int il,
global const float * yb
) {
float d = *dh;
float m = *mh;
global const ushort * qs = ((global const ushort *)(x + il));
float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
acc.s0 += yl.s0 * (qs[0] & 0x000F);
acc.s0 += yl.s1 * (qs[0] & 0x0F00);
acc.s0 += yl.s8 * (qs[0] & 0x00F0);
acc.s3 += yl.s9 * (qs[0] & 0xF000);
acc.s0 += yl.s2 * (qs[1] & 0x000F);
acc.s1 += yl.s3 * (qs[1] & 0x0F00);
acc.s2 += yl.sa * (qs[1] & 0x00F0);
acc.s3 += yl.sb * (qs[1] & 0xF000);
acc.s0 += yl.s4 * (qs[2] & 0x000F);
acc.s1 += yl.s5 * (qs[2] & 0x0F00);
acc.s2 += yl.sc * (qs[2] & 0x00F0);
acc.s3 += yl.sd * (qs[2] & 0xF000);
acc.s0 += yl.s6 * (qs[3] & 0x000F);
acc.s1 += yl.s7 * (qs[3] & 0x0F00);
acc.s2 += yl.se * (qs[3] & 0x00F0);
acc.s3 += yl.sf * (qs[3] & 0xF000);
uint qh_val = *qh_ptr;
uchar qh_lo = (uchar)((qh_val >> il) & 0xFF);
uchar qh_hi = (uchar)((qh_val >> (il + 16)) & 0xFF);
float qh_sum = 0.0f;
qh_sum += yb[0] * (float)((qh_lo >> 0) & 1);
qh_sum += yb[1] * (float)((qh_lo >> 1) & 1);
qh_sum += yb[2] * (float)((qh_lo >> 2) & 1);
qh_sum += yb[3] * (float)((qh_lo >> 3) & 1);
qh_sum += yb[4] * (float)((qh_lo >> 4) & 1);
qh_sum += yb[5] * (float)((qh_lo >> 5) & 1);
qh_sum += yb[6] * (float)((qh_lo >> 6) & 1);
qh_sum += yb[7] * (float)((qh_lo >> 7) & 1);
qh_sum += yb[16] * (float)((qh_hi >> 0) & 1);
qh_sum += yb[17] * (float)((qh_hi >> 1) & 1);
qh_sum += yb[18] * (float)((qh_hi >> 2) & 1);
qh_sum += yb[19] * (float)((qh_hi >> 3) & 1);
qh_sum += yb[20] * (float)((qh_hi >> 4) & 1);
qh_sum += yb[21] * (float)((qh_hi >> 5) & 1);
qh_sum += yb[22] * (float)((qh_hi >> 6) & 1);
qh_sum += yb[23] * (float)((qh_hi >> 7) & 1);
return d * (acc.s0 + acc.s1 + acc.s2 + acc.s3 + 16.0f * qh_sum) + sumy * m;
}
#undef N_DST
#undef N_SIMDGROUP
#undef N_SIMDWIDTH
#ifdef INTEL_GPU
#define N_DST 4 // each subgroup works on 4 rows
#define N_SIMDGROUP 1 // number of subgroups in a thread group
#define N_SIMDWIDTH 16 // assuming subgroup size is 16
#elif defined (ADRENO_GPU)
#define N_DST 4
#define N_SIMDGROUP 1
#define N_SIMDWIDTH 64
#endif
inline void mul_vec_q_n_f32_flat(
global void * src0_qs,
global void * src0_qh,
global void * src0_d,
global void * src0_m,
global float * src1,
global float * dst,
int ne00,
int ne01,
int ne02,
int ne10,
int ne12,
int ne0,
int ne1,
int r2,
int r3
) {
const ulong nb = ne00/QK5_1;
int r0 = get_group_id(0);
int r1 = get_group_id(1);
int im = get_group_id(2);
int first_row = (r0 * N_SIMDGROUP + get_sub_group_id()) * N_DST;
int i12 = im%ne12;
int i13 = im/ne12;
ulong offset0 = first_row * nb + (i12/r2)*(nb*ne01) + (i13/r3)*(nb*ne01*ne02);
ulong offset0_qs = offset0 * (QK5_1/2);
global uchar * x = (global uchar *) src0_qs + offset0_qs;
global uint * qh = (global uint *) src0_qh + offset0;
global half * d = (global half *) src0_d + offset0;
global half * ms = (global half *) src0_m + offset0;
global float * y = (global float *) src1 + r1*ne10 + im*ne00*ne1;
float16 yl;
float4 sumf = (float4)(0.f, 0.f, 0.f, 0.f);
int ix = get_sub_group_local_id()/2;
int il = 8*(get_sub_group_local_id()%2);
global float * yb = y + ix * QK5_1 + il;
for (int ib = ix; ib < nb; ib += N_SIMDWIDTH/2) {
float sumy = 0;
sumy += yb[0];
sumy += yb[1];
sumy += yb[2];
sumy += yb[3];
sumy += yb[4];
sumy += yb[5];
sumy += yb[6];
sumy += yb[7];
sumy += yb[16];
sumy += yb[17];
sumy += yb[18];
sumy += yb[19];
sumy += yb[20];
sumy += yb[21];
sumy += yb[22];
sumy += yb[23];
yl.s0 = yb[0];
yl.s1 = yb[1]/256.f;
yl.s2 = yb[2];
yl.s3 = yb[3]/256.f;
yl.s4 = yb[4];
yl.s5 = yb[5]/256.f;
yl.s6 = yb[6];
yl.s7 = yb[7]/256.f;
yl.s8 = yb[16]/16.f;
yl.s9 = yb[17]/4096.f;
yl.sa = yb[18]/16.f;
yl.sb = yb[19]/4096.f;
yl.sc = yb[20]/16.f;
yl.sd = yb[21]/4096.f;
yl.se = yb[22]/16.f;
yl.sf = yb[23]/4096.f;
sumf.s0 += block_q5_1_dot_y_flat(x + ib*(QK5_1/2) + 0*nb*(QK5_1/2), qh + ib + 0*nb, d + ib + 0*nb, ms + ib + 0*nb, sumy, yl, il, yb);
sumf.s1 += block_q5_1_dot_y_flat(x + ib*(QK5_1/2) + 1*nb*(QK5_1/2), qh + ib + 1*nb, d + ib + 1*nb, ms + ib + 1*nb, sumy, yl, il, yb);
sumf.s2 += block_q5_1_dot_y_flat(x + ib*(QK5_1/2) + 2*nb*(QK5_1/2), qh + ib + 2*nb, d + ib + 2*nb, ms + ib + 2*nb, sumy, yl, il, yb);
sumf.s3 += block_q5_1_dot_y_flat(x + ib*(QK5_1/2) + 3*nb*(QK5_1/2), qh + ib + 3*nb, d + ib + 3*nb, ms + ib + 3*nb, sumy, yl, il, yb);
yb += QK5_1 * (N_SIMDWIDTH/2);
}
float4 tot = (float4)(
sub_group_reduce_add(sumf.s0), sub_group_reduce_add(sumf.s1),
sub_group_reduce_add(sumf.s2), sub_group_reduce_add(sumf.s3)
);
if (get_sub_group_local_id() == 0) {
if (first_row + 0 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 0] = tot.s0;
}
if (first_row + 1 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 1] = tot.s1;
}
if (first_row + 2 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 2] = tot.s2;
}
if (first_row + 3 < ne01) {
dst[r1*ne0 + im*ne0*ne1 + first_row + 3] = tot.s3;
}
}
}
#ifdef INTEL_GPU
REQD_SUBGROUP_SIZE_16
#elif defined (ADRENO_GPU)
REQD_SUBGROUP_SIZE_64
#endif
kernel void kernel_mul_mv_q5_1_f32_flat(
global void * src0_qs,
global void * src0_qh,
global void * src0_d,
global void * src0_m,
global float * src1,
ulong offset1,
global float * dst,
ulong offsetd,
int ne00,
int ne01,
int ne02,
int ne10,
int ne12,
int ne0,
int ne1,
int r2,
int r3
) {
src1 = (global float*)((global char*)src1 + offset1);
dst = (global float*)((global char*)dst + offsetd);
mul_vec_q_n_f32_flat(src0_qs, src0_qh, src0_d, src0_m, src1, dst, ne00, ne01, ne02, ne10, ne12, ne0, ne1, r2, r3);
}
+1
View File
@@ -45,6 +45,7 @@ namespace syclexp = sycl::ext::oneapi::experimental;
#define GGML_COMMON_IMPL_SYCL
#define SYCL_FLASH_ATTN //remove it to disable FLASH_ATTENTION in building.
#define SYCL_FAST_FP16 //don't change. remove it will break fattn-tile.hpp building
#define GGML_SYCL_FA_ALL_QUANTS //define it to enable all quantization types in flash attention. undefine it to only support F16, Q4_0 and Q8_0 in flash attention.
/* suppress warning spam */
#pragma clang diagnostic push
+23 -2
View File
@@ -107,6 +107,19 @@ static void dequantize_row_q3_K_sycl(const void *vx, dst_t *y, const int64_t k,
#endif
}
template <typename dst_t>
static void dequantize_row_q3_K_sycl_reorder(const void *vx, dst_t *y, const int64_t k,
dpct::queue_ptr stream) {
const int64_t nb = k / QK_K;
dpct::has_capability_or_fail(stream->get_device(), { sycl::aspect::fp16 });
stream->parallel_for(
sycl::nd_range<3>(sycl::range<3>(1, 1, nb) * sycl::range<3>(1, 1, 64), sycl::range<3>(1, 1, 64)),
[=](sycl::nd_item<3> item_ct1) {
dequantize_block_q3_K_reorder(vx, y, item_ct1, nb);
});
}
template <typename dst_t>
static void dequantize_row_q4_0_sycl(const void *vx, dst_t *y, const int64_t k,
dpct::queue_ptr stream) {
@@ -652,7 +665,11 @@ to_fp16_sycl_t ggml_get_to_fp16_sycl(ggml_type type, ggml_tensor * dst) {
case GGML_TYPE_Q2_K:
return dequantize_row_q2_K_sycl;
case GGML_TYPE_Q3_K:
return dequantize_row_q3_K_sycl;
if (dst->src[0]->extra && ((ggml_tensor_extra_gpu *) dst->src[0]->extra)->optimized_feature.reorder) {
return dequantize_row_q3_K_sycl_reorder;
} else {
return dequantize_row_q3_K_sycl;
}
case GGML_TYPE_Q4_K:
if (dst->src[0]->extra && ((ggml_tensor_extra_gpu *) dst->src[0]->extra)->optimized_feature.reorder) {
return dequantize_row_q4_K_sycl_reorder;
@@ -730,7 +747,11 @@ to_fp32_sycl_t ggml_get_to_fp32_sycl(ggml_type type, ggml_tensor *dst) {
case GGML_TYPE_Q2_K:
return dequantize_row_q2_K_sycl;
case GGML_TYPE_Q3_K:
return dequantize_row_q3_K_sycl;
if (dst->src[0]->extra && ((ggml_tensor_extra_gpu *) dst->src[0]->extra)->optimized_feature.reorder) {
return dequantize_row_q3_K_sycl_reorder;
} else {
return dequantize_row_q3_K_sycl;
}
case GGML_TYPE_Q4_K:
if (dst->src[0]->extra &&
((ggml_tensor_extra_gpu*)dst->src[0]->extra)->optimized_feature.reorder) {
+529
View File
@@ -20,6 +20,10 @@ typedef void (*dequantize_kernel_t)(const void * vx, const int64_t ib, const int
typedef void (*dequantize_kernel_t_reorder)(const void *d, const int64_t ib, const void *qs,
const int iqs, dfloat2 &v);
#if QK_K == 256
static inline void get_scale_min_k4(int j, const uint8_t * q, uint8_t & d, uint8_t & m);
#endif
static __dpct_inline__ void dequantize_q4_0(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
const block_q4_0 * x = (const block_q4_0 *) vx;
@@ -90,6 +94,474 @@ static __dpct_inline__ void dequantize_q4_1(const void *vx, const int64_t ib,
#endif // GGML_SYCL_F16
}
static __dpct_inline__ void dequantize_q4_K(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_q4_K * x = (const block_q4_K *) vx;
const sycl::half2 dm = x[ib].dm;
const float dall = dm[0];
const float dmin = dm[1];
auto dequantize_one = [&](const int idx) -> dfloat {
const int il = idx / 64;
const int in = idx % 64;
const int is = 2 * il + (in >= 32 ? 1 : 0);
const int off = in & 31;
const int qsi = 32 * il + off;
uint8_t sc;
uint8_t m;
get_scale_min_k4(is, x[ib].scales, sc, m);
const uint8_t q = x[ib].qs[qsi];
const uint8_t qv = (in >= 32) ? (q >> 4) : (q & 0xF);
return sycl::fma((dfloat) qv, (dfloat) (dall * sc), (dfloat) (-dmin * m));
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("Q4_K dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_q2_K(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_q2_K * x = (const block_q2_K *) vx;
const float dall = x[ib].dm[0];
const float dmin = x[ib].dm[1];
auto dequantize_one = [&](const int idx) -> dfloat {
const int n = idx / 128;
const int r = idx % 128;
const int g = r / 32;
const int l = r % 32;
const int is = 8 * n + l / 16;
const uint8_t q = x[ib].qs[32 * n + l];
const uint8_t sc = x[ib].scales[is + 2 * g];
const float d = dall * (sc & 0xF);
const float m = dmin * (sc >> 4);
return sycl::fma((dfloat) ((q >> (2 * g)) & 3), (dfloat) d, (dfloat) (-m));
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("Q2_K dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_q3_K(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_q3_K * x = (const block_q3_K *) vx;
const float d_all = x[ib].d;
auto dequantize_one = [&](const int idx) -> dfloat {
const int n = idx / 128;
const int r = idx % 128;
const int j = r / 32;
const int l = r % 32;
const int is0 = l / 16;
const int is = 8 * n + 2 * j + is0;
const int shift = 2 * j;
const uint8_t m = 1 << (4 * n + j);
const int8_t us = is < 4 ? (x[ib].scales[is - 0] & 0xF) | (((x[ib].scales[is + 8] >> 0) & 3) << 4) :
is < 8 ? (x[ib].scales[is - 0] & 0xF) | (((x[ib].scales[is + 4] >> 2) & 3) << 4) :
is < 12 ? (x[ib].scales[is - 8] >> 4) | (((x[ib].scales[is + 0] >> 4) & 3) << 4) :
(x[ib].scales[is - 8] >> 4) | (((x[ib].scales[is - 4] >> 6) & 3) << 4);
const float dl = d_all * (us - 32);
const uint8_t q = x[ib].qs[32 * n + l];
const uint8_t h = x[ib].hmask[l];
const int8_t qv = ((q >> shift) & 3) - ((h & m) ? 0 : 4);
return (dfloat) (dl * qv);
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("Q3_K dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_q5_K(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_q5_K * x = (const block_q5_K *) vx;
const float dall = x[ib].dm[0];
const float dmin = x[ib].dm[1];
auto dequantize_one = [&](const int idx) -> dfloat {
const int il = idx / 64;
const int in = idx % 64;
const int is = 2 * il + (in >= 32 ? 1 : 0);
const int ir = (in & 31) / 2;
const int iq = in & 1;
const uint8_t q = x[ib].qs[32 * il + 2 * ir + iq];
const uint8_t h = x[ib].qh[2 * ir + iq];
const uint8_t qv = (in >= 32) ? (q >> 4) : (q & 0xF);
uint8_t sc;
uint8_t m;
get_scale_min_k4(is, x[ib].scales, sc, m);
const float d = dall * sc;
const float mn = dmin * m;
const uint8_t hm = 1 << (2 * il + (in >= 32 ? 1 : 0));
return sycl::fma((dfloat) (qv + ((h & hm) ? 16 : 0)), (dfloat) d, (dfloat) (-mn));
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("Q5_K dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_q6_K(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_q6_K * x = (const block_q6_K *) vx;
const float d = x[ib].d;
auto dequantize_one = [&](const int idx) -> dfloat {
const int ip = idx / 128;
const int in = idx % 128;
const int il = in & 31;
const int ig = in / 32;
const int is = 8 * ip + il / 16;
const uint8_t ql0 = x[ib].ql[64 * ip + il];
const uint8_t ql1 = x[ib].ql[64 * ip + il + 32];
const uint8_t qh = x[ib].qh[32 * ip + il];
const int8_t * sc = x[ib].scales + is;
uint8_t qv;
int8_t scale;
if (ig == 0) {
qv = (ql0 & 0xF) | (((qh >> 0) & 3) << 4);
scale = sc[0];
} else if (ig == 1) {
qv = (ql1 & 0xF) | (((qh >> 2) & 3) << 4);
scale = sc[2];
} else if (ig == 2) {
qv = (ql0 >> 4) | (((qh >> 4) & 3) << 4);
scale = sc[4];
} else {
qv = (ql1 >> 4) | (((qh >> 6) & 3) << 4);
scale = sc[6];
}
return (dfloat) (d * scale * ((int8_t) qv - 32));
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("Q6_K dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_mxfp4(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
const block_mxfp4 * x = (const block_mxfp4 *) vx;
const float d = ggml_sycl_e8m0_to_fp32(x[ib].e);
const uint8_t q = x[ib].qs[iqs];
v.x() = d * kvalues_mxfp4[q & 0xF] * 0.5f;
v.y() = d * kvalues_mxfp4[q >> 4] * 0.5f;
}
static __dpct_inline__ void dequantize_q1_0(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
const block_q1_0 * x = (const block_q1_0 *) vx;
const dfloat d = x[ib].d;
const int bit_index_0 = iqs + 0;
const int bit_index_1 = iqs + 1;
const int bit_0 = (x[ib].qs[bit_index_0 / 8] >> (bit_index_0 % 8)) & 1;
const int bit_1 = (x[ib].qs[bit_index_1 / 8] >> (bit_index_1 % 8)) & 1;
v.x() = (2 * bit_0 - 1) * d;
v.y() = (2 * bit_1 - 1) * d;
}
static __dpct_inline__ void dequantize_nvfp4(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
const block_nvfp4 & xb = ((const block_nvfp4 *) vx)[ib];
auto dequantize_one = [&](const int idx) -> dfloat {
const int sub = idx / QK_NVFP4_SUB;
const int j = idx % QK_NVFP4_SUB;
const int jh = j % (QK_NVFP4_SUB / 2);
const float d = ggml_sycl_ue4m3_to_fp32(xb.d[sub]);
const uint8_t q = xb.qs[sub * (QK_NVFP4_SUB / 2) + jh];
const uint8_t qv = (j < (QK_NVFP4_SUB / 2)) ? (q & 0x0F) : (q >> 4);
return d * kvalues_mxfp4[qv];
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
}
static __dpct_inline__ void dequantize_iq2_xxs(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_iq2_xxs * x = (const block_iq2_xxs *) vx;
auto dequantize_one = [&](const int idx) -> dfloat {
const int ib8 = idx / 32;
const int r = idx % 32;
const int il = r / 8;
const int j = r % 8;
const uint16_t * q2 = x[ib].qs + 4 * ib8;
const uint8_t * aux8 = (const uint8_t *) q2;
const uint8_t * grid = (const uint8_t *) (iq2xxs_grid + aux8[il]);
const uint32_t aux32 = q2[2] | (q2[3] << 16);
const float d = (float) x[ib].d * (0.5f + (aux32 >> 28)) * 0.25f;
const uint8_t signs = ksigns_iq2xs[(aux32 >> (7 * il)) & 127];
return d * grid[j] * ((signs & kmask_iq2xs[j]) ? -1.f : 1.f);
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("IQ2_XXS dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_iq2_xs(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_iq2_xs * x = (const block_iq2_xs *) vx;
auto dequantize_one = [&](const int idx) -> dfloat {
const int ib8 = idx / 32;
const int r = idx % 32;
const int il = r / 8;
const int j = r % 8;
const uint16_t * q2 = x[ib].qs + 4 * ib8;
const uint8_t * grid = (const uint8_t *) (iq2xs_grid + (q2[il] & 511));
const float d = (float) x[ib].d * (0.5f + ((x[ib].scales[ib8] >> (4 * (il / 2))) & 0xf)) * 0.25f;
const uint8_t signs = ksigns_iq2xs[q2[il] >> 9];
return d * grid[j] * ((signs & kmask_iq2xs[j]) ? -1.f : 1.f);
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("IQ2_XS dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_iq2_s(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_iq2_s * x = (const block_iq2_s *) vx;
auto dequantize_one = [&](const int idx) -> dfloat {
const int ib8 = idx / 32;
const int r = idx % 32;
const int il = r / 8;
const int j = r % 8;
const uint16_t grid_id = x[ib].qs[4 * ib8 + il] | ((x[ib].qh[ib8] << (8 - 2 * il)) & 0x300);
const uint8_t * grid = (const uint8_t *) (iq2s_grid + grid_id);
const float d = (float) x[ib].d * (0.5f + ((x[ib].scales[ib8] >> (4 * (il / 2))) & 0xf)) * 0.25f;
const uint8_t signs = x[ib].qs[QK_K / 8 + 4 * ib8 + il];
return d * grid[j] * ((signs & kmask_iq2xs[j]) ? -1.f : 1.f);
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("IQ2_S dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_iq3_xxs(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_iq3_xxs * x = (const block_iq3_xxs *) vx;
auto dequantize_one = [&](const int idx) -> dfloat {
const int ib8 = idx / 32;
const int r = idx % 32;
const int il = r / 8;
const int j = r % 8;
const uint8_t * q3 = x[ib].qs + 8 * ib8;
const uint16_t * gas = (const uint16_t *) (x[ib].qs + QK_K / 4) + 2 * ib8;
const uint8_t * grid1 = (const uint8_t *) (iq3xxs_grid + q3[2 * il + 0]);
const uint8_t * grid2 = (const uint8_t *) (iq3xxs_grid + q3[2 * il + 1]);
const uint32_t aux32 = gas[0] | (gas[1] << 16);
const float d = (float) x[ib].d * (0.5f + (aux32 >> 28)) * 0.5f;
const uint8_t signs = ksigns_iq2xs[(aux32 >> (7 * il)) & 127];
if (j < 4) {
return d * grid1[j] * ((signs & kmask_iq2xs[j + 0]) ? -1.f : 1.f);
}
return d * grid2[j - 4] * ((signs & kmask_iq2xs[j + 0]) ? -1.f : 1.f);
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("IQ3_XXS dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_iq3_s(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_iq3_s * x = (const block_iq3_s *) vx;
auto dequantize_one = [&](const int idx) -> dfloat {
const int ib8 = idx / 32;
const int r = idx % 32;
const int il = r / 8;
const int j = r % 8;
const uint8_t * qs = x[ib].qs + 8 * ib8;
const uint16_t grid1_id = qs[2 * il + 0] | ((x[ib].qh[ib8] << (8 - 2 * il)) & 256);
const uint16_t grid2_id = qs[2 * il + 1] | ((x[ib].qh[ib8] << (7 - 2 * il)) & 256);
const uint8_t * grid1 = (const uint8_t *) (iq3s_grid + grid1_id);
const uint8_t * grid2 = (const uint8_t *) (iq3s_grid + grid2_id);
const float d = (float) x[ib].d * (1 + 2 * ((x[ib].scales[ib8 / 2] >> (4 * (ib8 % 2))) & 0xf));
const uint8_t signs = x[ib].signs[4 * ib8 + il];
if (j < 4) {
return d * grid1[j] * ((signs & kmask_iq2xs[j + 0]) ? -1.f : 1.f);
}
return d * grid2[j - 4] * ((signs & kmask_iq2xs[j + 0]) ? -1.f : 1.f);
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("IQ3_S dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_iq1_s(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_iq1_s * x = (const block_iq1_s *) vx;
auto dequantize_one = [&](const int idx) -> dfloat {
const int ib8 = idx / 32;
const int r = idx % 32;
const int il = r / 8;
const int j = r % 8;
const float delta = (x[ib].qh[ib8] & 0x8000) ? (-1.f - IQ1S_DELTA) : (-1.f + IQ1S_DELTA);
const float d = (float) x[ib].d * (2 * ((x[ib].qh[ib8] >> 12) & 7) + 1);
const uint16_t grid_id = x[ib].qs[4 * ib8 + il] | (((x[ib].qh[ib8] >> (3 * il)) & 7) << 8);
const uint32_t g = iq1s_grid_gpu[grid_id];
const int8_t qv = (j < 4) ? ((g >> (8 * j)) & 0x0F) : ((g >> (8 * (j - 4) + 4)) & 0x0F);
return d * (qv + delta);
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("IQ1_S dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_iq1_m(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_iq1_m * x = (const block_iq1_m *) vx;
auto dequantize_one = [&](const int idx) -> dfloat {
const int ib8 = idx / 32;
const int r = idx % 32;
const int il = r / 8;
const int j = r % 8;
const uint16_t * sc = (const uint16_t *) x[ib].scales;
iq1m_scale_t scale;
scale.u16 = (sc[0] >> 12) | ((sc[1] >> 8) & 0x00f0) | ((sc[2] >> 4) & 0x0f00) | (sc[3] & 0xf000);
const int ib16 = 2 * ib8 + il / 2;
const float d = (float) scale.f16 * (2 * ((sc[ib16 / 4] >> (3 * (ib16 % 4))) & 0x7) + 1);
const uint8_t qh = x[ib].qh[2 * ib8 + il / 2];
const float delta = (qh & (0x08 << (4 * (il % 2)))) ? (-1.f - IQ1M_DELTA) : (-1.f + IQ1M_DELTA);
const uint16_t grid_id = x[ib].qs[4 * ib8 + il] | (((qh >> (4 * (il % 2))) & 7) << 8);
const uint32_t g = iq1s_grid_gpu[grid_id];
const int8_t qv = (j < 4) ? ((g >> (8 * j)) & 0x0F) : ((g >> (8 * (j - 4) + 4)) & 0x0F);
return d * (qv + delta);
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("IQ1_M dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_iq4_nl(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
const block_iq4_nl * x = (const block_iq4_nl *) vx;
const float d = (float) x[ib].d;
auto dequantize_one = [&](const int idx) -> dfloat {
if (idx < 16) {
return d * kvalues_iq4nl[x[ib].qs[idx] & 0xF];
}
return d * kvalues_iq4nl[x[ib].qs[idx - 16] >> 4];
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
}
static __dpct_inline__ void dequantize_iq4_xs(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
#if QK_K == 256
const block_iq4_xs * x = (const block_iq4_xs *) vx;
auto dequantize_one = [&](const int idx) -> dfloat {
const int ib8 = idx / 32;
const int r = idx % 32;
const int byte_idx = (r < 16) ? r : (r - 16);
const uint8_t q = x[ib].qs[16 * ib8 + byte_idx];
const uint8_t qv = (r < 16) ? (q & 0x0F) : (q >> 4);
const float d = (float) x[ib].d * ((((x[ib].scales_l[ib8 / 2] >> (4 * (ib8 % 2))) & 0xf) |
(((x[ib].scales_h >> (2 * ib8)) & 3) << 4)) - 32);
return d * kvalues_iq4nl[qv];
};
v.x() = dequantize_one(iqs + 0);
v.y() = dequantize_one(iqs + 1);
#else
GGML_ABORT("IQ4_XS dequantize not supported for QK_K != 256");
#endif
}
static __dpct_inline__ void dequantize_q5_0(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
const block_q5_0 * x = (const block_q5_0 *) vx;
@@ -390,6 +862,63 @@ static void dequantize_block_q3_K(const void * __restrict__ vx, dst_t * __restri
}
template<typename dst_t>
static void dequantize_block_q3_K_reorder(const void * __restrict__ vx, dst_t * __restrict__ yy,
const sycl::nd_item<3> & item_ct1, int64_t n_blocks) {
#if QK_K == 256
const int64_t i = item_ct1.get_group(2);
if (i >= n_blocks) {
return;
}
const uint8_t * base = static_cast<const uint8_t *>(vx);
const size_t qs_offset = i * (QK_K / 4);
const size_t hmask_offset = n_blocks * (QK_K / 4) + i * (QK_K / 8);
const size_t scales_offset = n_blocks * (QK_K / 4) + n_blocks * (QK_K / 8) + i * 12;
const size_t d_offset = n_blocks * (QK_K / 4) + n_blocks * (QK_K / 8) + n_blocks * 12 +
i * sizeof(ggml_half);
const uint8_t * qs = base + qs_offset;
const uint8_t * hmask = base + hmask_offset;
const uint8_t * scales = base + scales_offset;
const float d_all = static_cast<float>(*reinterpret_cast<const ggml_half *>(base + d_offset));
const int64_t r = item_ct1.get_local_id(2) / 4;
const int64_t tid = r / 2;
const int64_t is0 = r % 2;
const int64_t l0 = 16 * is0 + 4 * (item_ct1.get_local_id(2) % 4);
const int64_t n = tid / 4;
const int64_t j = tid - 4 * n;
const int64_t is = 8 * n + 2 * j + is0;
const int shift = 2 * j;
uint8_t m = 1 << (4 * n + j);
uint8_t us = is < 4
? (scales[is - 0] & 0xF) | (((scales[is + 8] >> 0) & 3) << 4)
: is < 8
? (scales[is - 0] & 0xF) | (((scales[is + 4] >> 2) & 3) << 4)
: is < 12
? (scales[is - 8] >> 4) | (((scales[is + 0] >> 4) & 3) << 4)
: (scales[is - 8] >> 4) | (((scales[is - 4] >> 6) & 3) << 4);
const float dl = d_all * (us - 32);
dst_t * y = yy + i * QK_K + 128 * n + 32 * j;
const uint8_t * q = qs + 32 * n;
const uint8_t * hm = hmask;
for (int l = l0; l < l0 + 4; ++l) {
y[l] = dl * ((int8_t) ((q[l] >> shift) & 3) - ((hm[l] & m) ? 0 : 4));
}
#else
GGML_UNUSED(vx);
GGML_UNUSED(yy);
GGML_UNUSED(item_ct1);
GGML_UNUSED(n_blocks);
GGML_ABORT("Q3_K reorder dequantize not supported for QK_K != 256");
#endif
}
#if QK_K == 256
static inline void get_scale_min_k4(int j, const uint8_t * q, uint8_t & d, uint8_t & m) {
if (j < 4) {
+119 -1
View File
@@ -501,6 +501,103 @@ static void dequantize_mul_mat_vec_q3_k(const void *__restrict__ vx,
}
}
static void dequantize_mul_mat_vec_q3_k_reorder(const void *__restrict__ vx,
const float *__restrict__ yy,
float *__restrict__ dst,
const int ncols, int nrows,
const sycl::nd_item<3> &item_ct1) {
const int row = item_ct1.get_group(2) * item_ct1.get_local_range(1) +
item_ct1.get_local_id(1);
if (row > nrows) return;
const int num_blocks_per_row = ncols / QK_K;
const int ib0 = row*num_blocks_per_row;
// SOA base pointers for the reordered layout:
// [qs: nb * (QK_K/4)] [hmask: nb * (QK_K/8)] [scales: nb * 12] [d: nb * sizeof(half)]
const int nb = nrows * num_blocks_per_row;
const uint8_t * qs_base = (const uint8_t *)vx;
const uint8_t * hmask_base = qs_base + (size_t)nb * (QK_K / 4);
const uint8_t * scales_base = hmask_base + (size_t)nb * (QK_K / 8);
const sycl::half * d_base = (const sycl::half *)(scales_base + (size_t)nb * 12);
float tmp = 0; // partial sum for thread in warp
#if QK_K == 256
const uint16_t kmask1 = 0x0303;
const uint16_t kmask2 = 0x0f0f;
const int tid =
item_ct1.get_local_id(2) / K_QUANTS_PER_ITERATION; // 0...31 or 0...16
const int ix =
item_ct1.get_local_id(2) % K_QUANTS_PER_ITERATION; // 0 or 0,1
const int n = K_QUANTS_PER_ITERATION; // iterations in the inner loop
const int step = 16/K_QUANTS_PER_ITERATION;
const int im = tid/step; // 0 or 1. 0 computes 0..., 1 computes 128...
const int in = tid - step*im; // 0....15 or 0...7
const uint8_t m = 1 << (4*im);
const int l0 = n*in; // 0...15 or 0...14 in steps of 2
const int q_offset = 32*im + l0;
const int y_offset = 128*im + l0;
uint16_t utmp[4];
const int8_t * s = (const int8_t *)utmp;
const uint16_t s_shift = 4*im;
for (int i = ix; i < num_blocks_per_row; i += K_QUANTS_PER_ITERATION) {
const int bi = ib0 + i;
const float * y = yy + i * QK_K + y_offset;
const uint8_t * q = qs_base + bi * (QK_K / 4) + q_offset;
const uint8_t * h = hmask_base + bi * (QK_K / 8) + l0;
const uint16_t * a = (const uint16_t *)(scales_base + bi * 12);
utmp[0] = ((a[0] >> s_shift) & kmask2) | (((a[4] >> (s_shift + 0)) & kmask1) << 4);
utmp[1] = ((a[1] >> s_shift) & kmask2) | (((a[5] >> (s_shift + 0)) & kmask1) << 4);
utmp[2] = ((a[2] >> s_shift) & kmask2) | (((a[4] >> (s_shift + 2)) & kmask1) << 4);
utmp[3] = ((a[3] >> s_shift) & kmask2) | (((a[5] >> (s_shift + 2)) & kmask1) << 4);
const float d = d_base[bi];
float sum = 0;
for (int l = 0; l < n; ++l) {
sum += y[l+ 0] * (s[0] - 32) * (((q[l] >> 0) & 3) - (h[l] & (m << 0) ? 0 : 4))
+ y[l+32] * (s[2] - 32) * (((q[l] >> 2) & 3) - (h[l] & (m << 1) ? 0 : 4))
+ y[l+64] * (s[4] - 32) * (((q[l] >> 4) & 3) - (h[l] & (m << 2) ? 0 : 4))
+ y[l+96] * (s[6] - 32) * (((q[l] >> 6) & 3) - (h[l] & (m << 3) ? 0 : 4));
sum += y[l+16] * (s[1] - 32) * (((q[l+16] >> 0) & 3) - (h[l+16] & (m << 0) ? 0 : 4))
+ y[l+48] * (s[3] - 32) * (((q[l+16] >> 2) & 3) - (h[l+16] & (m << 1) ? 0 : 4))
+ y[l+80] * (s[5] - 32) * (((q[l+16] >> 4) & 3) - (h[l+16] & (m << 2) ? 0 : 4))
+ y[l+112] * (s[7] - 32) * (((q[l+16] >> 6) & 3) - (h[l+16] & (m << 3) ? 0 : 4));
}
tmp += d * sum;
}
#else
GGML_UNUSED(vx);
GGML_UNUSED(yy);
GGML_UNUSED(ncols);
GGML_UNUSED(item_ct1);
GGML_ABORT("Q3_K reorder DMMV not supported for QK_K != 256");
#endif
// sum up partial sums and write back result
#pragma unroll
for (int mask = QK_WARP_SIZE / 2; mask > 0; mask >>= 1) {
tmp +=
dpct::permute_sub_group_by_xor(item_ct1.get_sub_group(), tmp, mask);
}
if (item_ct1.get_local_id(2) == 0) {
dst[row] = tmp;
}
}
/*
DPCT1110:6: The total declared local variable size in device function
dequantize_mul_mat_vec_q4_k exceeds 128 bytes and may cause high register
@@ -1440,6 +1537,22 @@ static void dequantize_mul_mat_vec_q3_K_sycl(const void *vx, const float *y,
});
}
static void dequantize_mul_mat_vec_q3_K_sycl_reorder(const void *vx, const float *y,
float *dst, const int ncols,
const int nrows,
dpct::queue_ptr stream) {
GGML_ASSERT(ncols % QK_K == 0);
const int ny = 2 / K_QUANTS_PER_ITERATION;
const int block_num_y = (nrows + ny - 1) / ny;
const sycl::range<3> block_nums(1, 1, block_num_y);
const sycl::range<3> block_dims(1, ny, QK_WARP_SIZE);
stream->parallel_for(
sycl::nd_range<3>(block_nums * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(QK_WARP_SIZE)]] {
dequantize_mul_mat_vec_q3_k_reorder(vx, y, dst, ncols, nrows, item_ct1);
});
}
static void dequantize_mul_mat_vec_q4_K_sycl(const void *vx, const float *y,
float *dst, const int ncols,
const int nrows,
@@ -1581,7 +1694,12 @@ void ggml_sycl_op_dequantize_mul_mat_vec(
dequantize_mul_mat_vec_q2_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q3_K:
dequantize_mul_mat_vec_q3_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
if ((ggml_tensor_extra_gpu *) dst->src[0]->extra &&
((ggml_tensor_extra_gpu *) dst->src[0]->extra)->optimized_feature.reorder) {
dequantize_mul_mat_vec_q3_K_sycl_reorder(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
} else {
dequantize_mul_mat_vec_q3_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
}
break;
case GGML_TYPE_Q4_K:
if ((ggml_tensor_extra_gpu *) dst->src[0]->extra &&
+3 -3
View File
@@ -1031,7 +1031,7 @@ void launch_fattn(
auto KV_max_ptr_ct1 = KV_max.ptr;
cgh.parallel_for(sycl::nd_range<3>(blocks_num_KV_max * block_dim_KV_max, block_dim_KV_max),
[=](sycl::nd_item<3> item_ct1) {
[=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(warp_size)]] {
GGML_UNUSED(item_ct1);
flash_attn_mask_to_KV_max<ncols1, warp_size>(
mask_data_ct0, KV_max_ptr_ct1, iter_k, s31, s33,
@@ -1149,7 +1149,7 @@ void launch_fattn(
auto K_ne_ct6 = K->ne[2];
cgh.parallel_for(sycl::nd_range<3>(blocks_num_combine * block_dim_combine, block_dim_combine),
[=](sycl::nd_item<3> item_ct1) {
[=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(warp_size)]] {
GGML_UNUSED(item_ct1);
flash_attn_stream_k_fixup<DV, ncols1, ncols2>(KQV_data_ct0, dst_tmp_meta_ptr_ct1,
Q_ne_ct2, Q_ne_ct3, Q_ne_ct4,
@@ -1169,7 +1169,7 @@ void launch_fattn(
auto KQV_data_ct2 = (float *) KQV->data;
cgh.parallel_for(sycl::nd_range<3>(blocks_num_combine * block_dim_combine, block_dim_combine),
[=](sycl::nd_item<3> item_ct1) {
[=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(warp_size)]] {
GGML_UNUSED(item_ct1);
flash_attn_combine_results<DV>(
dst_tmp_ptr_ct0, dst_tmp_meta_ptr_ct1, KQV_data_ct2, parallel_blocks,
+75 -3
View File
@@ -129,11 +129,11 @@ static void get_rows_sycl(ggml_backend_sycl_context & ctx, const ggml_tensor *sr
GGML_UNUSED(ctx);
}
template <typename src0_t>
template <typename src0_t, typename dst_t>
static void get_rows_sycl_float(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
const ggml_tensor *src1, ggml_tensor *dst,
const src0_t *src0_dd, const int32_t *src1_dd,
float *dst_dd, queue_ptr stream) {
dst_t *dst_dd, queue_ptr stream) {
GGML_TENSOR_BINARY_OP_LOCALS
@@ -170,7 +170,7 @@ static void get_rows_sycl_float(ggml_backend_sycl_context & ctx, const ggml_tens
void ggml_sycl_op_get_rows(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_ASSERT(dst->src[1]->type == GGML_TYPE_I32);
GGML_ASSERT(dst->type == GGML_TYPE_F32);
GGML_ASSERT(dst->type == GGML_TYPE_F32 || dst->type == GGML_TYPE_I32 );
GGML_ASSERT(dst->src[0]->nb[0] == ggml_type_size(dst->src[0]->type));
GGML_ASSERT(dst->src[1]->nb[0] == ggml_type_size(dst->src[1]->type));
@@ -191,6 +191,66 @@ void ggml_sycl_op_get_rows(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
get_rows_sycl_float(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_I32:
get_rows_sycl_float(ctx, dst->src[0], dst->src[1], dst, (const int32_t *)dst->src[0]->data,
src1_i32, (int32_t *)dst->data, ctx.stream());
break;
case GGML_TYPE_Q1_0:
get_rows_sycl<QK1_0, 1, dequantize_q1_0>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_MXFP4:
get_rows_sycl<QK_MXFP4, 2, dequantize_mxfp4>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_NVFP4:
get_rows_sycl<QK_NVFP4, 1, dequantize_nvfp4>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_IQ2_XXS:
get_rows_sycl<QK_K, 1, dequantize_iq2_xxs>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_IQ2_XS:
get_rows_sycl<QK_K, 1, dequantize_iq2_xs>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_IQ2_S:
get_rows_sycl<QK_K, 1, dequantize_iq2_s>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_IQ3_XXS:
get_rows_sycl<QK_K, 1, dequantize_iq3_xxs>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_IQ1_S:
get_rows_sycl<QK_K, 1, dequantize_iq1_s>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_IQ1_M:
get_rows_sycl<QK_K, 1, dequantize_iq1_m>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_IQ3_S:
get_rows_sycl<QK_K, 1, dequantize_iq3_s>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_IQ4_NL:
get_rows_sycl<QK4_NL, 1, dequantize_iq4_nl>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_IQ4_XS:
get_rows_sycl<QK_K, 1, dequantize_iq4_xs>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_Q2_K:
get_rows_sycl<QK_K, 1, dequantize_q2_K>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_Q3_K:
get_rows_sycl<QK_K, 1, dequantize_q3_K>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_Q4_0:
get_rows_sycl<QK4_0, QR4_0, dequantize_q4_0>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
@@ -199,6 +259,10 @@ void ggml_sycl_op_get_rows(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
get_rows_sycl<QK4_1, QR4_1, dequantize_q4_1>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_Q4_K:
get_rows_sycl<QK_K, 1, dequantize_q4_K>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_Q5_0:
get_rows_sycl<QK5_0, QR5_0, dequantize_q5_0>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
@@ -207,6 +271,14 @@ void ggml_sycl_op_get_rows(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
get_rows_sycl<QK5_1, QR5_1, dequantize_q5_1>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_Q5_K:
get_rows_sycl<QK_K, 1, dequantize_q5_K>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_Q6_K:
get_rows_sycl<QK_K, 1, dequantize_q6_K>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
break;
case GGML_TYPE_Q8_0:
get_rows_sycl<QK8_0, QR8_0, dequantize_q8_0>(ctx, dst->src[0], dst->src[1], dst, (const float *)dst->src[0]->data,
src1_i32, (float *)dst->data, ctx.stream());
+70
View File
@@ -3549,6 +3549,7 @@ inline bool ggml_sycl_supports_reorder_mul_mat_sycl(enum ggml_type type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
return true;
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
case GGML_TYPE_Q6_K:
@@ -3572,6 +3573,7 @@ inline bool ggml_sycl_supports_reorder_mmvq(enum ggml_type type) {
switch (type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
case GGML_TYPE_Q6_K:
@@ -3791,6 +3793,54 @@ static bool reorder_qw_q4_k(uint8_t * data_device, size_t size, size_t offset, d
return true;
}
static bool reorder_qw_q3_k(uint8_t * data_device, size_t size, size_t offset, dpct::queue_ptr stream) {
GGML_ASSERT(size % sizeof(block_q3_K) == 0);
GGML_ASSERT(offset % sizeof(block_q3_K) == 0);
const int nblocks = size / sizeof(block_q3_K);
sycl_reorder_temp_buffer tmp(stream, size);
if (!tmp) {
GGML_LOG_WARN("%s: failed to allocate %zu bytes for reorder temp buffer, skipping reorder\n", __func__, size);
return false;
}
uint8_t * tmp_buf = static_cast<uint8_t *>(tmp.ptr);
sycl::event copy_event;
SYCL_CHECK(CHECK_TRY_ERROR(copy_event = stream->memcpy(tmp_buf, data_device, size)));
if (!g_ggml_sycl_use_async_mem_op) {
copy_event.wait();
}
auto * qs_ptr = data_device;
auto * hmask_ptr = qs_ptr + (QK_K / 4) * nblocks;
auto * scales_ptr = hmask_ptr + (QK_K / 8) * nblocks;
sycl::half * d_ptr = (sycl::half *) (scales_ptr + 12 * nblocks);
auto reorder_event = stream->parallel_for(nblocks, [=](auto i) {
const block_q3_K * x = (const block_q3_K *) tmp_buf;
const int ib = i;
for (int j = 0; j < QK_K / 4; ++j) {
qs_ptr[ib * (QK_K / 4) + j] = x[ib].qs[j];
}
for (int j = 0; j < QK_K / 8; ++j) {
hmask_ptr[ib * (QK_K / 8) + j] = x[ib].hmask[j];
}
for (int j = 0; j < 12; ++j) {
scales_ptr[ib * 12 + j] = x[ib].scales[j];
}
d_ptr[ib] = x[ib].d;
});
if (!g_ggml_sycl_use_async_mem_op) {
reorder_event.wait_and_throw();
}
return true;
}
static bool reorder_qw_q5_k(uint8_t * data_device, size_t size, size_t offset, dpct::queue_ptr stream) {
GGML_ASSERT(size % sizeof(block_q5_K) == 0);
GGML_ASSERT(offset % sizeof(block_q5_K) == 0);
@@ -3903,6 +3953,8 @@ static bool reorder_qw(const ggml_tensor * src0, dpct::queue_ptr stream) {
return reorder_qw_q4_0(data_device, ncols, nrows, size, 0, stream);
case GGML_TYPE_Q8_0:
return reorder_qw_q8_0(data_device, ncols, nrows, size, 0, stream);
case GGML_TYPE_Q3_K:
return reorder_qw_q3_k(data_device, size, 0, stream);
case GGML_TYPE_Q4_K:
return reorder_qw_q4_k(data_device, size, 0, stream);
case GGML_TYPE_Q5_K:
@@ -5249,13 +5301,31 @@ static bool ggml_backend_sycl_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_OP_GET_ROWS:
{
switch (op->src[0]->type) {
case GGML_TYPE_I32:
case GGML_TYPE_F16:
case GGML_TYPE_BF16:
case GGML_TYPE_F32:
case GGML_TYPE_Q1_0:
case GGML_TYPE_MXFP4:
case GGML_TYPE_NVFP4:
case GGML_TYPE_IQ2_XXS:
case GGML_TYPE_IQ2_XS:
case GGML_TYPE_IQ2_S:
case GGML_TYPE_IQ3_XXS:
case GGML_TYPE_IQ1_S:
case GGML_TYPE_IQ1_M:
case GGML_TYPE_IQ3_S:
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_IQ4_XS:
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_0:
case GGML_TYPE_Q5_1:
case GGML_TYPE_Q5_K:
case GGML_TYPE_Q6_K:
case GGML_TYPE_Q8_0:
return true;
default:
+29 -1
View File
@@ -770,6 +770,26 @@ static void mul_mat_vec_q3_K_q8_1_sycl(const void *vx, const void *vy,
}
}
static void reorder_mul_mat_vec_q3_k_q8_1_sycl(const void * vx, const void * vy, float * dst, const int ncols,
const int nrows, dpct::queue_ptr stream) {
GGML_ASSERT(ncols % QK_K == 0);
// Round up to a whole number of subgroup-sized workgroups; out-of-range rows are skipped inside the kernel.
constexpr size_t num_subgroups = WARP_SIZE;
const int block_num_y = ceil_div(nrows, GGML_SYCL_MMV_Y * (int) num_subgroups) * (int) num_subgroups;
const sycl::range<3> global_size(1, GGML_SYCL_MMV_Y, block_num_y * WARP_SIZE);
const sycl::range<3> workgroup_size(1, GGML_SYCL_MMV_Y, num_subgroups * WARP_SIZE);
stream->submit([&](sycl::handler & cgh) {
cgh.parallel_for(sycl::nd_range<3>(global_size, workgroup_size),
[=](sycl::nd_item<3> nd_item) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
mul_mat_vec_q_reorder<reorder_vec_dot_q_sycl<GGML_TYPE_Q3_K>>(vx, vy, dst, ncols, nrows,
nd_item);
});
});
}
static void mul_mat_vec_q4_K_q8_1_sycl(const void *vx, const void *vy,
float *dst, const int ncols,
const int nrows,
@@ -1153,7 +1173,15 @@ void ggml_sycl_op_mul_mat_vec_q(ggml_backend_sycl_context & ctx, const ggml_tens
mul_mat_vec_q2_K_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff, stream);
break;
case GGML_TYPE_Q3_K:
mul_mat_vec_q3_K_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff, stream);
if ((ggml_tensor_extra_gpu *) dst->src[0]->extra &&
((ggml_tensor_extra_gpu *) dst->src[0]->extra)->optimized_feature.reorder) {
GGML_SYCL_DEBUG("Calling reorder_mul_mat_vec_q3_k_q8_1_sycl\n");
reorder_mul_mat_vec_q3_k_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff,
stream);
} else {
GGML_SYCL_DEBUG("Calling mul_mat_vec_q3_K_q8_1_sycl\n");
mul_mat_vec_q3_K_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff, stream);
}
break;
case GGML_TYPE_Q4_K:
if ((ggml_tensor_extra_gpu *) dst->src[0]->extra &&
+25
View File
@@ -58,6 +58,31 @@ template <> struct block_q_t<GGML_TYPE_Q4_0> {
static constexpr int block_to_q8_1_ratio() { return traits::qk / QK8_1; }
};
template <> struct block_q_t<GGML_TYPE_Q3_K> {
struct traits {
static constexpr uint32_t qk = QK_K;
static constexpr uint32_t qi = QI3_K;
static constexpr uint32_t qr = QR3_K;
static constexpr uint32_t vdr_mmvq = 1;
};
// Reordered layout: [qs (QK_K/4 per block)] [hmask (QK_K/8 per block)] [scales] [d]
static constexpr std::pair<int, int> get_block_offset(const int block_index, const int n_blocks) {
auto qs_offset = block_index * (QK_K / 4);
auto hmask_offset = n_blocks * (QK_K / 4) + block_index * (QK_K / 8);
return { qs_offset, hmask_offset };
}
static constexpr std::pair<int, int> get_d_offset(int nrows, int ncols, const int block_index) {
auto nblocks = (nrows * (ncols / QK_K));
auto total_qs_bytes = nblocks * (QK_K / 4) + nblocks * (QK_K / 8);
return { total_qs_bytes + block_index * 12,
total_qs_bytes + nblocks * 12 + block_index * sizeof(ggml_half) };
}
static constexpr int block_to_q8_1_ratio() { return traits::qk / QK8_1; }
};
template <> struct block_q_t<GGML_TYPE_Q4_K> {
struct traits {
static constexpr uint32_t qk = QK_K;
+35
View File
@@ -394,6 +394,41 @@ template <> struct reorder_vec_dot_q_sycl<GGML_TYPE_Q8_0> {
}
};
template <> struct reorder_vec_dot_q_sycl<GGML_TYPE_Q3_K> {
static constexpr ggml_type gtype = GGML_TYPE_Q3_K;
using q3_k_block = ggml_sycl_reordered::block_q_t<GGML_TYPE_Q3_K>;
using q3_k_traits = typename q3_k_block::traits;
__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr,
const sycl::half2 * q8_1_ds, const int & iqs) {
const uint8_t * base = static_cast<const uint8_t *>(vbq);
const uint8_t * qs = base + ibx_offset.first;
const uint8_t * hmask = base + ibx_offset.second;
const uint8_t * scales = base + d_offset.first;
const ggml_half d = *reinterpret_cast<const ggml_half *>(base + d_offset.second);
const int bq8_offset = QR3_K * (iqs / (QI3_K / 2));
const int scale_offset = iqs - iqs % QI8_1 + (iqs % QI8_1) / (QI8_1 / 2);
const int vl = get_int_from_uint8(qs, iqs);
const int vh = ~get_int_from_uint8(hmask, iqs % (QI3_K / 2)) >> bq8_offset;
int u[QR3_K];
float d8[QR3_K];
#pragma unroll
for (int i = 0; i < QR3_K; ++i) {
const int8_t * quant_base_ptr = q8_1_quant_ptr + (bq8_offset + i) * QK8_1;
u[i] = get_int_from_int8_aligned(quant_base_ptr, iqs % QI8_1);
d8[i] = (*(q8_1_ds + bq8_offset + i))[0];
}
return vec_dot_q3_K_q8_1_impl_mmvq(vl, vh, u, scales, scale_offset, static_cast<float>(d), d8);
}
};
static inline float vec_dot_q4_K_q8_1_common(const int * __restrict__ q4, const uint16_t * __restrict__ scales,
const ggml_half2 & dm, const block_q8_1 * __restrict__ bq8_1,
const int & iqs) {
+321 -106
View File
@@ -62,8 +62,10 @@ typedef struct VkPhysicalDeviceCooperativeMatrixDecodeVectorFeaturesNV {
#include <map>
#include <set>
#include <unordered_map>
#include <shared_mutex>
#include <mutex>
#include <future>
#include <condition_variable>
#include <thread>
#if defined(_MSC_VER)
@@ -158,8 +160,9 @@ struct vk_pipeline_struct {
uint32_t align;
// true if fields have been set by ggml_vk_create_pipeline
bool initialized {};
// set to true to request the pipeline is compiled
std::atomic<bool> needed {};
// true while a compile is in flight, used to dedupe concurrent claims.
// Protected by device->compile_mutex.
bool compile_pending {};
// set to true when the shader has been compiled
std::atomic<bool> compiled {};
// number of registers used, extracted from pipeline executable properties
@@ -618,6 +621,14 @@ static constexpr std::initializer_list<std::array<int, 3>> rms_norm_mul_rope_vie
struct vk_device_struct {
std::recursive_mutex mutex;
mutable std::shared_mutex pinned_memory_mutex;
// Guards compile_pending, all_pipelines, and the dynamic pipeline maps
// (flash_attn, fa_mask_opt, solve_tri, conv2d, etc). The actual compile
// runs with no lock held, so different pipelines can compile in parallel.
// Lock order is device->mutex -> compile_mutex, never the reverse.
std::mutex compile_mutex;
std::condition_variable compile_cv;
vk::PhysicalDevice physical_device;
vk::PhysicalDeviceProperties properties;
@@ -691,6 +702,7 @@ struct vk_device_struct {
uint32_t coopmat_int_k;
bool coopmat2;
bool coopmat2_bf16_support {};
bool coopmat2_decode_vector;
bool pipeline_executable_properties_support {};
@@ -860,6 +872,7 @@ struct vk_device_struct {
vk_pipeline pipeline_argsort_large_f32[num_argsort_pipelines];
vk_pipeline pipeline_topk_f32[num_topk_pipelines];
vk_pipeline pipeline_sum_rows_f32;
vk_pipeline pipeline_fwht_f32[4];
vk_pipeline pipeline_cumsum_f32;
vk_pipeline pipeline_cumsum_small_f32;
vk_pipeline pipeline_cumsum_multipass1_f32;
@@ -1150,6 +1163,13 @@ struct vk_op_push_constants {
float param4;
};
struct vk_op_fwht_push_constants {
uint32_t n_rows;
uint32_t src_offset;
uint32_t dst_offset;
float scale;
};
struct vk_op_count_experts_push_constants {
uint32_t ne00;
uint32_t ne01;
@@ -1718,7 +1738,7 @@ struct ggml_vk_garbage_collector {
};
static void ggml_vk_preallocate_buffers(ggml_backend_vk_context * ctx, vk_context subctx);
static void ggml_vk_load_shaders(vk_device& device);
static void ggml_vk_load_shaders(vk_device& device, vk_pipeline requested = nullptr);
static void ggml_pipeline_allocate_descriptor_sets(ggml_backend_vk_context * ctx);
static bool vk_memory_logger_enabled = false;
@@ -2055,6 +2075,15 @@ template <> void init_pushconst_tensor_offsets(ggml_backend_vk_context * ctx, vk
GGML_UNUSED(src3);
}
template <> void init_pushconst_tensor_offsets(ggml_backend_vk_context * ctx, vk_op_fwht_push_constants &p, const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * src2, const ggml_tensor * src3, ggml_tensor * dst) {
p.src_offset = get_misalign_bytes(ctx, src0) / ggml_type_size(src0->type);
p.dst_offset = get_misalign_bytes(ctx, dst) / ggml_type_size(dst->type);
GGML_UNUSED(src1);
GGML_UNUSED(src2);
GGML_UNUSED(src3);
}
struct ggml_backend_vk_buffer_context {
vk_device_ref device;
vk_buffer dev_buffer;
@@ -2095,9 +2124,9 @@ void vk_memory_logger::log_deallocation(vk_buffer_ref buf_ref) {
const bool device = bool(buf->memory_property_flags & vk::MemoryPropertyFlagBits::eDeviceLocal);
std::string type = device ? "device" : "host";
auto it = allocations.find(buf->buffer);
total_device -= device ? it->second : 0;
total_host -= device ? 0 : it->second;
if (it != allocations.end()) {
total_device -= device ? it->second : 0;
total_host -= device ? 0 : it->second;
VK_LOG_MEMORY(buf->device->name << ": -" << format_size(it->second) << " " << type << " at " << buf->buffer << ". Total device: " << format_size(total_device) << ", total host: " << format_size(total_host));
allocations.erase(it);
} else {
@@ -2176,11 +2205,6 @@ static void ggml_vk_wait_for_fence(ggml_backend_vk_context * ctx) {
ctx->device->device.resetFences({ ctx->fence });
}
// variables to track number of compiles in progress
static uint32_t compile_count = 0;
static std::mutex compile_count_mutex;
static std::condition_variable compile_count_cond;
static constexpr uint32_t kSpvOpCooperativeMatrixLoadTensorNV = 5367;
static constexpr uint32_t kSpvCapabilityCooperativeMatrixDecodeVectorNV = 5447;
static constexpr uint32_t kSpvTensorAddressingDecodeVectorFuncBit = 0x4;
@@ -2475,7 +2499,6 @@ static void ggml_vk_create_pipeline_func(vk_device& device, vk_pipeline& pipelin
std::cerr << "ggml_vulkan: " << e.what() << std::endl;
throw e;
}
pipeline->compiled = true;
if (vk_instance.debug_utils_support) {
vk::DebugUtilsObjectNameInfoEXT duoni;
@@ -2524,14 +2547,13 @@ static void ggml_vk_create_pipeline_func(vk_device& device, vk_pipeline& pipelin
}
}
device->all_pipelines.push_back(pipeline);
{
std::lock_guard<std::mutex> guard(compile_count_mutex);
assert(compile_count > 0);
compile_count--;
std::lock_guard<std::mutex> guard(device->compile_mutex);
device->all_pipelines.push_back(pipeline);
pipeline->compiled = true;
pipeline->compile_pending = false;
}
compile_count_cond.notify_all();
device->compile_cv.notify_all();
}
static void ggml_vk_destroy_pipeline(vk::Device& device, vk_pipeline& pipeline) {
@@ -2547,8 +2569,7 @@ static void ggml_pipeline_request_descriptor_sets(ggml_backend_vk_context *ctx,
VK_LOG_DEBUG("ggml_pipeline_request_descriptor_sets(" << pipeline->name << ", " << n << ")");
ctx->pipeline_descriptor_set_requirements += n;
if (!pipeline->compiled) {
pipeline->needed = true;
ggml_vk_load_shaders(ctx->device);
ggml_vk_load_shaders(ctx->device, pipeline);
}
ggml_pipeline_allocate_descriptor_sets(ctx);
}
@@ -3122,7 +3143,7 @@ struct vk_fa_tuning_params {
};
static bool ggml_vk_flash_attn_scalar_shmem_support(const vk_device& device, const vk_fa_tuning_params& params, uint32_t hsk, uint32_t hsv, bool f32acc, ggml_type k_type, ggml_type v_type);
static bool ggml_vk_flash_attn_coopmat_shmem_support(const vk_device& device, const vk_fa_tuning_params& params, uint32_t hsk, uint32_t hsv, bool f32acc);
static bool ggml_vk_flash_attn_coopmat_shmem_support(const vk_device& device, const vk_fa_tuning_params& params, uint32_t hsk, uint32_t hsv, bool f32acc, ggml_type k_type = GGML_TYPE_F16);
static vk_fa_tuning_params get_fa_tuning_params_scalar(const vk_device& device, uint32_t hsk, uint32_t hsv, uint32_t n_rows, uint32_t n_kv, ggml_type k_type, ggml_type v_type, bool f32acc) {
@@ -3262,6 +3283,13 @@ static vk_fa_tuning_params get_fa_tuning_params(const vk_device& device, uint32_
FaCodePath path = device->coopmat2 ? FA_COOPMAT2 :
device->coopmat1_fa_support ? FA_COOPMAT1 : FA_SCALAR;
if (path == FA_COOPMAT2 && k_type == GGML_TYPE_BF16 && !device->coopmat2_bf16_support) {
path = FA_COOPMAT1;
}
if (path == FA_COOPMAT1 && k_type == GGML_TYPE_BF16 && !device->coopmat_bf16_support) {
path = FA_SCALAR;
}
if (path == FA_COOPMAT1 && device->architecture == vk_device_architecture::NVIDIA_TURING) {
// Nvidia compiler bug, see https://github.com/ggml-org/llama.cpp/pull/19075#issuecomment-3820716090
path = FA_SCALAR;
@@ -3271,7 +3299,7 @@ static vk_fa_tuning_params get_fa_tuning_params(const vk_device& device, uint32_
bool shape_ok = (f32acc && device->coopmat_support_16x16x16_f32acc) ||
(!f32acc && device->coopmat_support_16x16x16_f16acc);
const vk_fa_tuning_params params = get_fa_tuning_params_coopmat1(device, hsk, hsv, n_rows, n_kv, k_type, v_type, f32acc);
bool shmem_ok = ggml_vk_flash_attn_coopmat_shmem_support(device, params, hsk, hsv, f32acc);
bool shmem_ok = ggml_vk_flash_attn_coopmat_shmem_support(device, params, hsk, hsv, f32acc, k_type);
if (!shape_ok || !shmem_ok) {
path = FA_SCALAR;
@@ -3317,8 +3345,8 @@ static vk_fa_pipeline_state get_fa_pipeline_state(const vk_device& device, const
static std::vector<uint32_t> get_fa_spec_constants(const vk_fa_pipeline_state& state) {
const auto fa_block_bytes = [](ggml_type t) -> uint32_t {
// decodeBufF32 uses a block of vec4s for a better memory access pattern.
return t == GGML_TYPE_F32 ? 16u : (uint32_t) ggml_type_size(t);
if (t == GGML_TYPE_F32) return 16u;
return (uint32_t) ggml_type_size(t);
};
return {
/* 0 WorkGroupSize */ state.workgroup_size,
@@ -3540,10 +3568,26 @@ static bool ggml_vk_fa_scalar_uses_mmq(const vk_device& device, ggml_type k_type
#endif
}
static void ggml_vk_load_shaders(vk_device& device) {
// load_shaders walks the pipeline list under compile_mutex and either claims
// the requested pipeline for compilation or, if another thread is already
// compiling it, drops the lock and waits on compile_cv. Compiles themselves
// run unlocked.
struct CompileTask {
vk_pipeline pipeline;
size_t spv_size;
const void * spv_data;
std::string entrypoint;
uint32_t parameter_count;
std::array<uint32_t, 3> wg_denoms;
std::vector<uint32_t> specialization_constants;
bool disable_robustness;
bool require_full_subgroups;
uint32_t required_subgroup_size;
};
static void ggml_vk_load_shaders(vk_device& device, vk_pipeline requested) {
VK_LOG_DEBUG("ggml_vk_load_shaders(" << device->name << ")");
std::lock_guard<std::recursive_mutex> guard(device->mutex);
// some shaders have a minimum subgroup size
const uint32_t subgroup_size_8 = std::max(device->subgroup_size, 8u);
const uint32_t subgroup_size_16 = std::max(device->subgroup_size, 16u);
@@ -3573,6 +3617,15 @@ static void ggml_vk_load_shaders(vk_device& device) {
l_mmqid_wg_denoms, m_mmqid_wg_denoms, s_mmqid_wg_denoms;
uint32_t l_align, m_align, s_align;
vk_pipeline wait_pipeline;
CompileTask claimed_task {};
bool has_claimed_task = false;
// The rest of the walk reads and writes shared device state, so hold the
// lock until we're done deciding what to compile.
std::unique_lock<std::mutex> compile_lock(device->compile_mutex);
if (device->coopmat2) {
// spec constants and tile sizes for non-quant matmul/matmul_id
l_warptile = { 256, 128, 256, 64, 1 };
@@ -3758,7 +3811,6 @@ static void ggml_vk_load_shaders(vk_device& device) {
device->pipeline_matmul_id_bf16 = std::make_shared<vk_matmul_pipeline_struct>();
}
std::vector<std::future<void>> compiles;
auto const &ggml_vk_create_pipeline = [&](vk_device& device, vk_pipeline& base_pipeline, const char *name, size_t spv_size, const void* spv_data, const char *entrypoint,
uint32_t parameter_count, uint32_t push_constant_size, std::array<uint32_t, 3> wg_denoms, const std::vector<uint32_t>& specialization_constants,
uint32_t align, bool disable_robustness = false, bool require_full_subgroups = false, uint32_t required_subgroup_size = 0) {
@@ -3792,23 +3844,33 @@ static void ggml_vk_load_shaders(vk_device& device) {
#endif
}
if (!pipeline->needed || pipeline->compiled) {
// We only care about the pipeline this call asked for; the rest
// (including the 64-bit indexing variant) are handled by their
// own request_descriptor_sets / load_shaders calls.
if (pipeline.get() != requested.get()) {
continue;
}
// TODO: We're no longer benefitting from the async compiles (shaders are
// compiled individually, as needed) and this complexity can be removed.
{
// wait until fewer than N compiles are in progress
uint32_t N = std::max(1u, std::thread::hardware_concurrency());
std::unique_lock<std::mutex> guard(compile_count_mutex);
while (compile_count >= N) {
compile_count_cond.wait(guard);
}
compile_count++;
if (pipeline->compiled) {
continue;
}
compiles.push_back(std::async(ggml_vk_create_pipeline_func, std::ref(device), std::ref(pipeline), spv_size, spv_data, entrypoint,
parameter_count, wg_denoms, specialization_constants, disable_robustness, require_full_subgroups, required_subgroup_size));
wait_pipeline = pipeline;
if (!pipeline->compile_pending) {
pipeline->compile_pending = true;
claimed_task.pipeline = pipeline;
claimed_task.spv_size = spv_size;
claimed_task.spv_data = spv_data;
claimed_task.entrypoint = entrypoint;
claimed_task.parameter_count = parameter_count;
claimed_task.wg_denoms = wg_denoms;
claimed_task.specialization_constants = specialization_constants;
claimed_task.disable_robustness = disable_robustness;
claimed_task.require_full_subgroups = require_full_subgroups;
claimed_task.required_subgroup_size = required_subgroup_size;
has_claimed_task = true;
}
}
};
@@ -3832,10 +3894,16 @@ static void ggml_vk_load_shaders(vk_device& device) {
const uint32_t fa_sgs = fa.first.subgroup_size;
const bool fa_ds = fa.first.subgroup_size == 0;
const bool bf16_kv = fa.first.k_type == GGML_TYPE_BF16;
const bool use_mmq = ggml_vk_fa_scalar_uses_mmq(device, fa.first.k_type);
const void * spv_data = nullptr;
size_t spv_size = 0;
if (use_mmq) {
const char *name = nullptr;
if (bf16_kv) {
spv_data = flash_attn_f32_f16_fp32_data;
spv_size = flash_attn_f32_f16_fp32_len;
name = aligned ? "flash_attn_f32_bf16_aligned" : "flash_attn_f32_bf16";
} else if (use_mmq) {
#if defined(GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT)
if (device->fp16) {
if (f32acc) { spv_data = flash_attn_f32_f16_int8_data; spv_size = flash_attn_f32_f16_int8_len; }
@@ -3845,6 +3913,7 @@ static void ggml_vk_load_shaders(vk_device& device) {
spv_size = flash_attn_f32_f16_fp32_int8_len;
}
#endif
name = aligned ? "flash_attn_f32_f16_aligned" : "flash_attn_f32_f16";
} else {
if (device->fp16) {
if (f32acc) { spv_data = flash_attn_f32_f16_data; spv_size = flash_attn_f32_f16_len; }
@@ -3853,8 +3922,8 @@ static void ggml_vk_load_shaders(vk_device& device) {
spv_data = flash_attn_f32_f16_fp32_data;
spv_size = flash_attn_f32_f16_fp32_len;
}
name = aligned ? "flash_attn_f32_f16_aligned" : "flash_attn_f32_f16";
}
const char *name = aligned ? "flash_attn_f32_f16_aligned" : "flash_attn_f32_f16";
ggml_vk_create_pipeline(device, fa.second, name, spv_size, spv_data, "main", 7,
sizeof(vk_flash_attn_push_constants), {Br, 1, 1},
get_fa_spec_constants(fa.first), aligned ? Bc : 1, true,
@@ -3872,11 +3941,25 @@ static void ggml_vk_load_shaders(vk_device& device) {
const uint32_t fa_sgs = fa.first.subgroup_size;
const bool fa_ds = fa.first.subgroup_size == 0;
const bool bf16_kv = fa.first.k_type == GGML_TYPE_BF16;
const void * spv_data;
size_t spv_size;
if (f32acc) { spv_data = flash_attn_f32_f16_cm1_data; spv_size = flash_attn_f32_f16_cm1_len; }
else { spv_data = flash_attn_f32_f16_f16acc_cm1_data; spv_size = flash_attn_f32_f16_f16acc_cm1_len; }
const char *name = aligned ? "flash_attn_f32_f16_aligned_cm1" : "flash_attn_f32_f16_cm1";
const char *name;
if (bf16_kv) {
#if defined(VK_KHR_shader_bfloat16) && defined(GGML_VULKAN_BFLOAT16_GLSLC_SUPPORT)
if (!device->coopmat_bf16_support) continue;
spv_data = flash_attn_f32_f16_bf16_cm1_data;
spv_size = flash_attn_f32_f16_bf16_cm1_len;
name = aligned ? "flash_attn_f32_bf16_aligned_cm1" : "flash_attn_f32_bf16_cm1";
#else
continue;
#endif
} else {
if (f32acc) { spv_data = flash_attn_f32_f16_cm1_data; spv_size = flash_attn_f32_f16_cm1_len; }
else { spv_data = flash_attn_f32_f16_f16acc_cm1_data; spv_size = flash_attn_f32_f16_f16acc_cm1_len; }
name = aligned ? "flash_attn_f32_f16_aligned_cm1" : "flash_attn_f32_f16_cm1";
}
ggml_vk_create_pipeline(device, fa.second, name, spv_size, spv_data, "main", 7,
sizeof(vk_flash_attn_push_constants), {Br, 1, 1},
get_fa_spec_constants(fa.first), aligned ? Bc : 1, true,
@@ -3894,10 +3977,20 @@ static void ggml_vk_load_shaders(vk_device& device) {
const bool aligned = fa.first.aligned;
const bool f32acc = fa.first.f32acc;
const bool bf16_kv = fa.first.k_type == GGML_TYPE_BF16;
const void * spv_data;
size_t spv_size;
const char * name;
if (aligned) {
if (bf16_kv) {
#if defined(VK_KHR_shader_bfloat16) && defined(GGML_VULKAN_BFLOAT16_GLSLC_SUPPORT)
if (!device->coopmat2_bf16_support) continue;
spv_data = flash_attn_f32_f16_bf16_cm2_data;
spv_size = flash_attn_f32_f16_bf16_cm2_len;
name = aligned ? "flash_attn_f32_bf16_aligned_cm2" : "flash_attn_f32_bf16_cm2";
#else
continue;
#endif
} else if (aligned) {
if (f32acc) { spv_data = flash_attn_f32_f16_cm2_data; spv_size = flash_attn_f32_f16_cm2_len; name = "flash_attn_f32_f16_aligned_f32acc_cm2"; }
else { spv_data = flash_attn_f32_f16_f16acc_cm2_data; spv_size = flash_attn_f32_f16_f16acc_cm2_len; name = "flash_attn_f32_f16_aligned_f16acc_cm2"; }
} else {
@@ -4982,6 +5075,16 @@ static void ggml_vk_load_shaders(vk_device& device) {
ggml_vk_create_pipeline(device, device->pipeline_argmax_f32, "argmax_f32", argmax_f32_len, argmax_f32_data, "main", 2, sizeof(vk_op_push_constants), {1, 1, 1}, { device->subgroup_size }, 1);
ggml_vk_create_pipeline(device, device->pipeline_sum_rows_f32, "sum_rows_f32", sum_rows_f32_len, sum_rows_f32_data, "main", 2, sizeof(vk_op_sum_rows_push_constants), {1, 1, 1}, { device->subgroup_size }, 1);
// Intel Arc B390 was observed segfaulting with this shader.
if (device->subgroup_basic && device->subgroup_shuffle && device->vendor_id != VK_VENDOR_ID_INTEL) {
int idx = 0;
for (uint32_t n : {64, 128, 256, 512}) {
if (device->subgroup_size <= n) {
ggml_vk_create_pipeline(device, device->pipeline_fwht_f32[idx], "fwht_f32", fwht_f32_len, fwht_f32_data, "main", 2, sizeof(vk_op_fwht_push_constants), {1, 1, 1}, { device->subgroup_size, n }, 1, true, true, device->subgroup_size);
}
++idx;
}
}
const uint32_t cumsum_elem_per_thread = (device->vendor_id == VK_VENDOR_ID_AMD || device->vendor_id == VK_VENDOR_ID_INTEL) ? 2 : 4;
ggml_vk_create_pipeline(device, device->pipeline_cumsum_f32, "cumsum_f32", cumsum_f32_len, cumsum_f32_data, "main", 2, sizeof(vk_op_sum_rows_push_constants), {1, 1, 1}, { 256, device->subgroup_size, cumsum_elem_per_thread }, 1, true, true, device->subgroup_size);
@@ -5264,8 +5367,25 @@ static void ggml_vk_load_shaders(vk_device& device) {
}
}
for (auto &c : compiles) {
c.wait();
// Drop compile_mutex so other threads can walk while we compile.
compile_lock.unlock();
// Compile what we claimed; create_pipeline_func reacquires compile_mutex
// at the end to flip compile_pending/compiled and notify waiters.
if (has_claimed_task) {
auto & task = claimed_task;
ggml_vk_create_pipeline_func(device, task.pipeline, task.spv_size, task.spv_data,
task.entrypoint, task.parameter_count, task.wg_denoms,
task.specialization_constants, task.disable_robustness,
task.require_full_subgroups, task.required_subgroup_size);
}
// Another thread may be compiling the pipeline we need; block on it here.
if (wait_pipeline) {
std::unique_lock<std::mutex> wait_lock(device->compile_mutex);
device->compile_cv.wait(wait_lock, [&] {
return wait_pipeline->compiled.load();
});
}
}
@@ -5757,46 +5877,72 @@ static vk_device ggml_vk_get_device(size_t idx) {
found_fp16_256 = false,
found_fp32_128 = false,
found_fp32_256 = false;
bool found_bf16_128 = false,
found_bf16_256 = false;
// need to support fp16*fp16 with fp16/fp32 accumulator, for workgroupsize 128
// with 32x16x16 and 256 with 32x32x16.
for (auto &prop : flexible_dimensions) {
if (prop.saturatingAccumulation == VK_FALSE &&
prop.scope == VK_SCOPE_WORKGROUP_KHR &&
prop.AType == VK_COMPONENT_TYPE_FLOAT16_KHR &&
prop.BType == VK_COMPONENT_TYPE_FLOAT16_KHR) {
prop.scope == VK_SCOPE_WORKGROUP_KHR) {
if (prop.workgroupInvocations == 128 &&
prop.MGranularity <= 32 &&
prop.NGranularity <= 16 &&
prop.KGranularity <= 16) {
if (prop.CType == VK_COMPONENT_TYPE_FLOAT16_KHR &&
prop.ResultType == VK_COMPONENT_TYPE_FLOAT16_KHR) {
found_fp16_128 = true;
if (prop.AType == VK_COMPONENT_TYPE_FLOAT16_KHR &&
prop.BType == VK_COMPONENT_TYPE_FLOAT16_KHR) {
if (prop.workgroupInvocations == 128 &&
prop.MGranularity <= 32 &&
prop.NGranularity <= 16 &&
prop.KGranularity <= 16) {
if (prop.CType == VK_COMPONENT_TYPE_FLOAT16_KHR &&
prop.ResultType == VK_COMPONENT_TYPE_FLOAT16_KHR) {
found_fp16_128 = true;
}
if (prop.CType == VK_COMPONENT_TYPE_FLOAT32_KHR &&
prop.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR) {
found_fp32_128 = true;
}
}
if (prop.CType == VK_COMPONENT_TYPE_FLOAT32_KHR &&
prop.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR) {
found_fp32_128 = true;
if (prop.workgroupInvocations == 256 &&
prop.MGranularity <= 32 &&
prop.NGranularity <= 32 &&
prop.KGranularity <= 16) {
if (prop.CType == VK_COMPONENT_TYPE_FLOAT16_KHR &&
prop.ResultType == VK_COMPONENT_TYPE_FLOAT16_KHR) {
found_fp16_256 = true;
}
if (prop.CType == VK_COMPONENT_TYPE_FLOAT32_KHR &&
prop.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR) {
found_fp32_256 = true;
}
}
}
if (prop.workgroupInvocations == 256 &&
prop.MGranularity <= 32 &&
prop.NGranularity <= 32 &&
prop.KGranularity <= 16) {
if (prop.CType == VK_COMPONENT_TYPE_FLOAT16_KHR &&
prop.ResultType == VK_COMPONENT_TYPE_FLOAT16_KHR) {
found_fp16_256 = true;
#if defined(VK_KHR_shader_bfloat16) && defined(GGML_VULKAN_BFLOAT16_GLSLC_SUPPORT)
if (prop.AType == VK_COMPONENT_TYPE_BFLOAT16_KHR &&
prop.BType == VK_COMPONENT_TYPE_BFLOAT16_KHR &&
prop.CType == VK_COMPONENT_TYPE_FLOAT32_KHR &&
prop.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR) {
if (prop.workgroupInvocations == 128 &&
prop.MGranularity <= 32 &&
prop.NGranularity <= 16 &&
prop.KGranularity <= 16) {
found_bf16_128 = true;
}
if (prop.CType == VK_COMPONENT_TYPE_FLOAT32_KHR &&
prop.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR) {
found_fp32_256 = true;
if (prop.workgroupInvocations == 256 &&
prop.MGranularity <= 32 &&
prop.NGranularity <= 32 &&
prop.KGranularity <= 16) {
found_bf16_256 = true;
}
}
#endif
}
}
if (found_fp16_128 && found_fp16_256 &&
found_fp32_128 && found_fp32_256 &&
coopmat2_props.cooperativeMatrixFlexibleDimensionsMaxDimension >= 512) {
device->coopmat2 = true;
device->coopmat2_bf16_support = found_bf16_128 && found_bf16_256;
device->coopmat2_decode_vector = coopmat2_decode_vector_support && coopmat2_decode_vector_features.cooperativeMatrixDecodeVector;
}
}
@@ -6918,7 +7064,7 @@ static void * ggml_vk_host_malloc(vk_device& device, size_t size) {
return nullptr;
}
std::lock_guard<std::recursive_mutex> guard(device->mutex);
std::lock_guard<std::shared_mutex> guard(device->pinned_memory_mutex);
device->pinned_memory.push_back(std::make_tuple(buf->ptr, size, buf));
return buf->ptr;
@@ -6929,7 +7075,7 @@ static void ggml_vk_host_free(vk_device& device, void* ptr) {
return;
}
VK_LOG_MEMORY("ggml_vk_host_free(" << ptr << ")");
std::lock_guard<std::recursive_mutex> guard(device->mutex);
std::lock_guard<std::shared_mutex> guard(device->pinned_memory_mutex);
vk_buffer buf;
size_t index;
@@ -6953,7 +7099,7 @@ static void ggml_vk_host_free(vk_device& device, void* ptr) {
}
static void ggml_vk_host_get(const vk_device& device, const void * ptr, vk_buffer& buf, size_t& buf_offset) {
std::lock_guard<std::recursive_mutex> guard(device->mutex);
std::shared_lock<std::shared_mutex> guard(device->pinned_memory_mutex);
buf = nullptr;
buf_offset = 0;
for (size_t i = 0; i < device->pinned_memory.size(); i++) {
@@ -7074,13 +7220,6 @@ static void ggml_vk_dispatch_pipeline(ggml_backend_vk_context* ctx, vk_context&
subctx->s->buffer->buf.dispatch(wg0, wg1, wg2);
}
static void ggml_vk_end_submission(vk_submission& s, std::vector<vk_semaphore> wait_semaphores, std::vector<vk_semaphore> signal_semaphores) {
s.buffer->buf.end();
s.wait_semaphores = std::move(wait_semaphores);
s.signal_semaphores = std::move(signal_semaphores);
}
static void ggml_vk_ctx_end(vk_context& ctx) {
VK_LOG_DEBUG("ggml_vk_ctx_end(" << ctx << ", " << ctx->seqs.size() << ")");
if (ctx->s == nullptr) {
@@ -7233,7 +7372,7 @@ static void ggml_vk_buffer_write_nc_async(ggml_backend_vk_context * ctx, vk_cont
const uint64_t s_off = buf_offset + i3*nb3 + i2*nb2 + i1*nb1;
const uint64_t d_off = offset + i3*dstnb3 + i2*dstnb2 + i1*dstnb1;
for (uint64_t i0 = 0; i0 < ne0; i0++) {
slices.push_back({ s_off + i1*nb0, d_off + i0*dstnb0, dstnb0 });
slices.push_back({ s_off + i0*nb0, d_off + i0*dstnb0, dstnb0 });
}
}
}
@@ -8251,8 +8390,10 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
return false;
}
// General performance issue with q3_k and q6_k due to 2-byte alignment
if (src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
// q6_k only has 2-byte alignment which makes it somewhat problematic,
// using MMVQ is only a win on Intel.
bool mmvq_q6 = device->vendor_id == VK_VENDOR_ID_INTEL;
if (src0_type == GGML_TYPE_Q6_K && !mmvq_q6) {
return false;
}
@@ -8264,7 +8405,7 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
// Quantization overhead is not worth it for small k
switch (device->vendor_id) {
case VK_VENDOR_ID_NVIDIA:
if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_IQ1_S || src0_type == GGML_TYPE_IQ1_M) {
if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_IQ1_S || src0_type == GGML_TYPE_IQ1_M) {
return true;
}
@@ -8291,9 +8432,16 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
return true;
}
case VK_VENDOR_ID_INTEL:
if (device->architecture == vk_device_architecture::INTEL_XE2) {
if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
return true;
}
}
if (device->driver_id == vk::DriverId::eIntelProprietaryWindows) {
// Intel Windows proprietary driver MMVQ performance is worse than fp16, see
// https://github.com/ggml-org/llama.cpp/issues/17628
// Intel Windows proprietary driver MMVQ performance for !Q2/Q3/Q6 is worse than fp16,
// see https://github.com/ggml-org/llama.cpp/issues/17628 and
// https://github.com/ggml-org/llama.cpp/pull/23056
return false;
}
@@ -8741,6 +8889,68 @@ static void ggml_vk_mul_mat_vec_nc_f16_f32(ggml_backend_vk_context * ctx, vk_con
}, pc, { (uint32_t)ne03, (uint32_t)ne01, (uint32_t)ne12 });
}
static int ggml_vk_fwht_pipeline_idx(int64_t n) {
switch (n) {
case 64: return 0;
case 128: return 1;
case 256: return 2;
case 512: return 3;
default: return -1;
}
}
static bool ggml_vk_can_use_fwht(const ggml_backend_vk_context * ctx, const ggml_tensor * src1, const ggml_tensor * dst) {
if (ctx->num_additional_fused_ops != 0) {
return false;
}
if (ggml_get_op_params_i32(dst, 1) != GGML_HINT_SRC0_IS_HADAMARD) {
return false;
}
const int idx = ggml_vk_fwht_pipeline_idx(src1->ne[0]);
if (idx < 0 || ctx->device->pipeline_fwht_f32[idx] == nullptr) {
return false;
}
if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
return false;
}
if (!ggml_is_contiguous(src1)) {
return false;
}
GGML_ASSERT(ggml_is_contiguous(dst));
return true;
}
static void ggml_vk_fwht(ggml_backend_vk_context * ctx, vk_context& subctx, const ggml_tensor * src, ggml_tensor * dst) {
const int idx = ggml_vk_fwht_pipeline_idx(src->ne[0]);
vk_pipeline pipeline = ctx->device->pipeline_fwht_f32[idx];
const uint32_t rows_per_workgroup = 4;
const uint32_t n_rows = (uint32_t)ggml_nrows(src);
const uint32_t max_workgroups_x = ctx->device->properties.limits.maxComputeWorkGroupCount[0];
const uint32_t total_workgroups = CEIL_DIV(n_rows, rows_per_workgroup);
const uint32_t workgroups_x = std::min(total_workgroups, max_workgroups_x);
ggml_pipeline_request_descriptor_sets(ctx, pipeline, 1);
const vk_subbuffer src_buf = ggml_vk_tensor_subbuffer(ctx, src, true);
const vk_subbuffer dst_buf = ggml_vk_tensor_subbuffer(ctx, dst, true);
vk_op_fwht_push_constants pc = {
n_rows,
0,
0,
1.0f / std::sqrt((float)src->ne[0]),
};
init_pushconst_tensor_offsets(ctx, pc, src, nullptr, nullptr, nullptr, dst);
ggml_vk_dispatch_pipeline(ctx, subctx, pipeline, { src_buf, dst_buf }, pc, { workgroups_x, 1, 1 });
}
static void ggml_vk_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, const struct ggml_cgraph * cgraph, int node_idx) {
ggml_tensor * dst = cgraph->nodes[node_idx];
ggml_tensor * src0 = dst->src[0];
@@ -8774,6 +8984,8 @@ static void ggml_vk_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, c
m_offset += cur_M_size;
}
} else if (ggml_vk_can_use_fwht(ctx, src1, dst)) {
ggml_vk_fwht(ctx, subctx, src1, dst);
} else if (src0->type == GGML_TYPE_F16 && ggml_is_permuted(src0) && ggml_is_permuted(src1) && dst->ne[1] == 1 &&
// detect 0213 permutation, and batch size of 1
src0->nb[0] <= src0->nb[2] &&
@@ -9357,7 +9569,8 @@ static bool ggml_vk_flash_attn_scalar_shmem_support(const vk_device& device, con
const uint32_t Br = params.block_rows;
const uint32_t Bc = params.block_cols;
const uint32_t float_type_size = device->fp16 ? sizeof(ggml_fp16_t) : sizeof(float);
// BF16 uses the fp32 shader (FLOAT_TYPE=float)
const uint32_t float_type_size = (device->fp16 && k_type != GGML_TYPE_BF16) ? sizeof(ggml_fp16_t) : sizeof(float);
const bool mmq = ggml_vk_fa_scalar_uses_mmq(device, k_type);
@@ -9398,7 +9611,7 @@ static bool ggml_vk_flash_attn_scalar_shmem_support(const vk_device& device, con
return supported;
}
static bool ggml_vk_flash_attn_coopmat_shmem_support(const vk_device& device, const vk_fa_tuning_params& params, uint32_t hsk, uint32_t hsv, bool f32acc) {
static bool ggml_vk_flash_attn_coopmat_shmem_support(const vk_device& device, const vk_fa_tuning_params& params, uint32_t hsk, uint32_t hsv, bool f32acc, ggml_type k_type) {
// Needs to be kept up to date on shader changes
const uint32_t Br = params.block_rows;
const uint32_t Bc = params.block_cols;
@@ -9428,8 +9641,10 @@ static bool ggml_vk_flash_attn_coopmat_shmem_support(const vk_device& device, co
const uint32_t vsh_stride = MatBc / 4 * row_split;
const uint32_t ksh = ((kvshstride >= vsh_stride) ? (Bc * kvshstride) : (Bc * vsh_stride)) * f16vec4;
// BF16 PVMat accumulator is f32 (no bf16 accumulator support), so pvsh is vec4 (16 bytes)
const uint32_t pvsh_elem_size = (k_type == GGML_TYPE_BF16) ? 16u : f16vec4;
const uint32_t osh_stride = params.row_split * MatBr / 4;
const uint32_t pvsh = MatBc * osh_stride * f16vec4;
const uint32_t pvsh = MatBc * osh_stride * pvsh_elem_size;
const uint32_t slope = Br * acctype;
@@ -9498,7 +9713,7 @@ static void ggml_vk_flash_attn(ggml_backend_vk_context * ctx, vk_context& subctx
uint32_t workgroups_y = (uint32_t)neq2;
uint32_t workgroups_z = (uint32_t)neq3;
const bool f32acc = !ctx->device->fp16 || dst->op_params[3] == GGML_PREC_F32;
const bool f32acc = !ctx->device->fp16 || dst->op_params[3] == GGML_PREC_F32 || k->type == GGML_TYPE_BF16;
// For scalar/coopmat1 FA, we can use the "large" size to accommodate qga.
// For coopmat2 FA, we always use the small size (which is still pretty large for gqa).
@@ -9559,7 +9774,7 @@ static void ggml_vk_flash_attn(ggml_backend_vk_context * ctx, vk_context& subctx
vk_pipeline pipeline = nullptr;
{
std::lock_guard<std::recursive_mutex> guard(ctx->device->mutex);
std::lock_guard<std::mutex> guard(ctx->device->compile_mutex);
auto &pipelines = ctx->device->pipeline_flash_attn_f32_f16;
auto it = pipelines.find(fa_pipeline_state);
if (it != pipelines.end()) {
@@ -9623,13 +9838,15 @@ static void ggml_vk_flash_attn(ggml_backend_vk_context * ctx, vk_context& subctx
vk_pipeline pipeline_fa_mask_opt = nullptr;
if (use_mask_opt) {
std::lock_guard<std::recursive_mutex> guard(ctx->device->mutex);
auto &pipelines = ctx->device->pipeline_fa_mask_opt;
auto it = pipelines.find({Br, Bc});
if (it != pipelines.end()) {
pipeline_fa_mask_opt = it->second;
} else {
pipelines[{Br, Bc}] = pipeline_fa_mask_opt = std::make_shared<vk_pipeline_struct>();
{
std::lock_guard<std::mutex> guard(ctx->device->compile_mutex);
auto &pipelines = ctx->device->pipeline_fa_mask_opt;
auto it = pipelines.find({Br, Bc});
if (it != pipelines.end()) {
pipeline_fa_mask_opt = it->second;
} else {
pipelines[{Br, Bc}] = pipeline_fa_mask_opt = std::make_shared<vk_pipeline_struct>();
}
}
assert(pipeline_fa_mask_opt);
ggml_pipeline_request_descriptor_sets(ctx, pipeline_fa_mask_opt, 1);
@@ -10163,7 +10380,7 @@ static vk_pipeline ggml_vk_op_get_pipeline(ggml_backend_vk_context * ctx, const
vk_pipeline pipeline = nullptr;
{
std::lock_guard<std::recursive_mutex> guard(ctx->device->mutex);
std::lock_guard<std::mutex> guard(ctx->device->compile_mutex);
auto it = ctx->device->pipeline_solve_tri_f32.find(solve_tri_pipeline_state);
if (it != ctx->device->pipeline_solve_tri_f32.end()) {
pipeline = it->second;
@@ -10322,7 +10539,7 @@ static vk_pipeline ggml_vk_op_get_pipeline(ggml_backend_vk_context * ctx, const
vk_pipeline pipeline = nullptr;
{
std::lock_guard<std::recursive_mutex> guard(ctx->device->mutex);
std::lock_guard<std::mutex> guard(ctx->device->compile_mutex);
auto it = pipelines->find(conv2d_pipeline_state);
if (it != pipelines->end()) {
pipeline = it->second;
@@ -14351,12 +14568,6 @@ static const char * ggml_backend_vk_host_buffer_type_name(ggml_backend_buffer_ty
UNUSED(buft);
}
static const char * ggml_backend_vk_host_buffer_name(ggml_backend_buffer_t buffer) {
return GGML_VK_NAME "_Host";
UNUSED(buffer);
}
static void ggml_backend_vk_host_buffer_free_buffer(ggml_backend_buffer_t buffer) {
VK_LOG_MEMORY("ggml_backend_vk_host_buffer_free_buffer()");
ggml_vk_host_free(vk_instance.devices[0], buffer->context);
@@ -16309,6 +16520,7 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
switch (t) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
case GGML_TYPE_BF16:
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q5_1:
case GGML_TYPE_Q5_0:
@@ -16324,6 +16536,9 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
if (!fa_kv_ok(op->src[1]->type) || !fa_kv_ok(op->src[2]->type)) {
return false;
}
if ((op->src[1]->type == GGML_TYPE_BF16) != (op->src[2]->type == GGML_TYPE_BF16)) {
return false;
}
if (!coopmat2 && !(device->subgroup_shuffle && device->subgroup_vote)) {
// scalar/coopmat1 FA uses subgroupShuffle/subgroupAll
return false;
@@ -97,8 +97,17 @@ layout (binding = 6) readonly buffer MO {uint32_t data_mask_opt[];};
#define FA_TYPE_Q5_0 6u
#define FA_TYPE_Q5_1 7u
#define FA_TYPE_Q8_0 8u
#define FA_TYPE_BF16 30u
#define FA_TYPE_Q1_0 41u
#if defined(BFLOAT16)
#define O_TYPE float
#define O_TYPEV4 vec4
#else
#define O_TYPE FLOAT_TYPE
#define O_TYPEV4 FLOAT_TYPEV4
#endif
// Number of matrix elements per buffer block, derived from the K/V type spec
// constant. F32 is treated as a vec4 "block" of 4 floats. F16 uses block size 1
// and bypasses the dequant path entirely. Quants follow their ggml block sizes.
@@ -111,6 +120,7 @@ uint fa_block_elems(uint ty) {
case FA_TYPE_Q5_0: return uint(QUANT_K_Q5_0);
case FA_TYPE_Q5_1: return uint(QUANT_K_Q5_1);
case FA_TYPE_Q8_0: return uint(QUANT_K_Q8_0);
case FA_TYPE_BF16: return 1u;
case FA_TYPE_Q1_0: return uint(QUANT_K_Q1_0); // cm2-only, harmless elsewhere
default: return 1u;
}
@@ -248,7 +258,7 @@ const float FATTN_KQ_MAX_OFFSET = 3.0f*0.6931f;
// Store the output when doing grouped query attention.
// Rows index by Q's dimension 2, and the first N rows are valid.
void gqaStore(const in uint32_t r, const in uint32_t c, const in FLOAT_TYPEV4 elems, const in uint32_t o_offset, const in uint32_t iq2, const in uint32_t N)
void gqaStore(const in uint32_t r, const in uint32_t c, const in O_TYPEV4 elems, const in uint32_t o_offset, const in uint32_t iq2, const in uint32_t N)
{
uint32_t offset = (iq2 + r) * HSV / 4 + c;
data_ov4[o_offset + offset] = D_TYPEV4(elems);
@@ -6,6 +6,10 @@
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#if defined(BFLOAT16)
#extension GL_EXT_bfloat16 : enable
#endif
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable
#extension GL_KHR_shader_subgroup_vote : enable
@@ -14,7 +18,9 @@
#include "types.glsl"
#include "flash_attn_base.glsl"
#if !defined(BFLOAT16)
#include "flash_attn_dequant.glsl"
#endif
// These need to be supported N,M values for a MatBc x MatBr x 16 coopmatmuladd
const uint32_t MatBr = 16;
@@ -27,32 +33,32 @@ const uint32_t cols_per_thread = Bc / cols_per_iter;
layout (binding = 0) readonly buffer Q {float data_q[];};
layout (binding = 0) readonly buffer QV4 {vec4 data_qv4[];};
layout (binding = 1) readonly buffer K {float16_t data_k[];};
layout (binding = 1) readonly buffer KV4 {f16vec4 data_kv4[];};
layout (binding = 2) readonly buffer V {float16_t data_v[];};
layout (binding = 2) readonly buffer VV4 {f16vec4 data_vv4[];};
layout (binding = 1) readonly buffer K {FLOAT_TYPE data_k[];};
layout (binding = 1) readonly buffer KV4 {FLOAT_TYPEV4 data_kv4[];};
layout (binding = 2) readonly buffer V {FLOAT_TYPE data_v[];};
layout (binding = 2) readonly buffer VV4 {FLOAT_TYPEV4 data_vv4[];};
layout (binding = 3) readonly buffer M {float16_t data_m[];};
shared float tmpsh[row_split];
const uint32_t qstride = HSK_pad / 4 + 2; // in units of f16vec4
shared f16vec4 Qf[Br * qstride];
const uint32_t qstride = HSK_pad / 4 + 2;
shared FLOAT_TYPEV4 Qf[Br * qstride];
const uint psh_stride = Br / 4 + 2;
shared f16vec4 Psh[Bc * psh_stride];
shared FLOAT_TYPEV4 Psh[Bc * psh_stride];
// Avoid padding for hsk==256 to make it fit in 48KB shmem.
const uint32_t sfshstride = (HSK <= 128) ? (Br / 4 + 2) : Br / 4;
shared ACC_TYPEV4 sfsh[Bc * sfshstride];
const uint32_t D_pad = HSK_pad > HSV_pad ? HSK_pad : HSV_pad;
const uint32_t kvsh_stride = (SHMEM_STAGING != 0 ? D_pad : MatBr) / 4 + 2; // in units of f16vec4
const uint32_t kvsh_stride = (SHMEM_STAGING != 0 ? D_pad : MatBr) / 4 + 2;
const uint v_cols = MatBc / 4 * row_split; // total cols, 4 vec4s per MatBc * number of subgroups
const uint vsh_stride = v_cols;
shared f16vec4 kvsh[(kvsh_stride >= vsh_stride) ? (Bc * kvsh_stride) : (Bc * vsh_stride)];
shared FLOAT_TYPEV4 kvsh[(kvsh_stride >= vsh_stride) ? (Bc * kvsh_stride) : (Bc * vsh_stride)];
const uint32_t osh_stride = row_split * MatBr / 4;
shared f16vec4 pvsh[MatBc * osh_stride];
shared O_TYPEV4 pvsh[MatBc * osh_stride];
shared ACC_TYPE slope[Br];
@@ -76,7 +82,7 @@ void main() {
if ((HSK % 16) != 0) {
[[unroll]] for (uint i = 0; i < Br * qstride; i += gl_WorkGroupSize.x) {
if (i + tid < Br * qstride) {
Qf[i + tid] = f16vec4(0);
Qf[i + tid] = FLOAT_TYPEV4(0);
}
}
barrier();
@@ -89,15 +95,15 @@ void main() {
uint32_t r = (idx + tid) / (HSK / 4);
if (r < Br && d < HSK / 4 &&
i * Br + r < N) {
Qf[r * qstride + d] = f16vec4(data_qv4[q_offset / 4 + (i * Br + r) * q_stride / 4 + d] * p.scale);
Qf[r * qstride + d] = FLOAT_TYPEV4(data_qv4[q_offset / 4 + (i * Br + r) * q_stride / 4 + d] * p.scale);
}
}
barrier();
f16vec4 Of[rows_per_thread][d_per_thread];
O_TYPEV4 Of[rows_per_thread][d_per_thread];
[[unroll]] for (uint32_t r = 0; r < rows_per_thread; ++r) {
[[unroll]] for (uint32_t d = 0; d < d_per_thread; ++d) {
Of[r][d] = f16vec4(0.0);
Of[r][d] = O_TYPEV4(0.0);
}
}
@@ -222,15 +228,18 @@ void main() {
uint32_t d = (idx + tid) % (HSK_pad / 4);
uint32_t c = (idx + tid) / (HSK_pad / 4);
if (idx + gl_WorkGroupSize.x <= Bc * HSK_pad / 4 || c < Bc) {
f16vec4 K_Tf = f16vec4(0);
FLOAT_TYPEV4 K_Tf = FLOAT_TYPEV4(0);
if ((!KV_bounds_check || j * Bc + c < KV) && (HSK == HSK_pad || d < HSK / 4)) {
#if !defined(BFLOAT16)
if (USE_DECODE_K) {
uint coord = (j * Bc + c) * k_stride * BLOCK_SIZE_K + 4 * d;
uint ib = coord / BLOCK_SIZE_K;
uint iqs = (coord % BLOCK_SIZE_K);
K_Tf = dequantize4(ib, iqs, k_offset, BINDING_IDX_K);
} else {
K_Tf = f16vec4(data_kv4[k_offset / 4 + (j * Bc + c) * k_stride / 4 + d]);
} else
#endif
{
K_Tf = FLOAT_TYPEV4(data_kv4[k_offset / 4 + (j * Bc + c) * k_stride / 4 + d]);
}
}
@@ -244,16 +253,16 @@ void main() {
// Bc split across workgroup (four subgroups), loop over HSK in chunks of 16: 16 x 16 * 16 x 16 -> 16 x 16
// This is written transposed in order to allow for N being 8 if implementations need it
coopmat<ACC_TYPE, gl_ScopeSubgroup, MatBc, MatBr, gl_MatrixUseAccumulator> SfMat = coopmat<ACC_TYPE, gl_ScopeSubgroup, MatBc, MatBr, gl_MatrixUseAccumulator>(0);
coopmat<float16_t, gl_ScopeSubgroup, MatBc, 16, gl_MatrixUseA> KMat;
coopmat<float16_t, gl_ScopeSubgroup, 16, MatBr, gl_MatrixUseB> QMat;
coopmat<FLOAT_TYPE, gl_ScopeSubgroup, MatBc, 16, gl_MatrixUseA> KMat;
coopmat<FLOAT_TYPE, gl_ScopeSubgroup, 16, MatBr, gl_MatrixUseB> QMat;
[[unroll]] for (uint32_t d = 0; d < HSK_pad / 16; ++d) {
// If SHMEM_STAGING is set, a Bc * HSK_pad size tile of K is loaded to shmem
// If not, f16 K is loaded directly from global memory if aligned, otherwise
// If not, K is loaded directly from global memory if aligned, otherwise
// staged through a Bc * MatBr size staging buffer.
// If K is not type f16, then it is always staged for dequantization.
// If K is a quant type, then it is always staged for dequantization.
if (SHMEM_STAGING == 0) {
// For quants we always need to dequant into kvsh; for f16 we can load
// For quants we always need to dequant into kvsh; for f16/bf16 we can load
// directly from global memory when alignment / bounds allow it.
const bool stage_k = USE_DECODE_K || KV_bounds_check || d * 16 + 16 > HSK;
if (stage_k) {
@@ -262,15 +271,18 @@ void main() {
uint32_t col_vec = (idx + tid) % (MatBr / 4);
uint32_t row = (idx + tid) / (MatBr / 4);
if (idx + tid < Bc * MatBr / 4) {
f16vec4 K_Tf = f16vec4(0);
FLOAT_TYPEV4 K_Tf = FLOAT_TYPEV4(0);
if ((!KV_bounds_check || j * Bc + row < KV) && (HSK == HSK_pad || d * 16 + col_vec * 4 < HSK)) {
#if !defined(BFLOAT16)
if (USE_DECODE_K) {
uint coord = (j * Bc + row) * k_stride * BLOCK_SIZE_K + d * 16 + col_vec * 4;
uint ib = coord / BLOCK_SIZE_K;
uint iqs = (coord % BLOCK_SIZE_K);
K_Tf = dequantize4(ib, iqs, k_offset, BINDING_IDX_K);
} else {
K_Tf = f16vec4(data_kv4[k_offset / 4 + (j * Bc + row) * k_stride / 4 + d * 16 / 4 + col_vec]);
} else
#endif
{
K_Tf = FLOAT_TYPEV4(data_kv4[k_offset / 4 + (j * Bc + row) * k_stride / 4 + d * 16 / 4 + col_vec]);
}
}
@@ -357,7 +369,7 @@ void main() {
[[unroll]] for (uint32_t d0 = 0; d0 < HSV / 4; d0 += threads_per_rowgroup) {
const uint d_local = d0 / threads_per_rowgroup;
[[unroll]] for (uint32_t r = 0; r < rows_per_thread; ++r) {
Of[r][d_local] = float16_t(eMf[r]) * Of[r][d_local];
Of[r][d_local] = O_TYPE(eMf[r]) * Of[r][d_local];
}
}
@@ -368,10 +380,10 @@ void main() {
[[unroll]] for (uint32_t r = 0; r < rows_per_thread; r += 4) {
const uint row = tile_row(r);
if (KV_bounds_check && j * Bc + col >= KV) {
Psh[col * psh_stride + row / 4] = f16vec4(0.0f);
Psh[col * psh_stride + row / 4] = FLOAT_TYPEV4(0.0f);
} else {
const vec4 mfvec = vec4(Mf[r], Mf[r + 1], Mf[r + 2], Mf[r + 3]);
const f16vec4 Pf = f16vec4(exp(vec4(sfsh[row / 4 + col * sfshstride]) - mfvec));
const FLOAT_TYPEV4 Pf = FLOAT_TYPEV4(exp(vec4(sfsh[row / 4 + col * sfshstride]) - mfvec));
[[unroll]] for (uint32_t vec_idx = 0; vec_idx < 4; ++vec_idx) {
Lf[r + vec_idx] += Pf[vec_idx];
}
@@ -385,15 +397,18 @@ void main() {
uint32_t d = (idx + tid) % (HSV_pad / 4);
uint32_t c = (idx + tid) / (HSV_pad / 4);
if (idx + gl_WorkGroupSize.x <= Bc * HSV_pad / 4 || c < Bc) {
f16vec4 V_Tf = f16vec4(0);
FLOAT_TYPEV4 V_Tf = FLOAT_TYPEV4(0);
if ((!KV_bounds_check || j * Bc + c < KV) && (HSV == HSV_pad || d < HSV / 4)) {
#if !defined(BFLOAT16)
if (USE_DECODE_V) {
uint coord = (j * Bc + c) * v_stride * BLOCK_SIZE_V + 4 * d;
uint ib = coord / BLOCK_SIZE_V;
uint iqs = (coord % BLOCK_SIZE_V);
V_Tf = dequantize4(ib, iqs, v_offset, BINDING_IDX_V);
} else {
V_Tf = f16vec4(data_vv4[v_offset / 4 + (j * Bc + c) * v_stride / 4 + d]);
} else
#endif
{
V_Tf = FLOAT_TYPEV4(data_vv4[v_offset / 4 + (j * Bc + c) * v_stride / 4 + d]);
}
}
@@ -409,7 +424,7 @@ void main() {
[[unroll]] for (uint32_t hsv_tile = 0; hsv_tile < num_hsv_tiles; ++hsv_tile) {
const uint hsv_offset = (hsv_tile * row_split + gl_SubgroupID) * 16;
coopmat<float16_t, gl_ScopeSubgroup, MatBc, MatBr, gl_MatrixUseAccumulator> PVMat = coopmat<float16_t, gl_ScopeSubgroup, MatBc, MatBr, gl_MatrixUseAccumulator>(0);
coopmat<O_TYPE, gl_ScopeSubgroup, MatBc, MatBr, gl_MatrixUseAccumulator> PVMat = coopmat<O_TYPE, gl_ScopeSubgroup, MatBc, MatBr, gl_MatrixUseAccumulator>(0);
// Preload V tiles for [Bc, 16 * num subgroups]
const uint v_rows = Bc;
@@ -417,11 +432,11 @@ void main() {
const uint v_loads_per_thread = v_total / gl_WorkGroupSize.x;
// If SHMEM_STAGING is set, a Bc * HSV_pad size tile of V is loaded to shmem.
// If not, f16 V is loaded directly from global memory if aligned, otherwise
// If not, V is loaded directly from global memory if aligned, otherwise
// staged through a Bc * MatBr size staging buffer.
// If V is not type f16, then it is always staged for dequantization.
// If V is a quant type, then it is always staged for dequantization.
if (SHMEM_STAGING == 0) {
// For quants we always preload via kvsh. For f16 we only preload when
// For quants we always preload via kvsh. For f16/bf16 we only preload when
// alignment / bounds force it (otherwise we coopMatLoad direct from data_vv4).
const bool stage_v = USE_DECODE_V || KV_bounds_check;
if (stage_v) {
@@ -438,13 +453,16 @@ void main() {
const uint iqs = coord % BLOCK_SIZE_V;
if (!KV_bounds_check || (v_row < KV && v_col < HSV)) {
#if !defined(BFLOAT16)
if (USE_DECODE_V) {
kvsh[row * vsh_stride + col] = dequantize4(ib, iqs, v_offset, BINDING_IDX_V);
} else {
} else
#endif
{
kvsh[row * vsh_stride + col] = data_vv4[(v_offset + v_row * v_stride + v_col) / 4];
}
} else {
kvsh[row * vsh_stride + col] = f16vec4(0.0f);
kvsh[row * vsh_stride + col] = FLOAT_TYPEV4(0.0f);
}
}
}
@@ -459,7 +477,7 @@ void main() {
if (SHMEM_STAGING == 0) {
if (!USE_DECODE_V && !KV_bounds_check) {
// F16 values can be loaded directly from global memory
// F16/BF16 values can be loaded directly from global memory
const uint v_tile_row = j * Bc + bc_chunk * MatBc;
const uint v_tile_offset = v_offset / 4 + v_tile_row * v_stride / 4 + hsv_offset / 4;
coopMatLoad(QMat, data_vv4, v_tile_offset, v_stride / 4, gl_CooperativeMatrixLayoutRowMajor);
@@ -573,7 +591,7 @@ void main() {
[[unroll]] for (uint32_t d0 = 0; d0 < HSV / 4; d0 += threads_per_rowgroup) {
const uint d_local = d0 / threads_per_rowgroup;
Of[r][d_local] *= float16_t(ms);
Of[r][d_local] *= O_TYPE(ms);
}
} else {
vs = exp(sink - Mf[r]);
@@ -591,7 +609,7 @@ void main() {
[[unroll]] for (uint32_t d0 = 0; d0 < HSV / 4; d0 += threads_per_rowgroup) {
const uint d_local = d0 / threads_per_rowgroup;
[[unroll]] for (uint32_t r = 0; r < rows_per_thread; ++r) {
Of[r][d_local] *= float16_t(Lfrcp[r]);
Of[r][d_local] *= O_TYPE(Lfrcp[r]);
#if defined(FLOAT_TYPE_MAX)
Of[r][d_local] = clamp(Of[r][d_local], -FLOAT_TYPE_MAX, FLOAT_TYPE_MAX);
#endif
@@ -8,6 +8,10 @@
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int16 : require
#if defined(BFLOAT16)
#extension GL_EXT_bfloat16 : enable
#endif
#extension GL_KHR_memory_scope_semantics : enable
#extension GL_KHR_cooperative_matrix : enable
#extension GL_NV_cooperative_matrix2 : enable
@@ -21,7 +25,9 @@
#include "types.glsl"
#include "flash_attn_base.glsl"
#if !defined(BFLOAT16)
#include "dequant_funcs_cm2.glsl"
#endif
// buffer_reference stride = sizeof(struct) = FaBlockBytesK/V.
layout(buffer_reference, std430, buffer_reference_align = 1) buffer decodeBufFA_K {
@@ -31,6 +37,7 @@ layout(buffer_reference, std430, buffer_reference_align = 1) buffer decodeBufFA_
uint8_t raw[FaBlockBytesV];
};
#if !defined(BFLOAT16)
float16_t faDecodeK(const decodeBufFA_K bl_in, const uint blockCoords[2], const uint coordInBlock[2]) {
switch (FaTypeK) {
case FA_TYPE_F32: return dequantFuncF32 (decodeBufF32 (bl_in), blockCoords, coordInBlock);
@@ -91,6 +98,7 @@ f16vec4 faDecodeVVector(const decodeBufFA_V bl_in, const uint blockCoords[2], co
#define FADECODEK , faDecodeK
#define FADECODEV , faDecodeV
#endif
#endif
layout (binding = 0) readonly buffer Q {uint8_t data_q[];};
layout (binding = 1) readonly buffer K {uint8_t data_k[];};
@@ -195,15 +203,15 @@ void main() {
tensorLayoutV = setTensorLayoutStrideNV(tensorLayoutV, v_stride, 1);
coopmat<Q_TYPE, gl_ScopeWorkgroup, Br, HSK_pad, gl_MatrixUseAccumulator> Q;
coopmat<float16_t, gl_ScopeWorkgroup, Br, HSK_pad, gl_MatrixUseA> Qf16;
coopmat<FLOAT_TYPE, gl_ScopeWorkgroup, Br, HSK_pad, gl_MatrixUseA> Qf16;
uint32_t q_offset = gqa_iq1*p.nb01*4/*sizeof(float)*/ + iq2*p.nb02+iq3*p.nb03;
coopMatLoadTensorNV(Q, data_q, q_offset, sliceTensorLayoutNV(tensorLayoutQ, i * Br, Br, 0, HSK_pad));
Qf16 = coopmat<float16_t, gl_ScopeWorkgroup, Br, HSK_pad, gl_MatrixUseA>(Q);
Qf16 *= float16_t(p.scale);
Q *= Q_TYPE(p.scale);
Qf16 = coopmat<FLOAT_TYPE, gl_ScopeWorkgroup, Br, HSK_pad, gl_MatrixUseA>(Q);
coopmat<float16_t, gl_ScopeWorkgroup, Br, HSV_pad, gl_MatrixUseAccumulator> O = coopmat<float16_t, gl_ScopeWorkgroup, Br, HSV_pad, gl_MatrixUseAccumulator>(0);
coopmat<O_TYPE, gl_ScopeWorkgroup, Br, HSV_pad, gl_MatrixUseAccumulator> O = coopmat<O_TYPE, gl_ScopeWorkgroup, Br, HSV_pad, gl_MatrixUseAccumulator>(0);
coopmat<ACC_TYPE, gl_ScopeWorkgroup, Br, Bc, gl_MatrixUseAccumulator> L, M;
@@ -291,16 +299,20 @@ void main() {
coopmat<ACC_TYPE, gl_ScopeWorkgroup, Br, Bc, gl_MatrixUseAccumulator> S = coopmat<ACC_TYPE, gl_ScopeWorkgroup, Br, Bc, gl_MatrixUseAccumulator>(0);
coopmat<float16_t, gl_ScopeWorkgroup, HSK_pad, Bc, gl_MatrixUseB> K_T;
coopmat<FLOAT_TYPE, gl_ScopeWorkgroup, HSK_pad, Bc, gl_MatrixUseB> K_T;
uint32_t k_offset = ik2*p.nb12 + ik3*p.nb13;
// F16: bs_k==1 (direct load). F32: bs_k==4 (vec4 / dequantFuncF32). Q4/Q8 family: bs_k==32. Q1_0: bs_k==128.
#if defined(BFLOAT16)
coopMatLoadTensorNV(K_T, data_k, k_offset, sliceTensorLayoutNV(tensorLayoutK, j * Bc, Bc, 0, HSK_pad), tensorViewTranspose);
#else
const bool k_use_decode = (bs_k > 1u);
if (k_use_decode) {
coopMatLoadTensorNV(K_T, data_k, k_offset, sliceTensorLayoutNV(tensorLayoutK, j * Bc, Bc, 0, HSK_pad), tensorViewTranspose FADECODEK);
} else {
coopMatLoadTensorNV(K_T, data_k, k_offset, sliceTensorLayoutNV(tensorLayoutK, j * Bc, Bc, 0, HSK_pad), tensorViewTranspose);
}
#endif
S = coopMatMulAdd(Qf16, K_T, S);
if (LOGIT_SOFTCAP) {
@@ -351,22 +363,26 @@ void main() {
coopMatPerElementNV(P, P, replacePadding, ACC_TYPE(0.0), R, C);
}
coopmat<float16_t, gl_ScopeWorkgroup, Br, Bc, gl_MatrixUseA> P_A = coopmat<float16_t, gl_ScopeWorkgroup, Br, Bc, gl_MatrixUseA>(P);
coopmat<FLOAT_TYPE, gl_ScopeWorkgroup, Br, Bc, gl_MatrixUseA> P_A = coopmat<FLOAT_TYPE, gl_ScopeWorkgroup, Br, Bc, gl_MatrixUseA>(P);
// compute rowsum by multiplying by matrix of all ones.
coopmat<float16_t, gl_ScopeWorkgroup, Bc, Bc, gl_MatrixUseB> One = coopmat<float16_t, gl_ScopeWorkgroup, Bc, Bc, gl_MatrixUseB>(1.0);
coopmat<FLOAT_TYPE, gl_ScopeWorkgroup, Bc, Bc, gl_MatrixUseB> One = coopmat<FLOAT_TYPE, gl_ScopeWorkgroup, Bc, Bc, gl_MatrixUseB>(1.0);
rowsum = coopmat<ACC_TYPE, gl_ScopeWorkgroup, Br, Bc, gl_MatrixUseAccumulator>(0.0);
rowsum = coopMatMulAdd(P_A, One, rowsum);
coopmat<float16_t, gl_ScopeWorkgroup, Bc, HSV_pad, gl_MatrixUseB> V;
coopmat<FLOAT_TYPE, gl_ScopeWorkgroup, Bc, HSV_pad, gl_MatrixUseB> V;
uint32_t v_offset = iv2*p.nb22 + iv3*p.nb23;
#if defined(BFLOAT16)
coopMatLoadTensorNV(V, data_v, v_offset, sliceTensorLayoutNV(tensorLayoutV, j * Bc, Bc, 0, HSV_pad));
#else
const bool v_use_decode = (bs_v > 1u);
if (v_use_decode) {
coopMatLoadTensorNV(V, data_v, v_offset, sliceTensorLayoutNV(tensorLayoutV, j * Bc, Bc, 0, HSV_pad) FADECODEV);
} else {
coopMatLoadTensorNV(V, data_v, v_offset, sliceTensorLayoutNV(tensorLayoutV, j * Bc, Bc, 0, HSV_pad));
}
#endif
L = eM*L + rowsum;
@@ -378,7 +394,7 @@ void main() {
// resize eM by using smear/reduce
coopMatReduceNV(eMdiag, eM, gl_CooperativeMatrixReduceRowNV, smearReduce);
O *= coopmat<float16_t, gl_ScopeWorkgroup, Br, HSV_pad, gl_MatrixUseAccumulator>(eMdiag);
O *= coopmat<O_TYPE, gl_ScopeWorkgroup, Br, HSV_pad, gl_MatrixUseAccumulator>(eMdiag);
O = coopMatMulAdd(P_A, V, O);
}
@@ -427,7 +443,7 @@ void main() {
if (sink > Mr[i]) {
ms = exp(Mr[i] - sink);
O[i] *= float16_t(ms);
O[i] *= O_TYPE(ms);
} else {
vs = exp(sink - Mr[i]);
}
@@ -28,6 +28,9 @@ layout (binding = 2) readonly buffer V_PACKED_Q5_1 { block_q5_1_packed16 data[];
layout (binding = 1) readonly buffer K_PACKED_Q8_0 { block_q8_0_packed16 data[]; } k_packed_q8_0;
layout (binding = 2) readonly buffer V_PACKED_Q8_0 { block_q8_0_packed16 data[]; } v_packed_q8_0;
layout (binding = 1) readonly buffer K_PACKED_BF16 { u16vec4 data[]; } k_packed_bf16;
layout (binding = 2) readonly buffer V_PACKED_BF16 { u16vec4 data[]; } v_packed_bf16;
// Q4_1 and Q5_1 packed32 views: aliased to the same memory as the packed16
// views, used by the MMQ K-side hot path for fast 4-uint loads.
layout (binding = 1) readonly buffer K_PACKED_Q4_1_P32 { block_q4_1_packed32 data[]; } k_packed_q4_1_p32;
@@ -99,6 +102,9 @@ layout (binding = 1) readonly buffer K_PACKED_Q5_1_P32 { block_q5_1_packed32 dat
return FLOAT_TYPE(BUF.data[a_offset + ib].d) * FLOAT_TYPEV4(v0.x, v0.y, v1.x, v1.y); \
}
#define FA_DEQUANT4_BF16(BUF) \
return FLOAT_TYPEV4(bf16_to_fp32(uvec4(BUF.data[(a_offset + ib) / 4])));
FLOAT_TYPEV4 dequantize4(uint ib, uint iqs, uint a_offset, uint binding_idx) {
if (binding_idx == BINDING_IDX_K) {
switch (FaTypeK) {
@@ -108,6 +114,7 @@ FLOAT_TYPEV4 dequantize4(uint ib, uint iqs, uint a_offset, uint binding_idx) {
case FA_TYPE_Q5_0: FA_DEQUANT4_Q5_0(k_packed_q5_0)
case FA_TYPE_Q5_1: FA_DEQUANT4_Q5_1(k_packed_q5_1)
case FA_TYPE_Q8_0: FA_DEQUANT4_Q8_0(k_packed_q8_0)
case FA_TYPE_BF16: FA_DEQUANT4_BF16(k_packed_bf16)
}
} else {
switch (FaTypeV) {
@@ -117,6 +124,7 @@ FLOAT_TYPEV4 dequantize4(uint ib, uint iqs, uint a_offset, uint binding_idx) {
case FA_TYPE_Q5_0: FA_DEQUANT4_Q5_0(v_packed_q5_0)
case FA_TYPE_Q5_1: FA_DEQUANT4_Q5_1(v_packed_q5_1)
case FA_TYPE_Q8_0: FA_DEQUANT4_Q8_0(v_packed_q8_0)
case FA_TYPE_BF16: FA_DEQUANT4_BF16(v_packed_bf16)
}
}
return FLOAT_TYPEV4(0);
@@ -0,0 +1,69 @@
#version 450
#extension GL_EXT_control_flow_attributes : require
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_shuffle : enable
layout(local_size_x_id = 0, local_size_y = 4, local_size_z = 1) in;
layout(constant_id = 0) const uint WARP_SIZE = 32;
layout(constant_id = 1) const uint N = 128;
layout(push_constant) uniform parameter
{
uint n_rows;
uint src_offset;
uint dst_offset;
float scale;
};
layout(binding = 0, std430) readonly buffer A { float data_a[]; };
layout(binding = 1, std430) writeonly buffer D { float data_d[]; };
const uint EL_W = N / WARP_SIZE;
void main() {
const uint lane = gl_SubgroupInvocationID;
for (uint row = gl_WorkGroupID.x * gl_WorkGroupSize.y + gl_SubgroupID;
row < n_rows;
row += gl_NumWorkGroups.x * gl_WorkGroupSize.y) {
const uint row_offset = row * N;
float reg[EL_W];
[[unroll]]
for (uint i = 0; i < EL_W; ++i) {
reg[i] = data_a[src_offset + row_offset + i * WARP_SIZE + lane] * scale;
}
[[unroll]]
for (uint h = 1; h < WARP_SIZE; h <<= 1) {
[[unroll]]
for (uint j = 0; j < EL_W; ++j) {
const float val = reg[j];
const float val2 = subgroupShuffleXor(val, h);
reg[j] = (lane & h) == 0 ? val + val2 : val2 - val;
}
}
[[unroll]]
for (uint h = WARP_SIZE; h < N; h <<= 1) {
const uint step = h / WARP_SIZE;
[[unroll]]
for (uint j = 0; j < EL_W; j += 2 * step) {
[[unroll]]
for (uint k = 0; k < step; ++k) {
const float x = reg[j + k];
const float y = reg[j + k + step];
reg[j + k] = x + y;
reg[j + k + step] = x - y;
}
}
}
[[unroll]]
for (uint i = 0; i < EL_W; ++i) {
data_d[dst_offset + row_offset + i * WARP_SIZE + lane] = reg[i];
}
}
}
@@ -212,28 +212,40 @@ i32vec4 repack4(uint ib, uint iqs) {
const uint qs_shift = ((iqs_k % 32) / 8) * 2;
const uint hm_shift = iqs_k / 8;
// bitwise OR to add 4 if hmask is set, subtract later
const i8vec2 vals00 = unpack8(int16_t((data_a_packed16[ib_k].qs[qs_idx * 2 ] >> qs_shift) & uint16_t(0x0303))) |
unpack8(int16_t(((data_a_packed16[ib_k].hmask[iqs * 2 ] >> hm_shift) & uint16_t(0x0101)) << 2));
const i8vec2 vals01 = unpack8(int16_t((data_a_packed16[ib_k].qs[qs_idx * 2 + 1] >> qs_shift) & uint16_t(0x0303))) |
unpack8(int16_t(((data_a_packed16[ib_k].hmask[iqs * 2 + 1] >> hm_shift) & uint16_t(0x0101)) << 2));
const i8vec2 vals10 = unpack8(int16_t((data_a_packed16[ib_k].qs[qs_idx * 2 + 2] >> qs_shift) & uint16_t(0x0303))) |
unpack8(int16_t(((data_a_packed16[ib_k].hmask[iqs * 2 + 2] >> hm_shift) & uint16_t(0x0101)) << 2));
const i8vec2 vals11 = unpack8(int16_t((data_a_packed16[ib_k].qs[qs_idx * 2 + 3] >> qs_shift) & uint16_t(0x0303))) |
unpack8(int16_t(((data_a_packed16[ib_k].hmask[iqs * 2 + 3] >> hm_shift) & uint16_t(0x0101)) << 2));
const i8vec2 vals20 = unpack8(int16_t((data_a_packed16[ib_k].qs[qs_idx * 2 + 4] >> qs_shift) & uint16_t(0x0303))) |
unpack8(int16_t(((data_a_packed16[ib_k].hmask[iqs * 2 + 4] >> hm_shift) & uint16_t(0x0101)) << 2));
const i8vec2 vals21 = unpack8(int16_t((data_a_packed16[ib_k].qs[qs_idx * 2 + 5] >> qs_shift) & uint16_t(0x0303))) |
unpack8(int16_t(((data_a_packed16[ib_k].hmask[iqs * 2 + 5] >> hm_shift) & uint16_t(0x0101)) << 2));
const i8vec2 vals30 = unpack8(int16_t((data_a_packed16[ib_k].qs[qs_idx * 2 + 6] >> qs_shift) & uint16_t(0x0303))) |
unpack8(int16_t(((data_a_packed16[ib_k].hmask[iqs * 2 + 6] >> hm_shift) & uint16_t(0x0101)) << 2));
const i8vec2 vals31 = unpack8(int16_t((data_a_packed16[ib_k].qs[qs_idx * 2 + 7] >> qs_shift) & uint16_t(0x0303))) |
unpack8(int16_t(((data_a_packed16[ib_k].hmask[iqs * 2 + 7] >> hm_shift) & uint16_t(0x0101)) << 2));
const uvec4 qs = uvec4( uint32_t(data_a_packed16[ib_k].qs[qs_idx * 2 ]) |
(uint32_t(data_a_packed16[ib_k].qs[qs_idx * 2 + 1]) << 16),
uint32_t(data_a_packed16[ib_k].qs[qs_idx * 2 + 2]) |
(uint32_t(data_a_packed16[ib_k].qs[qs_idx * 2 + 3]) << 16),
uint32_t(data_a_packed16[ib_k].qs[qs_idx * 2 + 4]) |
(uint32_t(data_a_packed16[ib_k].qs[qs_idx * 2 + 5]) << 16),
uint32_t(data_a_packed16[ib_k].qs[qs_idx * 2 + 6]) |
(uint32_t(data_a_packed16[ib_k].qs[qs_idx * 2 + 7]) << 16));
return i32vec4(pack32(i8vec4(vals00.x, vals00.y, vals01.x, vals01.y) - int8_t(4)),
pack32(i8vec4(vals10.x, vals10.y, vals11.x, vals11.y) - int8_t(4)),
pack32(i8vec4(vals20.x, vals20.y, vals21.x, vals21.y) - int8_t(4)),
pack32(i8vec4(vals30.x, vals30.y, vals31.x, vals31.y) - int8_t(4)));
const uvec4 hmask = uvec4( uint32_t(data_a_packed16[ib_k].hmask[iqs * 2 ]) |
(uint32_t(data_a_packed16[ib_k].hmask[iqs * 2 + 1]) << 16),
uint32_t(data_a_packed16[ib_k].hmask[iqs * 2 + 2]) |
(uint32_t(data_a_packed16[ib_k].hmask[iqs * 2 + 3]) << 16),
uint32_t(data_a_packed16[ib_k].hmask[iqs * 2 + 4]) |
(uint32_t(data_a_packed16[ib_k].hmask[iqs * 2 + 5]) << 16),
uint32_t(data_a_packed16[ib_k].hmask[iqs * 2 + 6]) |
(uint32_t(data_a_packed16[ib_k].hmask[iqs * 2 + 7]) << 16));
// bitwise OR to add 4 if hmask is set, subtract later
const uint vals0 = (( qs.x >> qs_shift) & 0x03030303) |
(((hmask.x >> hm_shift) & 0x01010101) << 2);
const uint vals1 = (( qs.y >> qs_shift) & 0x03030303) |
(((hmask.y >> hm_shift) & 0x01010101) << 2);
const uint vals2 = (( qs.z >> qs_shift) & 0x03030303) |
(((hmask.z >> hm_shift) & 0x01010101) << 2);
const uint vals3 = (( qs.w >> qs_shift) & 0x03030303) |
(((hmask.w >> hm_shift) & 0x01010101) << 2);
// Subtract 4 by twiddling bits rather than using re-packing as mesa
// compiles repacking poorly.
return i32vec4(int32_t(((vals0 ^ 0x80808080) - 0x04040404) ^ 0x80808080),
int32_t(((vals1 ^ 0x80808080) - 0x04040404) ^ 0x80808080),
int32_t(((vals2 ^ 0x80808080) - 0x04040404) ^ 0x80808080),
int32_t(((vals3 ^ 0x80808080) - 0x04040404) ^ 0x80808080));
}
float get_d_scale(uint ib, uint iqs) {
@@ -343,27 +355,39 @@ i32vec4 repack4(uint ib, uint iqs) {
const uint qh_idx = (iqs_k / 32) * 8 + iqs;
const uint qh_shift = ((iqs_k % 32) / 8) * 2;
const i8vec2 vals00 = (unpack8(int16_t((data_a_packed16[ib_k].ql[ql_idx * 2 ] >> ql_shift) & uint16_t(0x0F0F))) |
unpack8(int16_t(((data_a_packed16[ib_k].qh[qh_idx * 2 ] >> qh_shift) & uint16_t(0x0303)) << 4))) - int8_t(32);
const i8vec2 vals01 = (unpack8(int16_t((data_a_packed16[ib_k].ql[ql_idx * 2 + 1] >> ql_shift) & uint16_t(0x0F0F))) |
unpack8(int16_t(((data_a_packed16[ib_k].qh[qh_idx * 2 + 1] >> qh_shift) & uint16_t(0x0303)) << 4))) - int8_t(32);
const i8vec2 vals10 = (unpack8(int16_t((data_a_packed16[ib_k].ql[ql_idx * 2 + 2] >> ql_shift) & uint16_t(0x0F0F))) |
unpack8(int16_t(((data_a_packed16[ib_k].qh[qh_idx * 2 + 2] >> qh_shift) & uint16_t(0x0303)) << 4))) - int8_t(32);
const i8vec2 vals11 = (unpack8(int16_t((data_a_packed16[ib_k].ql[ql_idx * 2 + 3] >> ql_shift) & uint16_t(0x0F0F))) |
unpack8(int16_t(((data_a_packed16[ib_k].qh[qh_idx * 2 + 3] >> qh_shift) & uint16_t(0x0303)) << 4))) - int8_t(32);
const i8vec2 vals20 = (unpack8(int16_t((data_a_packed16[ib_k].ql[ql_idx * 2 + 4] >> ql_shift) & uint16_t(0x0F0F))) |
unpack8(int16_t(((data_a_packed16[ib_k].qh[qh_idx * 2 + 4] >> qh_shift) & uint16_t(0x0303)) << 4))) - int8_t(32);
const i8vec2 vals21 = (unpack8(int16_t((data_a_packed16[ib_k].ql[ql_idx * 2 + 5] >> ql_shift) & uint16_t(0x0F0F))) |
unpack8(int16_t(((data_a_packed16[ib_k].qh[qh_idx * 2 + 5] >> qh_shift) & uint16_t(0x0303)) << 4))) - int8_t(32);
const i8vec2 vals30 = (unpack8(int16_t((data_a_packed16[ib_k].ql[ql_idx * 2 + 6] >> ql_shift) & uint16_t(0x0F0F))) |
unpack8(int16_t(((data_a_packed16[ib_k].qh[qh_idx * 2 + 6] >> qh_shift) & uint16_t(0x0303)) << 4))) - int8_t(32);
const i8vec2 vals31 = (unpack8(int16_t((data_a_packed16[ib_k].ql[ql_idx * 2 + 7] >> ql_shift) & uint16_t(0x0F0F))) |
unpack8(int16_t(((data_a_packed16[ib_k].qh[qh_idx * 2 + 7] >> qh_shift) & uint16_t(0x0303)) << 4))) - int8_t(32);
const uvec4 ql = uvec4( uint32_t(data_a_packed16[ib_k].ql[ql_idx * 2 ]) |
(uint32_t(data_a_packed16[ib_k].ql[ql_idx * 2 + 1]) << 16),
uint32_t(data_a_packed16[ib_k].ql[ql_idx * 2 + 2]) |
(uint32_t(data_a_packed16[ib_k].ql[ql_idx * 2 + 3]) << 16),
uint32_t(data_a_packed16[ib_k].ql[ql_idx * 2 + 4]) |
(uint32_t(data_a_packed16[ib_k].ql[ql_idx * 2 + 5]) << 16),
uint32_t(data_a_packed16[ib_k].ql[ql_idx * 2 + 6]) |
(uint32_t(data_a_packed16[ib_k].ql[ql_idx * 2 + 7]) << 16));
return i32vec4(pack32(i8vec4(vals00.x, vals00.y, vals01.x, vals01.y)),
pack32(i8vec4(vals10.x, vals10.y, vals11.x, vals11.y)),
pack32(i8vec4(vals20.x, vals20.y, vals21.x, vals21.y)),
pack32(i8vec4(vals30.x, vals30.y, vals31.x, vals31.y)));
const uvec4 qh = uvec4( uint32_t(data_a_packed16[ib_k].qh[qh_idx * 2 ]) |
(uint32_t(data_a_packed16[ib_k].qh[qh_idx * 2 + 1]) << 16),
uint32_t(data_a_packed16[ib_k].qh[qh_idx * 2 + 2]) |
(uint32_t(data_a_packed16[ib_k].qh[qh_idx * 2 + 3]) << 16),
uint32_t(data_a_packed16[ib_k].qh[qh_idx * 2 + 4]) |
(uint32_t(data_a_packed16[ib_k].qh[qh_idx * 2 + 5]) << 16),
uint32_t(data_a_packed16[ib_k].qh[qh_idx * 2 + 6]) |
(uint32_t(data_a_packed16[ib_k].qh[qh_idx * 2 + 7]) << 16));
const uint vals0 = (( ql.x >> ql_shift) & 0x0F0F0F0F) |
(((qh.x >> qh_shift) & 0x03030303) << 4);
const uint vals1 = (( ql.y >> ql_shift) & 0x0F0F0F0F) |
(((qh.y >> qh_shift) & 0x03030303) << 4);
const uint vals2 = (( ql.z >> ql_shift) & 0x0F0F0F0F) |
(((qh.z >> qh_shift) & 0x03030303) << 4);
const uint vals3 = (( ql.w >> ql_shift) & 0x0F0F0F0F) |
(((qh.w >> qh_shift) & 0x03030303) << 4);
// Subtract 32 by twiddling bits rather than using re-packing as mesa
// compiles repacking poorly.
return i32vec4(int32_t(((vals0 ^ 0x80808080) - 0x20202020) ^ 0x80808080),
int32_t(((vals1 ^ 0x80808080) - 0x20202020) ^ 0x80808080),
int32_t(((vals2 ^ 0x80808080) - 0x20202020) ^ 0x80808080),
int32_t(((vals3 ^ 0x80808080) - 0x20202020) ^ 0x80808080));
}
float get_d_scale(uint ib, uint iqs) {
@@ -662,6 +662,28 @@ void process_shaders() {
}
}
const std::map<std::string, std::string> fa_bf16_dict = {
{"FLOAT_TYPE", "bfloat16_t"},
{"FLOAT_TYPEV2", "bf16vec2"},
{"FLOAT_TYPEV4", "bf16vec4"},
{"ACC_TYPE", "float"},
{"ACC_TYPEV2", "vec2"},
{"ACC_TYPEV4", "vec4"},
{"BFLOAT16", "1"},
};
#if defined(GGML_VULKAN_BFLOAT16_GLSLC_SUPPORT) && defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
string_to_spv("flash_attn_f32_f16_bf16", "flash_attn_cm1.comp",
merge_maps(fa_bf16_dict, {{"Q_TYPE", "float"}, {"D_TYPE", "float"}, {"D_TYPEV4", "vec4"}, {"COOPMAT", "1"}}),
true, true, false, false);
#endif
#if defined(GGML_VULKAN_BFLOAT16_GLSLC_SUPPORT) && defined(GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
string_to_spv("flash_attn_f32_f16_bf16", "flash_attn_cm2.comp",
merge_maps(fa_bf16_dict, {{"Q_TYPE", "float"}, {"D_TYPE", "float"}, {"D_TYPEV4", "vec4"}}),
true, false, true, false);
#endif
std::map<std::string, std::string> base_dict = {{"FLOAT_TYPE", "float"}, {"FLOAT_TYPEV2", "vec2"}};
for (const auto& tname : type_names) {
@@ -934,6 +956,7 @@ void process_shaders() {
string_to_spv("argmax_f32", "argmax.comp", merge_maps(base_dict, {{"A_TYPE", "float"}, {"D_TYPE", "int"}}));
string_to_spv("sum_rows_f32", "sum_rows.comp", merge_maps(base_dict, {{"A_TYPE", "float"}, {"D_TYPE", "float"}}));
string_to_spv("fwht_f32", "fwht.comp", {});
string_to_spv("count_equal_i32", "count_equal.comp", merge_maps(base_dict, {{"A_TYPE", "int"}, {"B_TYPE", "int"}, {"D_TYPE", "int"}}));
string_to_spv("cumsum_f32", "cumsum.comp", merge_maps(base_dict, {{"A_TYPE", "float"}, {"D_TYPE", "float"}}));
string_to_spv("cumsum_multipass1_f32", "cumsum_multipass1.comp", merge_maps(base_dict, {{"A_TYPE", "float"}, {"D_TYPE", "float"}}));

Some files were not shown because too many files have changed in this diff Show More