Compare commits

..

63 Commits

Author SHA1 Message Date
RealOrko 2dcb7f74ed fix: free ctx_copy in ggml_opt_free to plug per-training-session leak (#21592)
* fix: free ctx_copy in ggml_opt_free to plug per-training-session leak

ggml_opt_alloc populates opt_ctx->ctx_copy via a free+init pair every
time the allocated graph shape changes. The last ctx_copy from the
final ggml_opt_alloc call survives until ggml_opt_free is invoked,
but ggml_opt_free was only freeing ctx_static and ctx_cpu, never
ctx_copy. Each opt_ctx lifetime therefore leaks the final per-batch
context — ~900 KB for a typical GNN training session in
sindarin-pkg-tensor, surfaced via AddressSanitizer.

ctx_copy is nullptr-initialized and ggml_free() handles NULL safely,
so the new release is guard-free.

* Update ggml/src/ggml-opt.cpp

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: realorko <realorko@nowhere.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-08 17:40:15 +02:00
Yuri Khrustalev 660600081f server: respect the ignore eos flag (#21203) 2026-04-08 17:12:15 +02:00
Aldehir Rojas d9a12c82f0 vocab : remove </s> eog token if gemma4 (#21492) 2026-04-08 09:53:06 -05:00
Georgi Gerganov 4a05e0c566 webui : send both backend_sampling == false/true (#18781)
* webui : send both backend_sampling == false/true

* feat: Parameter sync

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-04-08 16:35:52 +02:00
John Eismeier e9fd96283d Propose fix a couple of typos (#21581)
Signed-off-by: John E <jeis4wpi@outlook.com>
2026-04-08 16:29:03 +02:00
Erik Scholz 3ba12fed0a kv-cache : extend cache quantization checks (#21586)
to also check for enabled flash attention, instead of just auto.
2026-04-08 16:08:57 +03:00
Reese Levine 5473949070 webgpu : Query for adapter support when registering WebGPU backend (#21579) 2026-04-08 16:08:29 +03:00
Pasha Khosravi dcdcbad42a metal: Q1_0 backend (#21528)
* initial Q1_0 Metal backend

* tuning q1_0 metal kernels

* add Q1_0 to test-backend-ops

* add Q1_0<->F32 copy test

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-08 16:07:47 +03:00
Georgi Gerganov 5764d7c6a6 gemma : perform per-layer projections in the first layer (#21612)
* gemma : reduce graph splits by keeping per-layer ops in the input layer

* gemma : put the per-layer proj in the first layer

* cont : move the projection before the layer loop
2026-04-08 16:06:30 +03:00
Daniel Bevenius 87f4744a80 examples : disable cb_eval callback for --save-logits (#21553)
This commit updates the debug example to not create the
base_callback_data.

The motivation for this is when using `--save-logits`, which is used by
examples/model-conversion scripts, we often don't care about the tensor
outputs and they just add noise to the output. This changes is quiet by
default we can always remove --save-logits to get the tensor outputs
when debugging.
2026-04-08 14:10:33 +02:00
Piotr Wilkin (ilintar) 85d482e6b6 parser: fix MiniMax handling (#21573) 2026-04-08 12:47:25 +02:00
Georgi Gerganov ae65fbdf33 tests : remove obsolete .mjs script (#21615) 2026-04-08 13:20:46 +03:00
Aleksander Grygier 3bd9aa1f92 chore: Update labeler to have separate labels for server/webui and server changes (#21567) 2026-04-08 10:35:31 +02:00
Aleksander Grygier ece522f98c chore: Remove legacy files (#21606) 2026-04-08 09:55:08 +02:00
forforever73 09343c0198 model : support step3-vl-10b (#21287)
* feat: support step3-vl-10b

* use fused QKV && mapping tensor in tensor_mapping.py

* guard hardcoded params and drop crop metadata

* get understand_projector_stride from global config

* img_u8_resize_bilinear_to_f32 move in step3vl class

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix the \r\n mess

* add width and heads to MmprojModel.set_gguf_parameters

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-08 09:51:31 +02:00
Hamish M. Blair 97508acb17 webui: fix syntax highlighting lost after streaming for non-common languages (#21206)
* webui: fix syntax highlighting lost for non-common languages after streaming

rehype-highlight uses lowlight internally, which only bundles 37 "common"
languages. The streaming code path uses highlight.js directly (192 languages),
so languages like Haskell highlight correctly while streaming but lose all
color once the code block closes. Pass the full lowlight language set to
rehype-highlight so both paths support the same languages.

* webui: rebuild static files after rebase
2026-04-08 08:58:08 +02:00
Martin Klacer 5c4aae66e1 devops: kleidiai: provide KleidiAI-Enabled ARM Release Artifact (#21259)
* Unified macOS release setup with strategy-matrix block
 * Added KleidiAI arm64 macOS release definition


Change-Id: I05520889ffc646488a178d06817a17f29274465a

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
2026-04-08 13:06:12 +08:00
Aman Gupta c5ce4bc227 CUDA: make cuda graphs props check faster (#21472)
* CUDA: compute fast hash instead of expensive props check

* use seen node

* use memcp
2026-04-08 09:05:51 +08:00
iacopPBK 66c4f9ded0 ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (#21168)
* ds_read_b128 for q4_0 and q4_1 mmq kernels

     Current for loop generates ds_read_b32 instructions with hip compiler, the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, its faster on both.

* Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation

* Explicit for loop in mmq, renamed vec into tmp

* Fixed max_cpy usage in the loading loop

* Fixed typo in q4_1 kernel

* Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Renoved trailing white line 500

* Update mmq.cuh removed other whitelines

* Remove trailing whitespaces

---------

Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: iacopPBK <iacop@deneb.com>
2026-04-07 21:47:42 +02:00
Daniel Bevenius 93bdc61563 gguf-py : fix missing comma after bad merge in tensor-mapping (#21558)
This commit adds a missing comma in the vision encoder attention qkv
block.

The motivation for this change is that without the comma there will be
a string concatenation of the Kimi-K2.5 and the Nemotron Nano v2 VL
tensor mappings which will be broken.
2026-04-07 21:24:25 +02:00
Georgi Gerganov 4eb19514dd kv-cache : support attention rotation for heterogeneous iSWA (#21513)
* kv-cache : support attention rotation for heterogeneous iSWA

* cont : remove assert
2026-04-07 20:31:28 +03:00
Reese Levine 957d717ce5 ggml-webgpu: parameterize submission size and add iOS specific limits (#21533)
* Work towards removing bitcast

* Move rest of existing types over

* Add timeout back to wait and remove synchronous set_tensor/memset_tensor

* move to unpackf16 for wider compatibility

* cleanup

* Remove deadlock condition in free_bufs

* Start work on removing parameter buffer pools

* Simplify and optimize further

* simplify profile futures

* Fix stride

* Try using a single command buffer per batch

* formatting

* Add parameters for different browsers in-flight submissions

* Update handling of batch size too

* Throttle ios as much as possible

* Increase timeout for llvm-pipe testing
2026-04-07 20:30:01 +03:00
Aman Gupta de1aa6fa73 CUDA: check for buffer overlap before fusing (#21566)
* CUDA: check for buffer overlap before fusing

* use ggml_cuda_check_fusion_memory_ranges
2026-04-08 00:57:04 +08:00
Aaron Teo 69c28f1547 llama-server: fix model params not propagated (#21509)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-04-07 21:39:41 +08:00
Son H. Nguyen 0d049d6a92 unicode : add custom Qwen2 regex handler to fix segfault on long input (#21257)
* unicode : add custom Qwen2 regex handler to fix segfault on long input

std::regex uses recursive backtracking internally, which causes a stack
overflow (segfault) when tokenizing long sequences of repeated characters
(e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in
the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to
the std::regex fallback path instead of using a custom handler.

Add unicode_regex_split_custom_qwen2() following the established pattern
used by gpt2, llama3, kimi_k2, and afmoe custom handlers.

Closes: https://github.com/ggml-org/llama.cpp/issues/21113

* cont : remove TODO comment

* cont : update comment to reflect original regex

* use the correct regex in the comment this time... [no ci]

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-04-07 16:13:38 +03:00
Johannes Gäßler a8ec0df461 llama: remove per-arch tensor name lists (#21531) 2026-04-07 15:02:03 +02:00
Georgi Gerganov e8f5082697 server : fix restore for checkpoints with pos_min == 0 (#21510) 2026-04-07 15:29:17 +03:00
Georgi Gerganov 22fc79134e ggml : deprecate GGML_OP_ADD1 (#21363)
* ggml : deprecate GGML_OP_ADD1

* cont : remove tests

* cont : re-enable vulkan check
2026-04-07 15:28:27 +03:00
Tom Overlund 2a619f6fbc ggml: Vulkan build, Linux -- output error string for errno on fork failure (#20868) (#20904) 2026-04-07 13:54:55 +02:00
mkoker edd4d9bca5 vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (#21029)
Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL
in the flash attention base shader. Register them in the shader
generator, pipeline creation, and enable in the scalar/coopmat1 FA
support check.
2026-04-07 13:41:29 +02:00
Aldehir Rojas 482192f12d webui : store reasoning_content so it is sent back in subsequent requests (#21249) 2026-04-07 13:32:44 +02:00
Antoine Viallon 71a81f6fcc ggml-cuda : fix CDNA2 compute capability constant for gfx90a (MI210) (#21519)
GGML_CUDA_CC_CDNA2 was set to 0x910
Fix by setting the constant to 0x90a to match the actual gfx90a ISA.
2026-04-07 12:18:55 +02:00
Aleksander Grygier ecce0087da fix: Detect streaming state in reasoning content blocks (#21549) 2026-04-07 12:04:41 +02:00
Kabir08 d1f82e382d Fix rtl text rendering (#21382)
* Fix Arabic RTL text rendering in web UI

- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Ensure bidirectional text support for mixed Arabic/English content

* Clean up commented duplicate function

Remove the commented-out duplicate transformMdastNode function
that was left over from refactoring.

* Fix Arabic RTL text rendering in web UI

- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Minor code formatting improvements

This ensures bidirectional text support for mixed Arabic/English content in the llama.cpp web UI.

* Implement rehype plugin for comprehensive RTL text support

- Add rehypeRtlSupport plugin that applies dir='auto' to all elements with children
- Replace DOMParser-based approach with efficient HAST tree processing
- Remove hardcoded element lists for better maintainability
- Ensure proper bidirectional text rendering for mixed RTL/LTR content

* Fix RTL text rendering with rehype plugin and cleanup

* fix: prettier formatting
2026-04-07 11:37:20 +02:00
PMZFX 0988accf82 [SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527)
Extend the existing reorder optimization to Q8_0. The reorder
separates scale factors from weight data for coalesced memory
access -- was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing.

On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x)
on Qwen3.5-27B. BW utilization: 21% -> 66%.

The key fix beyond the kernels: Q8_0 was missing from the type
check in ggml_backend_sycl_buffer_init_tensor() that allocates
the extra struct carrying the reorder flag -- so the optimization
was silently skipped.

AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.

Fixes: #21517
2026-04-07 16:12:49 +08:00
Dmytro Romanov 0033f53a07 docs: fix typo in build.md (emdawbwebgpu -> emdawnwebgpu) (#21518) 2026-04-07 12:37:26 +08:00
Masashi Yoshimura d0a6dfeb28 ggml-webgpu: Add the support of MUL_MAT_ID (#21147)
* Add mul_mat_id support to WebGPU

* Apply suggestion from @reeselevine

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-04-06 13:08:46 -07:00
Pasha Khosravi 2e1f0a889e ggml: add Q1_0 1-bit quantization support (CPU) (#21273)
* ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU)

* add generic fallback for x86

* remove Q1_0 (group size 32)

* rename Q1_0_g128 => Q1_0

* fix Q1_0 LlamaFileType Enum

* Fix trailing spaces; add generic fallback for othre backends

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix /r/n spacing + arch-fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-06 20:55:21 +02:00
Bipin Yadav 506200cf8b cli: fix stripping of \n in multiline input (#21485)
* llama-cli: fix stripping of \n in multiline input

* Change & string to string_view

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix EditorConfig linter error

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-06 20:54:06 +02:00
Gaurav Garg 15f786e658 [CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159)
* Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized and more optimized kernel for cases where nblocks_stream_k is multiple of ntiles_dst.
Make nblocks_stream_k to multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs

* Address review comments

* Address review comments

* Revert variable names to original
2026-04-06 20:34:29 +02:00
Aman Gupta 94ca829b60 llama-bench: add -fitc and -fitt to arguments (#21304)
* llama-bench: add `-fitc` and `-fitt` to arguments

* update README.md

* address review comments

* update compare-llama-bench.py
2026-04-06 22:26:02 +08:00
Aldehir Rojas 4aa962e2b0 vocab : add byte token handling to BPE detokenizer for Gemma4 (#21488) 2026-04-06 09:08:37 -05:00
Sigbjørn Skjæret 941146b3f1 convert : fix block_ff_dim retrieval for lfm2 (#21508) 2026-04-06 14:05:18 +02:00
lainon1 482d862bcb server : handle unsuccessful sink.write in chunked stream provider (#21478)
Check the return value of sink.write() in the chunked content provider
and return false when the write fails, matching cpp-httplib's own
streaming contract. This prevents logging chunks as sent when the sink
rejected them and properly aborts the stream on connection failure.
2026-04-06 14:03:02 +02:00
Xuan-Son Nguyen 3979f2bb08 docs: add hunyuan-ocr gguf, also add test [no ci] (#21490) 2026-04-06 14:02:37 +02:00
Georgi Gerganov 400ac8e194 convert : set "add bos" == True for Gemma 4 (#21500)
* convert : set "add bos" == True for Gemma 4

* cont : handle old GGUFs
2026-04-06 13:52:07 +03:00
Neo Zhang f51fd36d79 sycl : handle other FA case (#21377) 2026-04-06 13:28:00 +03:00
Yarden Tal 25eec6f327 hexagon: slight optimization for argosrt output init (#21463) 2026-04-05 18:30:25 -07:00
anchortense 58190cc84d llama : correct platform-independent loading of BOOL metadata (#21428)
* model-loader : fix GGUF bool array conversion

* model-loader : fix remaining GGUF bool pointer uses
2026-04-06 01:40:38 +02:00
Richard Davison af76639f72 model : add HunyuanOCR support (#21395)
* HunyuanOCR: add support for text and vision models

- Add HunyuanOCR vision projector (perceiver-based) with Conv2d merge
- Add separate HUNYUAN_OCR chat template (content-before-role format)
- Handle HunyuanOCR's invalid pad_token_id=-1 in converter
- Fix EOS/EOT token IDs from generation_config.json
- Support xdrope RoPE scaling type
- Add tensor mappings for perceiver projector (mm.before_rms, mm.after_rms, etc.)
- Register HunYuanVLForConditionalGeneration for both text and mmproj conversion

* fix proper mapping

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* address comments

* update

* Fix typecheck

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-05 23:32:14 +02:00
Ludovic Henry 761797ffdf ci : use default RISE RISC-V Runners (#21263) 2026-04-05 20:29:48 +02:00
ddh0 5d3a4a7da5 server : fix logging of build + system info (#21460)
This PR changes the logging that occurs at startup of llama-server.
Currently, it is redundant (including CPU information twice) and it is
missing the build + commit info.
2026-04-05 16:14:02 +02:00
M1DNYT3 c08d28d088 ci: lower cuda12 floor to 12.8.1 for broader host compatibility (#21438)
Co-authored-by: M1DNYT3 <m1dnyt3@MacBookPro.lan>
2026-04-05 09:04:00 +08:00
Nicholas Sparks 661e9acb36 ci: fix vulkan workflow referencing non-existent action (#21442) 2026-04-05 08:59:51 +08:00
Aldehir Rojas b8635075ff common : add gemma 4 specialized parser (#21418)
* common : add gemma4 dedicated parser

* cont : add '<|tool_response>' as eog

* cont : emit JSON from Gemma4 tool call AST

* cont : more fixes

* cont : refactor convert function

* cont : refine rules and mapping

* cont : add more tests

* cont : clean up

* cont : remove autoparser gemma4 implementation

* cont : more cleanup

* cont : rename gemma4.jinja to match the others

* cont : add custom template to support interleaved thinking

* cont : preserve reasoning in model turns

* cont : fix initializer error

* cont : fix unused vars

* cont : fix accidental static

* cont : fix specialized_template signature

* fix extra semicolon

* remove debug line and extra space [no ci]
2026-04-04 20:39:00 +02:00
Dan Hoffman 9c699074c9 server: Fix undefined timing measurement errors in server context (#21201)
Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
2026-04-04 22:11:19 +08:00
Adrien Gallouët d01f6274c0 common : respect specified tag, only fallback when tag is empty (#21413)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-04 15:08:03 +02:00
SamareshSingh 650bf14eb9 llama-model: read final_logit_softcapping for Gemma 4 (#21390) 2026-04-04 13:05:10 +02:00
Aman Gupta b7ad48ebda llama: add custom newline split for Gemma 4 (#21406) 2026-04-04 15:06:34 +08:00
Reese Levine d006858316 ggml-webgpu: move from parameter buffer pool to single buffer with offsets (#21278)
* Work towards removing bitcast

* Move rest of existing types over

* Add timeout back to wait and remove synchronous set_tensor/memset_tensor

* move to unpackf16 for wider compatibility

* cleanup

* Remove deadlock condition in free_bufs

* Start work on removing parameter buffer pools

* Simplify and optimize further

* simplify profile futures

* Fix stride

* Try using a single command buffer per batch

* formatting
2026-04-03 11:40:14 -07:00
Masato Nakasaka e439700992 ci: Add Windows Vulkan backend testing on Intel (#21292)
* experimenting CI

* Experimenting CI fix for MinGW

* experimenting CI on Windows

* modified script for integration with VisualStudio

* added proxy handling

* adding python version for Windows execution

* fix iterator::end() dereference

* fixed proxy handling

* Fix errors occurring on Windows

* fixed ci script

* Reverted to master

* Stripping test items to simplify Windows test

* adjusting script for windows testing

* Changed shell

* Fixed shell

* Fixed shell

* Fix CI setting

* Fix CI setting

* Fix CI setting

* Experimenting ci fix

* Experimenting ci fix

* Experimenting ci fix

* Experimenting ci fix

* experimenting fix for unit test error

* Changed to use BUILD_LOW_PERF to skip python tests

* Fix CI

* Added option to specify Ninja generator

* Reverted proxy related changes
2026-04-03 20:16:44 +03:00
Yes You Can Have Your Own 50e0ad08fb server: save and clear idle slots on new task (--clear-idle) (#20993)
* server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE)

* server: move idle slot KV clearing to slot release

The save "cost" is now paid by the finishing request.

* server: add --kv-clear-idle flag, enable by default

* server: skip clearing last idle slot, clear on launch

* server: test --no-kv-clear-idle flag

* server: simplify on-release clearing loop

* server: remove on-release KV clearing, keep launch-only

* cont : clean-up

* tests: update log strings after --clear-idle rename

* tests: use debug tags instead of log message matching

* test: fix Windows CI by dropping temp log file unlink

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-03 19:02:27 +02:00
Piotr Wilkin (ilintar) f1f793ad06 common/parser: fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers (#21230)
* Fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers

* Rename

* Update common/chat-auto-parser-generator.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-03 17:51:52 +02:00
174 changed files with 5697 additions and 15086 deletions
+18 -2
View File
@@ -73,10 +73,26 @@ android:
- changed-files:
- any-glob-to-any-file:
- examples/llama.android/**
server/webui:
- changed-files:
- all:
- any-glob-to-any-file:
- tools/server/webui/**
- tools/server/public/**
- all-globs-to-all-files:
- '!tools/server/webui/**'
- '!tools/server/public/**'
server:
- changed-files:
- any-glob-to-any-file:
- tools/server/**
- all:
- any-glob-to-any-file:
- tools/server/**
- all-globs-to-all-files:
- '!tools/server/webui/**'
- '!tools/server/public/**'
ggml:
- changed-files:
- any-glob-to-any-file:
+14 -24
View File
@@ -35,7 +35,7 @@ env:
jobs:
ubuntu-riscv64-native-sanitizer:
runs-on: RISCV64
runs-on: ubuntu-24.04-riscv
continue-on-error: true
@@ -50,17 +50,18 @@ jobs:
sudo apt-get update
# Install necessary packages
sudo apt-get install -y libatomic1 libtsan2 gcc-14 g++-14 rustup cmake build-essential wget ccache git-lfs
sudo apt-get install -y libatomic1 libtsan2 gcc-14 g++-14 cmake build-essential wget git-lfs
# Set gcc-14 and g++-14 as the default compilers
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100
sudo ln -sf /usr/bin/gcc-14 /usr/bin/gcc
sudo ln -sf /usr/bin/g++-14 /usr/bin/g++
# Install Rust stable version
rustup install stable
rustup default stable
if ! which rustc; then
# Install Rust stable version
sudo apt-get install -y rustup
rustup install stable
rustup default stable
fi
git lfs install
@@ -73,23 +74,12 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: Setup ccache
run: |
# Unique cache directory per matrix combination
export CCACHE_DIR="$HOME/.ccache/sanitizer-${{ matrix.sanitizer }}-${{ matrix.build_type }}"
mkdir -p "$CCACHE_DIR"
# Configure ccache
ccache --set-config=max_size=5G
ccache --set-config=compression=true
ccache --set-config=compression_level=6
ccache --set-config=cache_dir="$CCACHE_DIR"
ccache --set-config=sloppiness=file_macro,time_macros,include_file_mtime,include_file_ctime
ccache --set-config=hash_dir=false
# Export for subsequent steps
echo "CCACHE_DIR=$CCACHE_DIR" >> $GITHUB_ENV
echo "PATH=/usr/lib/ccache:$PATH" >> $GITHUB_ENV
# FIXME: Enable when ggml-org/ccache-action works on riscv64
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-riscv64-native-sanitizer-${{ matrix.sanytizer }}-${{ matrix.build_type }}
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
+21
View File
@@ -213,6 +213,27 @@ jobs:
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-win-intel-vulkan:
runs-on: [self-hosted, Windows, X64, Intel]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Test
id: ggml-ci
shell: C:\msys64\usr\bin\bash.exe --noprofile --norc -eo pipefail "{0}"
env:
MSYSTEM: UCRT64
CHERE_INVOKING: 1
PATH: C:\msys64\ucrt64\bin;C:\msys64\usr\bin;C:\Windows\System32;${{ env.PATH }}
run: |
vulkaninfo --summary
# Skip python related tests with GG_BUILD_LOW_PERF=1 since Windows MSYS2 UCRT64 currently fails to create
# a valid python environment for testing
LLAMA_FATAL_WARNINGS=OFF GG_BUILD_NINJA=1 GG_BUILD_VULKAN=1 GG_BUILD_LOW_PERF=1 ./ci/run.sh ./results/llama.cpp ./mnt/llama.cpp
ggml-ci-intel-openvino-gpu-low-perf:
runs-on: [self-hosted, Linux, Intel, OpenVINO]
+1 -1
View File
@@ -72,7 +72,7 @@ jobs:
- name: Setup Vulkan SDK
if: steps.cache-sdk.outputs.cache-hit != 'true'
uses: ./.github/actions/linux-setup-vulkan-llvmpipe
uses: ./.github/actions/linux-setup-vulkan
with:
path: ./vulkan_sdk
version: ${{ env.VULKAN_SDK_VERSION }}
+18 -29
View File
@@ -996,7 +996,7 @@ jobs:
cmake --build build -j ${env:NUMBER_OF_PROCESSORS}
ubuntu-cpu-riscv64-native:
runs-on: RISCV64
runs-on: ubuntu-24.04-riscv
steps:
- name: Install dependencies
@@ -1004,24 +1004,21 @@ jobs:
sudo apt-get update
# Install necessary packages
sudo apt-get install -y libatomic1 libtsan2 gcc-14 g++-14 rustup cmake build-essential libssl-dev wget ccache git-lfs
sudo apt-get install -y libatomic1 libtsan2 gcc-14 g++-14 cmake build-essential libssl-dev wget git-lfs
# Set gcc-14 and g++-14 as the default compilers
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100
sudo ln -sf /usr/bin/gcc-14 /usr/bin/gcc
sudo ln -sf /usr/bin/g++-14 /usr/bin/g++
# Install Rust stable version
rustup install stable
rustup default stable
if ! which rustc; then
# Install Rust stable version
sudo apt-get install -y rustup
rustup install stable
rustup default stable
fi
git lfs install
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Check environment
run: |
uname -a
@@ -1031,25 +1028,17 @@ jobs:
cmake --version
rustc --version
- name: Setup ccache
run: |
# Set unique cache directory for this job
export CCACHE_DIR="$HOME/.ccache/cpu-cmake-rv64-native"
mkdir -p "$CCACHE_DIR"
- name: Clone
id: checkout
uses: actions/checkout@v6
# Configure ccache for optimal performance
ccache --set-config=max_size=5G
ccache --set-config=compression=true
ccache --set-config=compression_level=6
ccache --set-config=cache_dir="$CCACHE_DIR"
# Enable more aggressive caching
ccache --set-config=sloppiness=file_macro,time_macros,include_file_mtime,include_file_ctime
ccache --set-config=hash_dir=false
# Export for subsequent steps
echo "CCACHE_DIR=$CCACHE_DIR" >> $GITHUB_ENV
echo "PATH=/usr/lib/ccache:$PATH" >> $GITHUB_ENV
# FIXME: Enable when ggml-org/ccache-action works on riscv64
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-cpu-riscv64-native
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
+2 -2
View File
@@ -73,8 +73,8 @@ jobs:
{ "tag": "cpu", "dockerfile": ".devops/cpu.Dockerfile", "platforms": "linux/amd64", "full": true, "light": true, "server": true, "free_disk_space": false, "runs_on": "ubuntu-24.04" },
{ "tag": "cpu", "dockerfile": ".devops/cpu.Dockerfile", "platforms": "linux/arm64", "full": true, "light": true, "server": true, "free_disk_space": false, "runs_on": "ubuntu-24.04-arm" },
{ "tag": "cpu", "dockerfile": ".devops/s390x.Dockerfile", "platforms": "linux/s390x", "full": true, "light": true, "server": true, "free_disk_space": false, "runs_on": "ubuntu-24.04-s390x" },
{ "tag": "cuda cuda12", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "12.9.1", "platforms": "linux/amd64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04" },
{ "tag": "cuda cuda12", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "12.9.1", "platforms": "linux/arm64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04-arm" },
{ "tag": "cuda cuda12", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "12.8.1", "platforms": "linux/amd64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04" },
{ "tag": "cuda cuda12", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "12.8.1", "platforms": "linux/arm64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04-arm" },
{ "tag": "cuda13", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "13.1.1", "platforms": "linux/amd64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04" },
{ "tag": "cuda13", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "13.1.1", "platforms": "linux/arm64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04-arm" },
{ "tag": "musa", "dockerfile": ".devops/musa.Dockerfile", "platforms": "linux/amd64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04" },
+27 -59
View File
@@ -36,8 +36,26 @@ env:
CMAKE_ARGS: "-DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON"
jobs:
macOS-arm64:
runs-on: macos-14
macOS-cpu:
strategy:
matrix:
include:
- build: 'arm64'
arch: 'arm64'
os: macos-14
defines: "-DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON"
- build: 'arm64-kleidiai'
arch: 'arm64'
os: macos-14
defines: "-DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON -DGGML_CPU_KLEIDIAI=ON"
- build: 'x64'
arch: 'x64'
os: macos-15-intel
# Metal is disabled on x64 due to intermittent failures with Github runners not having a GPU:
# https://github.com/ggml-org/llama.cpp/actions/runs/8635935781/job/23674807267#step:5:2313
defines: "-DGGML_METAL=OFF -DCMAKE_OSX_DEPLOYMENT_TARGET=13.3"
runs-on: ${{ matrix.os }}
steps:
- name: Clone
@@ -49,7 +67,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-arm64
key: macOS-latest-${{ matrix.arch }}
evict-old-files: 1d
- name: Build
@@ -57,13 +75,11 @@ jobs:
run: |
sysctl -a
cmake -B build \
${{ matrix.defines }} \
-DCMAKE_INSTALL_RPATH='@loader_path' \
-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_METAL_USE_BF16=ON \
-DGGML_METAL_EMBED_LIBRARY=ON \
-DGGML_RPC=ON \
${{ env.CMAKE_ARGS }}
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
@@ -75,61 +91,13 @@ jobs:
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.tar.gz -s ",./,llama-${{ steps.tag.outputs.name }}/," -C ./build/bin .
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-macos-${{ matrix.build }}.tar.gz -s ",./,llama-${{ steps.tag.outputs.name }}/," -C ./build/bin .
- name: Upload artifacts
uses: actions/upload-artifact@v6
with:
path: llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.tar.gz
name: llama-bin-macos-arm64.tar.gz
macOS-x64:
runs-on: macos-15-intel
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-x64
evict-old-files: 1d
- name: Build
id: cmake_build
run: |
sysctl -a
# Metal is disabled due to intermittent failures with Github runners not having a GPU:
# https://github.com/ggml-org/llama.cpp/actions/runs/8635935781/job/23674807267#step:5:2313
cmake -B build \
-DCMAKE_INSTALL_RPATH='@loader_path' \
-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_METAL=OFF \
-DGGML_RPC=ON \
-DCMAKE_OSX_DEPLOYMENT_TARGET=13.3
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
- name: Pack artifacts
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-macos-x64.tar.gz -s ",./,llama-${{ steps.tag.outputs.name }}/," -C ./build/bin .
- name: Upload artifacts
uses: actions/upload-artifact@v6
with:
path: llama-${{ steps.tag.outputs.name }}-bin-macos-x64.tar.gz
name: llama-bin-macos-x64.tar.gz
path: llama-${{ steps.tag.outputs.name }}-bin-macos-${{ matrix.build }}.tar.gz
name: llama-bin-macos-${{ matrix.build }}.tar.gz
ubuntu-cpu:
strategy:
@@ -1003,8 +971,7 @@ jobs:
- ubuntu-cpu
- ubuntu-vulkan
- ubuntu-24-openvino
- macOS-arm64
- macOS-x64
- macOS-cpu
- ios-xcode-build
- openEuler-cann
@@ -1079,6 +1046,7 @@ jobs:
**macOS/iOS:**
- [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.tar.gz)
- [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-macos-arm64-kleidiai.tar.gz)
- [macOS Intel (x64)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-macos-x64.tar.gz)
- [iOS XCFramework](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-xcframework.zip)
+5
View File
@@ -119,6 +119,11 @@ if [ ! -z ${GG_BUILD_VULKAN} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=OFF -DGGML_BLAS=OFF"
fi
# Build shared libs on Windows
# to reduce binary size and avoid errors in library loading unit tests
if uname -s | grep -qi nt; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DBUILD_SHARED_LIBS=ON"
fi
fi
if [ ! -z ${GG_BUILD_WEBGPU} ]; then
+8
View File
@@ -1311,6 +1311,14 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.kv_unified = value;
}
).set_env("LLAMA_ARG_KV_UNIFIED").set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_PERPLEXITY, LLAMA_EXAMPLE_BATCHED, LLAMA_EXAMPLE_BENCH, LLAMA_EXAMPLE_PARALLEL}));
add_opt(common_arg(
{"--clear-idle"},
{"--no-clear-idle"},
"save and clear idle slots on new task (default: enabled, requires unified KV and cache-ram)",
[](common_params & params, bool value) {
params.clear_idle = value;
}
).set_env("LLAMA_ARG_CLEAR_IDLE").set_examples({LLAMA_EXAMPLE_SERVER}));
add_opt(common_arg(
{"--context-shift"},
{"--no-context-shift"},
+75 -290
View File
@@ -6,110 +6,13 @@
#include "json-schema-to-grammar.h"
#include "log.h"
#include "nlohmann/json.hpp"
#include "peg-parser.h"
#include <algorithm>
#include <stdexcept>
#include <string>
using json = nlohmann::ordered_json;
namespace {
// Gemma4-specific PEG builder extending the standard chat builder.
// Adds value type parsers that use <|\"|> as string delimiters
// instead of JSON's double quotes, and disables json-to-schema
// conversion for these types.
class common_peg_gemma4_builder {
common_chat_peg_builder & p_;
static constexpr const char * QUOTE = "<|\"|>";
public:
explicit common_peg_gemma4_builder(common_chat_peg_builder & p) : p_(p) {}
common_peg_parser gemma4_string() {
return p_.rule("gemma4-string", [&]() {
return p_.literal(QUOTE) + p_.until(QUOTE) + p_.literal(QUOTE);
});
}
common_peg_parser gemma4_number() {
return p_.rule("gemma4-number", [&]() {
auto digit1_9 = p_.chars("[1-9]", 1, 1);
auto digits = p_.chars("[0-9]");
auto int_part = p_.choice({p_.literal("0"), p_.sequence({digit1_9, p_.chars("[0-9]", 0, -1)})});
auto frac = p_.sequence({p_.literal("."), digits});
auto exp = p_.sequence({p_.choice({p_.literal("e"), p_.literal("E")}),
p_.optional(p_.chars("[+-]", 1, 1)), digits});
auto not_number_continuation = p_.negate(p_.chars("[0-9.eE+-]", 1, 1));
return p_.sequence({p_.optional(p_.literal("-")), int_part, p_.optional(frac),
p_.optional(exp), not_number_continuation});
});
}
common_peg_parser gemma4_bool() {
return p_.rule("gemma4-bool", [&]() {
return p_.choice({p_.literal("true"), p_.literal("false")});
});
}
common_peg_parser gemma4_null() {
return p_.rule("gemma4-null", [&]() {
return p_.literal("null");
});
}
common_peg_parser gemma4_dict() {
return p_.rule("gemma4-dict", [&]() {
auto ws = p_.space();
auto key = p_.until(":");
auto member = p_.sequence({key, p_.literal(":"), ws, gemma4_value()});
auto members = p_.sequence({member, p_.zero_or_more(p_.sequence({p_.literal(","), ws, member}))});
return p_.sequence({
p_.literal("{"), ws,
p_.choice({p_.literal("}"), p_.sequence({members, ws, p_.literal("}")})})
});
});
}
common_peg_parser gemma4_array() {
return p_.rule("gemma4-array", [&]() {
auto ws = p_.space();
auto elements = p_.sequence({gemma4_value(), p_.zero_or_more(p_.sequence({p_.literal(","), ws, gemma4_value()}))});
return p_.sequence({
p_.literal("["), ws,
p_.choice({p_.literal("]"), p_.sequence({elements, ws, p_.literal("]")})})
});
});
}
common_peg_parser gemma4_value() {
return p_.rule("gemma4-value", [&]() {
return p_.choice({gemma4_string(), gemma4_dict(), gemma4_array(),
gemma4_number(), gemma4_bool(), gemma4_null()});
});
}
// Select the appropriate value parser based on JSON schema type.
// Does NOT use schema() - the gemma4 types are pure PEG without
// JSON schema metadata, so GBNF is generated directly from the
// PEG structure.
common_peg_parser gemma4_value_for_type(const json & schema) {
if (!schema.contains("type") || !schema.at("type").is_string()) {
return gemma4_value();
}
std::string type = schema.at("type").get<std::string>();
if (type == "string") { return gemma4_string(); }
if (type == "number") { return gemma4_number(); }
if (type == "integer") { return gemma4_number(); }
if (type == "boolean") { return gemma4_bool(); }
if (type == "object") { return gemma4_dict(); }
if (type == "array") { return gemma4_array(); }
return gemma4_value();
}
};
} // anonymous namespace
// Helper to iterate over tools/functions
static void foreach_function(const json & tools, const std::function<void(const json &)> & fn) {
for (const auto & tool : tools) {
@@ -141,9 +44,7 @@ common_chat_params peg_generator::generate_parser(const common_chat_template &
// Create the result structure
common_chat_params data;
data.prompt = common_chat_template_direct_apply(tmpl, inputs);
data.format = (autoparser.tools.format.mode == tool_format::TAG_WITH_GEMMA4_DICT)
? COMMON_CHAT_FORMAT_PEG_GEMMA4
: COMMON_CHAT_FORMAT_PEG_NATIVE;
data.format = COMMON_CHAT_FORMAT_PEG_NATIVE;
data.preserved_tokens = autoparser.preserved_tokens;
auto parser = autoparser.build_parser(inputs);
@@ -270,8 +171,6 @@ common_peg_parser analyze_tools::build_parser(parser_build_context & ctx) const
return build_tool_parser_tag_json(ctx);
case tool_format::TAG_WITH_TAGGED:
return build_tool_parser_tag_tagged(ctx);
case tool_format::TAG_WITH_GEMMA4_DICT:
return build_tool_parser_tag_gemma4_dict(ctx);
default:
LOG_ERR("[ERROR] Template seems to support tool calls, but failed to determine tool format. Tool calling will not work properly. "
"Check for a fixed template for your model in the models/templates directory of your llama.cpp installation or "
@@ -317,6 +216,44 @@ common_peg_parser analyze_tools::build_tool_parser_json_native(parser_build_cont
p.end();
}
common_peg_parser analyze_tools::build_func_parser(common_chat_peg_builder & p, const std::string & name,
const common_peg_parser & call_id_section, bool have_call_id,
const common_peg_parser & args,
std::optional<common_peg_parser> atomic_peek) const {
auto open = p.tool_open(function.name_prefix + p.tool_name(p.literal(name)) + function.name_suffix);
bool matched_atomic = false;
common_peg_parser func_parser = p.eps();
if (!function.name_suffix.empty()) {
func_parser = open + call_id_section + p.space() + args;
matched_atomic = true;
} else if (have_call_id) {
func_parser = p.atomic(open + call_id_section) + p.space() + args;
matched_atomic = true;
} else if (atomic_peek.has_value()) {
func_parser = p.atomic(open + call_id_section + p.space() + *atomic_peek) + args;
matched_atomic = true;
} else {
func_parser = open + call_id_section + p.space() + args;
}
if (!function.close.empty()) {
func_parser = func_parser + p.space() + p.tool_close(p.literal(function.close));
} else if (!format.per_call_end.empty()) {
// When there's no func_close but there is a per_call_end marker, use peek() to ensure
// we only emit tool_close when we can actually see the closing marker. This prevents
// premature closing during partial parsing when we've seen e.g. "</" which could be
// either "</tool_call>" (end) or "<arg_key>" prefix that failed to match.
func_parser = func_parser + p.tool_close(p.peek(p.literal(format.per_call_end)));
} else {
func_parser = func_parser + p.tool_close(p.space()); // force this to process tool closing callbacks in mapper
}
if (!matched_atomic) {
func_parser = p.atomic(func_parser);
}
return func_parser;
}
common_peg_parser analyze_tools::build_tool_parser_tag_json(parser_build_context & ctx) const {
auto & p = ctx.p;
const auto & inputs = ctx.inputs;
@@ -330,17 +267,27 @@ common_peg_parser analyze_tools::build_tool_parser_tag_json(parser_build_context
const auto & schema = func.contains("parameters") ? func.at("parameters") : json::object();
// Build call_id parser based on position (if supported)
bool have_call_id = false;
common_peg_parser call_id_section = p.eps();
if (call_id.pos == call_id_position::BETWEEN_FUNC_AND_ARGS && !call_id.prefix.empty() &&
!call_id.suffix.empty()) {
call_id_section = p.optional(call_id.prefix + p.tool_id(p.until(call_id.suffix))) + call_id.suffix;
(!call_id.suffix.empty() || !arguments.start.empty())) {
if (!call_id.suffix.empty()) {
call_id_section = p.optional(call_id.prefix + p.tool_id(p.until(call_id.suffix))) + call_id.suffix;
} else {
call_id_section = p.optional(call_id.prefix + p.tool_id(p.until(arguments.start)));
}
have_call_id = true;
}
auto args_parser = p.tool_args(p.schema(p.json(), "tool-" + name + "-schema", schema));
if (!arguments.start.empty()) {
args_parser = p.literal(arguments.start) + args_parser;
}
if (!arguments.end.empty()) {
args_parser = args_parser + p.literal(arguments.end);
}
auto func_parser = p.tool_open(function.name_prefix + p.tool_name(p.literal(name)) + function.name_suffix) +
call_id_section + p.tool_args(p.schema(p.json(), "tool-" + name + "-schema", schema));
if (!function.close.empty()) {
func_parser = func_parser + function.close;
}
auto atomic_peek = !arguments.start.empty() ? std::optional(p.peek(p.literal(arguments.start))) : std::nullopt;
auto func_parser = build_func_parser(p, name, call_id_section, have_call_id, args_parser, atomic_peek);
tool_choice |= p.rule("tool-" + name, func_parser);
});
@@ -470,52 +417,31 @@ common_peg_parser analyze_tools::build_tool_parser_tag_tagged(parser_build_conte
args_seq = args_seq + p.repeat(p.space() + any_opt, 0, (int) optional_parsers.size());
}
if (!arguments.start.empty()) {
args_seq = p.literal(arguments.start) + args_seq;
}
if (!arguments.end.empty()) {
args_seq = args_seq + p.literal(arguments.end);
}
// Build call_id parser based on position (if supported)
common_peg_parser call_id_section = p.eps();
bool have_call_id = false;
if (call_id.pos == call_id_position::BETWEEN_FUNC_AND_ARGS && !call_id.prefix.empty() &&
!call_id.suffix.empty()) {
(!call_id.suffix.empty() || !arguments.start.empty())) {
have_call_id = true;
call_id_section = p.optional(call_id.prefix + p.tool_id(p.until(call_id.suffix)) + call_id.suffix);
}
bool matched_atomic = false;
common_peg_parser func_parser = p.eps();
if (!function.name_suffix.empty()) {
func_parser = p.tool_open(function.name_prefix + p.tool_name(p.literal(name)) + function.name_suffix) +
call_id_section + p.space() + args_seq;
matched_atomic = true;
} else if (have_call_id) {
func_parser = p.atomic(p.tool_open(function.name_prefix + p.tool_name(p.literal(name)) + function.name_suffix) +
call_id_section) + p.space() + args_seq;
matched_atomic = true;
} else if (!arguments.name_prefix.empty() && !required_parsers.empty()) {
// Only peek for an arg tag when there are required args that must follow.
// When all args are optional, the model may emit no arg tags at all (#20650).
func_parser = p.atomic(p.tool_open(function.name_prefix + p.tool_name(p.literal(name)) + function.name_suffix) +
call_id_section + p.space() + p.peek(p.literal(arguments.name_prefix))) + args_seq;
matched_atomic = true;
} else {
func_parser = p.tool_open(function.name_prefix + p.tool_name(p.literal(name)) + function.name_suffix) +
call_id_section + p.space() + args_seq;
}
if (!function.close.empty()) {
func_parser = func_parser + p.space() + p.tool_close(p.literal(function.close));
} else if (!format.per_call_end.empty()) {
// When there's no func_close but there is a per_call_end marker, use peek() to ensure
// we only emit tool_close when we can actually see the closing marker. This prevents
// premature closing during partial parsing when we've seen e.g. "</" which could be
// either "</tool_call>" (end) or "<arg_key>" prefix that failed to match.
func_parser = func_parser + p.tool_close(p.peek(p.literal(format.per_call_end)));
} else {
func_parser =
func_parser + p.tool_close(p.space()); // force this to process tool closing callbacks in mapper
}
if (!matched_atomic) {
func_parser = p.atomic(func_parser);
if (!call_id.suffix.empty()) {
call_id_section = p.optional(call_id.prefix + p.tool_id(p.until(call_id.suffix)) + call_id.suffix);
} else {
call_id_section = p.optional(call_id.prefix + p.tool_id(p.until(arguments.start)));
}
}
// Only peek for an arg tag when there are required args that must follow.
// When all args are optional, the model may emit no arg tags at all (#20650).
auto atomic_peek = (!arguments.name_prefix.empty() && !required_parsers.empty()) ?
std::optional(p.peek(p.literal(arguments.name_prefix))) : std::nullopt;
auto func_parser = build_func_parser(p, name, call_id_section, have_call_id, args_seq, atomic_peek);
tool_choice |= p.rule("tool-" + name, func_parser);
});
@@ -558,145 +484,4 @@ common_peg_parser analyze_tools::build_tool_parser_tag_tagged(parser_build_conte
p.end();
}
common_peg_parser analyze_tools::build_tool_parser_tag_gemma4_dict(parser_build_context & ctx) const {
auto & p = ctx.p;
const auto & inputs = ctx.inputs;
bool force_tools = inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED;
common_peg_gemma4_builder g4(p);
static const std::string QUOTE = "<|\"|>";
common_peg_parser tool_choice = p.choice();
foreach_function(inputs.tools, [&](const json & tool) {
const auto & func = tool.at("function");
std::string name = func.at("name");
const auto & params = func.at("parameters");
if (!params.contains("properties") || !params.at("properties").is_object()) {
auto func_parser = p.atomic(
p.tool_open(p.literal(function.name_prefix) + p.tool_name(p.literal(name)) + p.literal("{")) +
p.tool_args(p.eps()) +
p.tool_close(p.literal("}")));
tool_choice |= p.rule("tool-" + name, func_parser);
return;
}
const auto & properties = params.at("properties");
std::set<std::string> required;
if (params.contains("required") && params.at("required").is_array()) {
params.at("required").get_to(required);
}
// Build per-argument parsers, sorted alphabetically (matching template's dictsort)
struct arg_entry {
std::string param_name;
common_peg_parser parser;
};
std::vector<arg_entry> arg_entries;
for (const auto & [param_name, param_schema] : properties.items()) {
std::string type = "object";
if (param_schema.contains("type")) {
const auto & type_v = param_schema.at("type");
if (type_v.is_string()) {
type_v.get_to(type);
} else if (type_v.is_array()) {
// Handle nullable types like ["string", "null"]
for (const auto & t : type_v) {
if (t.is_string() && t.get<std::string>() != "null") {
type = t.get<std::string>();
break;
}
}
}
}
// Infer string type from enum values when type is unspecified
if (type == "object" && param_schema.contains("enum")) {
const auto & enum_vals = param_schema.at("enum");
if (enum_vals.is_array()) {
for (const auto & v : enum_vals) {
if (v.is_string()) {
type = "string";
break;
}
}
}
}
common_peg_parser value_parser = p.eps();
if (type == "string") {
// String values are delimited by <|"|>...<|"|>
value_parser =
p.literal(QUOTE) +
p.tool_arg_string_value(p.schema(p.until(QUOTE),
"tool-" + name + "-arg-" + param_name + "-schema", param_schema, true)) +
p.literal(QUOTE);
} else if (type == "number" || type == "integer") {
value_parser = p.tool_arg_value(g4.gemma4_number());
} else if (type == "boolean") {
value_parser = p.tool_arg_value(g4.gemma4_bool());
} else if (type == "null") {
value_parser = p.tool_arg_value(g4.gemma4_null());
} else if (type == "object") {
value_parser = p.tool_arg_value(g4.gemma4_dict());
} else if (type == "array") {
value_parser = p.tool_arg_value(g4.gemma4_array());
} else {
value_parser = p.tool_arg_value(g4.gemma4_value());
}
auto arg = p.tool_arg(
p.tool_arg_open(p.tool_arg_name(p.literal(param_name)) + p.literal(":")) +
value_parser +
p.tool_arg_close(p.eps()));
arg_entries.push_back({param_name, p.rule("tool-" + name + "-arg-" + param_name, arg)});
}
// Sort alphabetically to match Jinja's dictsort
std::sort(arg_entries.begin(), arg_entries.end(), [](const auto & a, const auto & b) {
return a.param_name < b.param_name;
});
// Build arg sequence: any arg, then zero-or-more comma-separated additional args
common_peg_parser args_seq = p.eps();
if (!arg_entries.empty()) {
common_peg_parser any_arg = p.choice();
for (auto & entry : arg_entries) {
any_arg |= entry.parser;
}
args_seq = p.optional(
any_arg + p.repeat(p.literal(",") + any_arg, 0, (int) arg_entries.size() - 1));
}
// Full parser: call:name{args}
auto func_parser = p.atomic(
p.tool_open(p.literal(function.name_prefix) + p.tool_name(p.literal(name)) + p.literal("{")) +
p.tool_args(args_seq) +
p.tool_close(p.literal("}")));
tool_choice |= p.rule("tool-" + name, func_parser);
});
// Wrap each call in <|tool_call>...</tool_call|>
auto wrapped_call = p.literal(format.per_call_start) + tool_choice + p.literal(format.per_call_end);
common_peg_parser tool_calls = p.eps();
if (inputs.parallel_tool_calls) {
tool_calls = p.trigger_rule("tool-call", wrapped_call + p.zero_or_more(p.space() + wrapped_call));
} else {
tool_calls = p.trigger_rule("tool-call", wrapped_call);
}
if (!force_tools) {
tool_calls = p.optional(tool_calls);
}
auto content_before_tools = p.until_one_of({ format.per_call_start, ctx.reasoning->start });
return ctx.reasoning_parser +
(force_tools ? p.eps() : p.optional(p.content(content_before_tools) + p.optional(ctx.reasoning_parser))) +
tool_calls + p.end();
}
} // namespace autoparser
+7 -4
View File
@@ -145,7 +145,6 @@ enum class tool_format {
JSON_NATIVE, // Pure JSON: {"name": "X", "arguments": {...}}
TAG_WITH_JSON, // Tag-based with JSON args: <function=X>{...}</function>
TAG_WITH_TAGGED, // Tag-based with tagged args: <param=key>value</param>
TAG_WITH_GEMMA4_DICT, // Gemma4 custom dict: <|tool_call>call:name{key:<|"|>val<|"|>}<tool_call|>
};
inline std::ostream & operator<<(std::ostream & os, const tool_format & format) {
@@ -158,8 +157,6 @@ inline std::ostream & operator<<(std::ostream & os, const tool_format & format)
return os << "TAG_WITH_JSON";
case tool_format::TAG_WITH_TAGGED:
return os << "TAG_WITH_TAGGED";
case tool_format::TAG_WITH_GEMMA4_DICT:
return os << "TAG_WITH_GEMMA4_DICT";
default:
return os << "UNKNOWN";
}
@@ -356,7 +353,13 @@ struct analyze_tools : analyze_base {
common_peg_parser build_tool_parser_json_native(parser_build_context & ctx) const;
common_peg_parser build_tool_parser_tag_json(parser_build_context & ctx) const;
common_peg_parser build_tool_parser_tag_tagged(parser_build_context & ctx) const;
common_peg_parser build_tool_parser_tag_gemma4_dict(parser_build_context & ctx) const;
// Shared helper: builds func_parser from open+call_id+args, handling atomic wrapping and close.
// atomic_peek: if present, used as the peek expression in the third atomicity branch.
common_peg_parser build_func_parser(common_chat_peg_builder & p, const std::string & name,
const common_peg_parser & call_id_section, bool have_call_id,
const common_peg_parser & args,
std::optional<common_peg_parser> atomic_peek) const;
};
// ============================================================================
+35 -41
View File
@@ -25,6 +25,9 @@ static const std::string ARG_SECOND = "BB_ARG_SND_BB";
static const std::string USER_MSG = "U_USER_MSG Hello END_U";
static const std::string ASSISTANT_MSG = "A_ASST_MSG I can help END_A";
static const std::string THINKING_CONTENT = "REASON_PART I am thinking END_R";
static const std::string CALL_ID_001 = "call00001";
static const std::string CALL_ID_002 = "call00002";
static const std::string CALL_ID_999 = "call99999";
static std::vector<std::function<void(const common_chat_template & tmpl, autoparser &)>> workarounds(
{ // Old reasoning Qwen templates - they don't really display reasoning content, but we still want to
@@ -92,34 +95,6 @@ static std::vector<std::function<void(const common_chat_template & tmpl, autopar
LOG_DBG(ANSI_ORANGE "[Patch: Functionary 3.1]\n" ANSI_RESET);
}
},
// Gemma4 - custom dict format: <|tool_call>call:name{key:<|"|>val<|"|>}<tool_call|>
[](const common_chat_template & tmpl, autoparser & analysis) -> void {
if (tmpl.src.find("'<|tool_call>call:'") != std::string::npos) {
analysis.tools.format.mode = tool_format::TAG_WITH_GEMMA4_DICT;
analysis.tools.format.per_call_start = "<|tool_call>";
analysis.tools.format.per_call_end = "<tool_call|>";
analysis.tools.format.section_start = "";
analysis.tools.format.section_end = "";
analysis.tools.function.name_prefix = "call:";
analysis.tools.function.name_suffix = "";
analysis.tools.arguments.start = "{";
analysis.tools.arguments.end = "}";
analysis.tools.arguments.name_prefix = "";
analysis.tools.arguments.name_suffix = ":";
analysis.tools.arguments.separator = ",";
analysis.reasoning.mode = reasoning_mode::TAG_BASED;
analysis.reasoning.start = "<|channel>thought";
analysis.reasoning.end = "<channel|>";
analysis.preserved_tokens.clear();
analysis.preserved_tokens.push_back("<|tool_call>");
analysis.preserved_tokens.push_back("<tool_call|>");
analysis.preserved_tokens.push_back("<|tool_response>");
analysis.preserved_tokens.push_back("<tool_response|>");
analysis.preserved_tokens.push_back("<|\"|>");
analysis.preserved_tokens.push_back("<|turn>");
LOG_DBG(ANSI_ORANGE "[Patch: Gemma4]\n" ANSI_RESET);
}
},
// DeepSeek-R1-Distill-Qwen
[](const common_chat_template & tmpl, autoparser & analysis) -> void {
if (tmpl.src.find(
@@ -131,6 +106,7 @@ static std::vector<std::function<void(const common_chat_template & tmpl, autopar
analysis.tools.function.name_prefix = "<tool▁sep>";
analysis.tools.format.per_call_end = "<tool▁call▁end>";
analysis.tools.function.close = "```";
LOG_DBG(ANSI_ORANGE "[Patch: DeepSeek-R1-Distill-Qwen]\n" ANSI_RESET);
}
}
});
@@ -158,7 +134,7 @@ static json user_msg = json{
{ "content", USER_MSG }
};
static json build_tool_call(const std::string & name, const json & args, const std::string & id = "call00001") {
static json build_tool_call(const std::string & name, const json & args, const std::string & id = CALL_ID_001) {
return json{
{ "id", id },
{ "type", "function" },
@@ -166,17 +142,17 @@ static json build_tool_call(const std::string & name, const json & args, const s
};
}
static json first_tool_call_zero_args = build_tool_call(FUN_FIRST, json::object(), "call00001");
static json first_tool_call_one_arg = build_tool_call(FUN_FIRST, {{ ARG_FIRST, "XXXX" }}, "call00001");
static json first_tool_call_one_arg_other_val = build_tool_call(FUN_FIRST, {{ ARG_FIRST, "YYYY" }}, "call00001");
static json first_tool_call_other_arg = build_tool_call(FUN_FIRST, {{ ARG_SECOND, "YYYY" }}, "call00001");
static json first_tool_call_zero_args = build_tool_call(FUN_FIRST, json::object(), CALL_ID_001);
static json first_tool_call_one_arg = build_tool_call(FUN_FIRST, {{ ARG_FIRST, "XXXX" }}, CALL_ID_001);
static json first_tool_call_one_arg_other_val = build_tool_call(FUN_FIRST, {{ ARG_FIRST, "YYYY" }}, CALL_ID_001);
static json first_tool_call_other_arg = build_tool_call(FUN_FIRST, {{ ARG_SECOND, "YYYY" }}, CALL_ID_001);
static json first_tool_call =
build_tool_call(FUN_FIRST, json{{ ARG_FIRST, "XXXX" }, { ARG_SECOND, "YYYY" }}, "call00001");
build_tool_call(FUN_FIRST, json{{ ARG_FIRST, "XXXX" }, { ARG_SECOND, "YYYY" }}, CALL_ID_001);
static json second_tool_call =
build_tool_call(FUN_SECOND, json{ { ARG_FIRST, "XXXX" }, { ARG_SECOND, "YYYY" }}, "call00002");
build_tool_call(FUN_SECOND, json{ { ARG_FIRST, "XXXX" }, { ARG_SECOND, "YYYY" }}, CALL_ID_002);
static json first_tool_call_alt_id =
build_tool_call(FUN_FIRST, json{{ ARG_FIRST, "XXXX" }, { ARG_SECOND, "YYYY" }}, "call99999");
build_tool_call(FUN_FIRST, json{{ ARG_FIRST, "XXXX" }, { ARG_SECOND, "YYYY" }}, CALL_ID_999);
template <typename T>
static std::string mode_to_str(T mode) {
@@ -215,6 +191,11 @@ void autoparser::analyze_template(const common_chat_template & tmpl) {
LOG_DBG("func_name_prefix: '%s'\n", tools.function.name_prefix.c_str());
LOG_DBG("func_name_suffix: '%s'\n", tools.function.name_suffix.c_str());
LOG_DBG("func_close: '%s'\n", tools.function.close.c_str());
LOG_DBG("call_id_prefix: '%s'\n", tools.call_id.prefix.c_str());
LOG_DBG("call_id_suffix: '%s'\n", tools.call_id.suffix.c_str());
LOG_DBG("call_id_pos: '%s'\n", mode_to_str(tools.call_id.pos).c_str());
LOG_DBG("args_start: '%s'\n", tools.arguments.start.c_str());
LOG_DBG("args_end: '%s'\n", tools.arguments.end.c_str());
LOG_DBG("arg_name_prefix: '%s'\n", tools.arguments.name_prefix.c_str());
LOG_DBG("arg_name_suffix: '%s'\n", tools.arguments.name_suffix.c_str());
LOG_DBG("arg_value_prefix: '%s'\n", tools.arguments.value_prefix.c_str());
@@ -583,12 +564,15 @@ analyze_tools::analyze_tools(const common_chat_template & tmpl,
if (caps.supports_parallel_tool_calls) {
check_per_call_markers();
}
LOG_DBG(ANSI_ORANGE "Phase 3a: Function call analysis\n" ANSI_RESET);
extract_function_markers();
LOG_DBG(ANSI_ORANGE "Phase 3b: Argument analysis\n" ANSI_RESET);
if (format.mode == tool_format::TAG_WITH_TAGGED) {
analyze_arguments();
}
extract_argument_separator();
extract_args_markers();
LOG_DBG(ANSI_ORANGE "Phase 3c: Call id analysis\n" ANSI_RESET);
extract_call_id_markers();
}
}
@@ -979,8 +963,6 @@ void analyze_tools::extract_function_markers() {
}
void analyze_tools::analyze_arguments() {
LOG_DBG(ANSI_ORANGE "Phase 4: Argument analysis\n" ANSI_RESET);
extract_argument_name_markers();
extract_argument_value_markers();
}
@@ -1189,7 +1171,7 @@ void analyze_tools::extract_args_markers() {
const auto & diff = comparison->diff;
if (format.mode != tool_format::JSON_NATIVE) {
if (format.mode == tool_format::JSON_NATIVE) {
std::string prefix_marker = !format.section_start.empty() ? format.section_start : format.per_call_start;
std::string suffix_marker = !format.section_end.empty() ? format.section_end : format.per_call_end;
// these might happen earlier in the tools section as an example or somewhere else, so we need to find the closest ones
@@ -1211,6 +1193,10 @@ void analyze_tools::extract_args_markers() {
if (find_fun != std::string::npos) {
args_start = args_start.substr(find_fun + FUN_FIRST.size(), args_start.size() - find_fun - FUN_FIRST.size());
}
size_t find_call_id = args_start.find(CALL_ID_001);
if (find_call_id != std::string::npos) {
args_start = args_start.substr(find_call_id + CALL_ID_001.size(), args_start.size() - find_call_id - CALL_ID_001.size());
}
arguments.start = args_start;
arguments.end = args_end;
}
@@ -1250,8 +1236,8 @@ void analyze_tools::extract_call_id_markers() {
return;
}
std::string id_value_1 = "call00001";
std::string id_value_2 = "call99999";
std::string id_value_1 = CALL_ID_001;
std::string id_value_2 = CALL_ID_999;
size_t common_id_prefix_len = 0;
for (size_t i = 0; i < std::min(id_value_1.length(), id_value_2.length()); i++) {
@@ -1350,6 +1336,14 @@ void analyze_tools::extract_call_id_markers() {
call_id.suffix = find_first_marker(before_func);
}
if (call_id.prefix == arguments.end) {
call_id.prefix = "";
}
if (call_id.suffix == arguments.start) {
call_id.suffix = "";
}
// When call_id is detected, per_call_end may have been incorrectly set to include
// the call_id_suffix and sample args. Clear it if it starts with call_id_suffix.
if (call_id.pos != call_id_position::NONE && !call_id.suffix.empty() &&
+140 -82
View File
@@ -75,84 +75,6 @@ static std::string escape_json_string_inner(const std::string & s) {
return escaped;
}
static const std::string GEMMA4_QUOTE = "<|\"|>";
static std::string normalize_gemma4_to_json(const std::string & input) {
std::string result;
result.reserve(input.size() * 2);
enum Ctx { DICT, ARRAY };
std::vector<Ctx> ctx;
auto is_ws = [](char c) { return c == ' ' || c == '\t' || c == '\n' || c == '\r'; };
auto skip_ws = [&](size_t & pos) {
while (pos < input.size() && is_ws(input[pos])) {
result += input[pos++];
}
};
auto quote_unquoted_key = [&](size_t & pos) {
if (pos < input.size() && input[pos] != '"' && input[pos] != '}') {
result += '"';
while (pos < input.size() && input[pos] != ':' && !is_ws(input[pos])) {
result += input[pos++];
}
result += '"';
skip_ws(pos);
}
};
size_t i = 0;
while (i < input.size()) {
if (i + GEMMA4_QUOTE.size() <= input.size() &&
input.compare(i, GEMMA4_QUOTE.size(), GEMMA4_QUOTE) == 0) {
result += '"';
i += GEMMA4_QUOTE.size();
continue;
}
char c = input[i];
if (c == '{') {
result += c;
ctx.push_back(DICT);
++i;
skip_ws(i);
quote_unquoted_key(i);
continue;
}
if (c == '}') {
result += c;
if (!ctx.empty()) ctx.pop_back();
++i;
continue;
}
if (c == '[') {
result += c;
ctx.push_back(ARRAY);
++i;
continue;
}
if (c == ']') {
result += c;
if (!ctx.empty()) ctx.pop_back();
++i;
continue;
}
if (c == ',' && !ctx.empty() && ctx.back() == DICT) {
result += c;
++i;
skip_ws(i);
quote_unquoted_key(i);
continue;
}
result += c;
++i;
}
return result;
}
// Convert Python-style single-quoted strings to JSON double-quoted strings
// Only converts outer string delimiters, properly handling escape sequences:
// - {'key': 'value'} -> {"key": "value"}
@@ -296,10 +218,6 @@ std::string common_chat_peg_mapper::normalize_container_value(const std::string
return normalize_quotes_to_json(input);
}
std::string common_chat_peg_gemma4_mapper::normalize_container_value(const std::string & input) {
return normalize_quotes_to_json(normalize_gemma4_to_json(input));
}
void common_chat_peg_mapper::from_ast(const common_peg_ast_arena & arena,
const common_peg_parse_result & parse_result_arg) {
arena.visit(parse_result_arg, [this](const common_peg_ast_node & node) { map(node); });
@@ -947,3 +865,143 @@ common_peg_parser common_chat_peg_builder::standard_json_tools(
return force_tool_calls ? section : optional(section);
}
void common_chat_peg_gemma4_mapper::from_ast(const common_peg_ast_arena & arena, const common_peg_parse_result & result) {
for (const auto & node : result.nodes) {
visit(arena, node);
}
}
static std::string gemma4_to_json(const common_peg_ast_arena & arena, common_peg_ast_id id) {
const auto & node = arena.get(id);
if (node.text.empty()) {
return "";
}
if (node.rule == "gemma4-number" || node.rule == "gemma4-bool" || node.rule == "gemma4-null") {
return std::string(node.text);
}
if (node.rule == "gemma4-string-content") {
return escape_json_string_inner(std::string(node.text));
}
if (node.rule == "gemma4-string") {
std::string result = "\"";
if (!node.children.empty()) {
result += gemma4_to_json(arena, node.children[0]);
if (!node.is_partial) {
result += "\"";
}
}
return result;
}
if (node.rule == "gemma4-array") {
std::string result = "[";
bool add_comma = false;
for (auto child_id : node.children) {
if (add_comma) {
result += ',';
}
add_comma = true;
result += gemma4_to_json(arena, child_id);
}
if (!node.is_partial) {
result += ']';
}
return result;
}
if (node.rule == "gemma4-dict-key-name") {
return std::string(node.text);
}
if (node.rule == "gemma4-dict-key") {
std::string result = "\"";
if (!node.children.empty()) {
result += escape_json_string_inner(gemma4_to_json(arena, node.children[0]));
}
if (!node.is_partial) {
result += "\":";
}
return result;
}
if (node.rule == "gemma4-dict-kv") {
std::string result;
for (auto child_id : node.children) {
result += gemma4_to_json(arena, child_id);
}
return result;
}
if (node.rule == "gemma4-dict") {
std::string result = "{";
bool add_comma = false;
for (auto child_id : node.children) {
if (add_comma) {
result += ',';
}
add_comma = true;
result += gemma4_to_json(arena, child_id);
}
if (!node.is_partial) {
result += '}';
}
return result;
}
if (node.rule == "gemma4-value") {
if (!node.children.empty()) {
return gemma4_to_json(arena, node.children[0]);
}
return "";
}
return "";
}
void common_chat_peg_gemma4_mapper::visit(const common_peg_ast_arena & arena, common_peg_ast_id id) {
const auto & node = arena.get(id);
if (node.tag == "reasoning") {
result.reasoning_content += std::string(node.text);
return;
}
if (node.tag == "content") {
result.content += std::string(node.text);
return;
}
if (node.tag == "tool") {
auto name_id = arena.find_by_tag(node, "tool-name");
auto args_id = arena.find_by_tag(node, "tool-args");
if (name_id != COMMON_PEG_INVALID_AST_ID && args_id != COMMON_PEG_INVALID_AST_ID) {
const auto & name_node = arena.get(name_id);
const auto & args_node = arena.get(args_id);
if (!name_node.is_partial) {
common_chat_tool_call call;
call.name = std::string(name_node.text);
if (!args_node.children.empty()) {
call.arguments = gemma4_to_json(arena, args_node.children[0]);
}
result.tool_calls.push_back(call);
}
}
return;
}
for (auto child_id : node.children) {
visit(arena, child_id);
}
}
+3 -2
View File
@@ -35,8 +35,9 @@ class common_chat_peg_mapper {
class common_chat_peg_gemma4_mapper : public common_chat_peg_mapper {
public:
common_chat_peg_gemma4_mapper(common_chat_msg & msg) : common_chat_peg_mapper(msg) {}
protected:
std::string normalize_container_value(const std::string & input) override;
virtual void from_ast(const common_peg_ast_arena & arena, const common_peg_parse_result & result);
private:
void visit(const common_peg_ast_arena & arena, common_peg_ast_id id);
};
struct content_structure;
+268 -41
View File
@@ -1077,6 +1077,131 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
return data;
}
static common_chat_params common_chat_params_init_gemma4(const common_chat_template & tmpl,
const autoparser::generation_params & inputs) {
common_chat_params data;
data.prompt = common_chat_template_direct_apply_impl(tmpl, inputs);
data.format = COMMON_CHAT_FORMAT_PEG_GEMMA4;
data.supports_thinking = true;
data.preserved_tokens = {
"<|channel>",
"<channel|>",
"<|tool_call>",
"<tool_call|>",
"<|turn>",
};
auto has_tools = inputs.tools.is_array() && !inputs.tools.empty();
auto has_response_format = !inputs.json_schema.is_null() && inputs.json_schema.is_object();
auto include_grammar = has_response_format || (has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE);
auto extract_reasoning = inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE;
auto parser = build_chat_peg_parser([&](common_chat_peg_builder & p) {
auto start = p.rule("start", p.prefix(inputs.generation_prompt, "<|channel>"));
if (extract_reasoning) {
p.rule("thought", p.literal("<|channel>thought\n") + p.reasoning(p.until("<channel|>")) + p.literal("<channel|>"));
} else {
p.rule("thought", p.content(p.literal("<|channel>thought\n") + p.until("<channel|>") + p.literal("<channel|>")));
}
auto thought = (p.peek(p.literal("<|channel>")) + p.ref("thought")) | p.negate(p.literal("<|channel>"));
if (has_response_format) {
auto response_format = p.literal("```json") <<
p.content(p.schema(p.json(), "response-format-schema", inputs.json_schema)) <<
p.literal("```");
return start + p.optional(thought) + response_format;
}
if (has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE) {
// Gemma4 tool calling syntax
// Rules should match traversal logic in gemma4_to_json()
p.rule("gemma4-string-content", p.until("<|\"|>"));
p.rule("gemma4-string", p.literal("<|\"|>") + p.ref("gemma4-string-content") + p.literal("<|\"|>"));
p.rule("gemma4-bool", p.json_bool());
p.rule("gemma4-null", p.json_null());
p.rule("gemma4-number", p.json_number());
p.rule("gemma4-dict-key", p.rule("gemma4-dict-key-name", p.until(":")) + p.literal(":"));
p.rule("gemma4-dict-kv", p.ref("gemma4-dict-key") + p.space() + p.ref("gemma4-value"));
p.rule("gemma4-dict", [&]() {
auto ws = p.space();
auto member = p.ref("gemma4-dict-kv");
auto members = p.sequence({member, p.zero_or_more(p.sequence({p.literal(","), ws, member}))});
return p.sequence({
p.literal("{"), ws,
p.choice({p.literal("}"), p.sequence({members, ws, p.literal("}")})})
});
});
p.rule("gemma4-array", [&]() {
auto ws = p.space();
auto value = p.ref("gemma4-value");
auto elements = p.sequence({value, p.zero_or_more(p.sequence({p.literal(","), ws, value}))});
return p.sequence({
p.literal("["), ws,
p.choice({p.literal("]"), p.sequence({elements, ws, p.literal("]")})})
});
});
p.rule("gemma4-value", [&]() {
return p.choice({
p.ref("gemma4-string"), p.ref("gemma4-dict"), p.ref("gemma4-array"),
p.ref("gemma4-number"), p.ref("gemma4-bool"), p.ref("gemma4-null")
});
});
auto tool_choice = p.choice();
foreach_function(inputs.tools, [&](const json & tool) {
const auto & function = tool.at("function");
std::string name = function.at("name");
// TODO @aldehir : need to extend json-schema-to-grammar to produce more than JSON rules
// const auto & params = function.at("parameters");
tool_choice |= p.rule("tool-" + name, p.tool(p.sequence({
p.tool_open(p.tool_name(p.literal(name)) + p.peek(p.literal("{"))),
p.tool_args(p.ref("gemma4-dict")),
})));
});
auto tool_call = p.trigger_rule("tool-call", p.repeat(
"<|tool_call>call:" + tool_choice + "<tool_call|>",
/* min = */ inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED ? 1 : 0,
/* max = */ inputs.parallel_tool_calls ? -1 : 1
));
auto content = p.rule("content", p.content(p.until_one_of({"<|channel>", "<|tool_call>"})));
auto message = p.rule("message", thought + content);
return start + p.zero_or_more(message) + tool_call;
}
auto content = p.rule("content", p.content(p.until("<|channel>")));
auto message = p.rule("message", thought + content);
return start + p.one_or_more(message);
});
data.parser = parser.save();
if (include_grammar) {
data.grammar_lazy = !(has_response_format || (has_tools && inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED));
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
foreach_function(inputs.tools, [&](const json & tool) {
const auto & function = tool.at("function");
auto schema = function.at("parameters");
builder.resolve_refs(schema);
});
parser.build_grammar(builder, data.grammar_lazy);
});
data.grammar_triggers = {
{ COMMON_GRAMMAR_TRIGGER_TYPE_WORD, "<|tool_call>" },
};
}
return data;
}
// Functionary v3.2 - uses recipient-based format: >>>recipient\n{content}
static common_chat_params common_chat_params_init_functionary_v3_2(const common_chat_template & tmpl,
const autoparser::generation_params & inputs) {
@@ -1556,46 +1681,146 @@ static void requires_non_null_content(json & messages) {
}
// Gemma4 uses a custom tool_responses field instead of role:tool messages.
// Convert consecutive role:tool messages into a single user message with tool_responses.
//
// This will transform a sequence of messages:
// assistant(tool_call+) -> tool+ -> assistant(content)
//
// Into a single assistant message containing a tool_responses field:
// assistant(content + tool_call + tool_responses)
//
// This is necessary for the Gemma4 chat template to properly format the prompt.
// See https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4
struct gemma4_model_turn_builder {
json & messages;
size_t pos;
json tool_calls = json::array();
json tool_responses = json::array();
json content;
json reasoning_content;
gemma4_model_turn_builder(json & msgs, size_t pos) : messages(msgs), pos(pos) {}
void collect() {
// Collect the first assistant message
auto & msg = messages[pos];
if (msg.contains("reasoning_content") && msg.at("reasoning_content").is_string()) {
// According to the prompt formatting guide, we need to preserve reasoning_content
// between function calls. The current chat templates do not support this, but we will do it anyway.
reasoning_content = msg.at("reasoning_content");
}
for (auto & tc : msg.at("tool_calls")) {
tool_calls.push_back(tc);
}
pos++;
// Collect tool call results
while (pos < messages.size() && messages[pos].value("role", "") == "tool") {
collect_result(messages[pos]);
pos++;
}
// Check if the next assistant message is the final message
if (pos < messages.size() && messages[pos].value("role", "") == "assistant") {
auto & next = messages[pos];
if (!has_tool_calls(next) && has_content(next)) {
content = next.at("content");
pos++;
}
}
}
void collect_result(const json & curr) {
json response;
if (curr.contains("content")) {
const auto & content = curr.at("content");
if (content.is_string()) {
// Try to parse the content as JSON; fall back to raw string
try {
response = json::parse(content.get<std::string>());
} catch (...) {
response = content;
}
} else {
response = content;
}
}
std::string name;
// Match name with corresponding tool call
size_t idx = tool_responses.size();
if (idx < tool_calls.size()) {
auto & tc = tool_calls[idx];
if (tc.contains("function")) {
name = tc.at("function").value("name", "");
}
}
// Fallback to the tool call id
if (name.empty()) {
name = curr.value("tool_call_id", "");
}
tool_responses.push_back({{"name", name}, {"response", response}});
}
json build() {
collect();
json msg = {
{"role", "assistant"},
{"tool_calls", tool_calls},
};
if (!tool_responses.empty()) {
msg["tool_responses"] = tool_responses;
}
if (!content.is_null()) {
msg["content"] = content;
}
if (!reasoning_content.is_null()) {
msg["reasoning_content"] = reasoning_content;
}
return msg;
}
static bool has_content(const json & msg) {
if (!msg.contains("content") || msg.at("content").is_null()) {
return false;
}
const auto & content = msg.at("content");
if (content.is_string() && !content.get<std::string>().empty()) {
return true;
}
if (content.is_array() && !content.empty()) {
return true;
}
return false;
}
static bool has_tool_calls(const json & msg) {
return msg.contains("tool_calls") && msg.at("tool_calls").is_array() && !msg.at("tool_calls").empty();
}
};
static void convert_tool_responses_gemma4(json & messages) {
json result = json::array();
size_t i = 0;
while (i < messages.size()) {
if (messages[i].contains("role") && messages[i].at("role") == "tool") {
json tool_responses = json::array();
while (i < messages.size() &&
messages[i].contains("role") &&
messages[i].at("role") == "tool") {
const auto & tool_msg = messages[i];
std::string name;
if (tool_msg.contains("tool_call_id") && tool_msg.at("tool_call_id").is_string()) {
name = tool_msg.at("tool_call_id");
} else if (tool_msg.contains("name") && tool_msg.at("name").is_string()) {
name = tool_msg.at("name");
}
json response;
if (tool_msg.contains("content")) {
const auto & content = tool_msg.at("content");
if (content.is_string()) {
// Try to parse the content as JSON; fall back to raw string
try {
response = json::parse(content.get<std::string>());
} catch (...) {
response = content;
}
} else {
response = content;
}
}
tool_responses.push_back({{"name", name}, {"response", response}});
i++;
}
result.push_back({{"role", "user"}, {"tool_responses", tool_responses}});
} else {
result.push_back(messages[i]);
auto & msg = messages[i];
if (msg.value("role", "") != "assistant" || !msg.contains("tool_calls") ||
!msg.at("tool_calls").is_array() || msg.at("tool_calls").empty()) {
result.push_back(msg);
i++;
continue;
}
gemma4_model_turn_builder builder(messages, i);
result.push_back(builder.build());
i = builder.pos;
}
messages = result;
}
@@ -1631,10 +1856,10 @@ static json common_chat_extra_context() {
return ctx;
}
static std::optional<common_chat_params> try_specialized_template(
std::optional<common_chat_params> common_chat_try_specialized_template(
const common_chat_template & tmpl,
const std::string & src,
const autoparser::generation_params & params) {
autoparser::generation_params & params) {
// Ministral/Mistral Large 3 - uses special reasoning structure fixes, can't use autoparser
// Note: Mistral Small 3.2 uses [CALL_ID] which Ministral doesn't have, so we can distinguish them
if (src.find("[SYSTEM_PROMPT]") != std::string::npos && src.find("[TOOL_CALLS]") != std::string::npos &&
@@ -1687,6 +1912,12 @@ static std::optional<common_chat_params> try_specialized_template(
return common_chat_params_init_gigachat_v3(tmpl, params);
}
// Gemma4 format detection
if (src.find("'<|tool_call>call:'") != std::string::npos) {
workaround::convert_tool_responses_gemma4(params.messages);
return common_chat_params_init_gemma4(tmpl, params);
}
return std::nullopt;
}
@@ -1727,16 +1958,12 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
workaround::func_args_not_string(params.messages);
}
if (src.find("'<|tool_call>call:'") != std::string::npos) {
workaround::convert_tool_responses_gemma4(params.messages);
}
params.add_generation_prompt = false;
std::string no_gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
params.add_generation_prompt = true;
std::string gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
auto diff = calculate_diff_split(no_gen_prompt, gen_prompt);
params.generation_prompt = diff.right;
params.generation_prompt = diff.right + diff.suffix;
params.add_generation_prompt = inputs.add_generation_prompt;
@@ -1778,7 +2005,7 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
return data;
}
if (auto result = try_specialized_template(tmpl, src, params)) {
if (auto result = common_chat_try_specialized_template(tmpl, src, params)) {
result->generation_prompt = params.generation_prompt;
return *result;
}
+5
View File
@@ -270,3 +270,8 @@ std::map<std::string, bool> common_chat_templates_get_caps(const common_chat_tem
std::string common_chat_template_direct_apply(
const common_chat_template & tmpl,
const autoparser::generation_params & inputs);
std::optional<common_chat_params> common_chat_try_specialized_template(
const common_chat_template & tmpl,
const std::string & src,
autoparser::generation_params & params);
+3 -2
View File
@@ -579,8 +579,9 @@ struct common_params {
int32_t n_threads_http = -1; // number of threads to process HTTP requests (TODO: support threadpool)
int32_t n_cache_reuse = 0; // min chunk size to reuse from the cache via KV shifting
bool cache_prompt = true; // whether to enable prompt caching
int32_t n_ctx_checkpoints = 32; // max number of context checkpoints per slot
int32_t checkpoint_every_nt = 8192; // make a checkpoint every n tokens during prefill
bool clear_idle = true; // save and clear idle slots upon starting a new task
int32_t n_ctx_checkpoints = 32; // max number of context checkpoints per slot
int32_t checkpoint_every_nt = 8192; // make a checkpoint every n tokens during prefill
int32_t cache_ram_mib = 8192; // -1 = no limit, 0 - disable, 1 = 1 MiB, etc.
std::string hostname = "127.0.0.1";
+5 -4
View File
@@ -700,13 +700,13 @@ namespace console {
std::vector<std::string> entries;
size_t viewing_idx = SIZE_MAX;
std::string backup_line; // current line before viewing history
void add(const std::string & line) {
void add(std::string_view line) {
if (line.empty()) {
return;
}
// avoid duplicates with the last entry
if (entries.empty() || entries.back() != line) {
entries.push_back(line);
entries.emplace_back(line);
}
// also clear viewing state
end_viewing();
@@ -1031,11 +1031,12 @@ namespace console {
if (!end_of_stream && !line.empty()) {
// remove the trailing newline for history storage
std::string_view hline = line;
if (!line.empty() && line.back() == '\n') {
line.pop_back();
hline.remove_suffix(1);
}
// TODO: maybe support multiline history entries?
history.add(line);
history.add(hline);
}
fflush(out);
+6 -3
View File
@@ -596,9 +596,12 @@ static hf_cache::hf_file find_best_model(const hf_cache::hf_files & files,
}
}
for (const auto & f : files) {
if (gguf_filename_is_model(f.path)) {
return f;
// fallback to first available model only if tag is empty
if (tag.empty()) {
for (const auto & f : files) {
if (gguf_filename_is_model(f.path)) {
return f;
}
}
}
+32
View File
@@ -256,6 +256,38 @@ static std::pair<std::vector<common_peg_chars_parser::char_range>, bool> parse_c
return {ranges, negated};
}
common_peg_ast_id common_peg_ast_arena::find_by_tag(const common_peg_ast_node & parent, const std::string & tag, int max_depth) const {
for (auto child_id : parent.children) {
const auto & child = get(child_id);
if (child.tag == tag) {
return child_id;
}
if (max_depth > 1) {
auto result = find_by_tag(child, tag, max_depth - 1);
if (result != COMMON_PEG_INVALID_AST_ID) {
return result;
}
}
}
return COMMON_PEG_INVALID_AST_ID;
}
common_peg_ast_id common_peg_ast_arena::find_by_rule(const common_peg_ast_node & parent, const std::string & rule, int max_depth) const {
for (auto child_id : parent.children) {
const auto & child = get(child_id);
if (child.rule == rule) {
return child_id;
}
if (max_depth > 1) {
auto result = find_by_rule(child, rule, max_depth - 1);
if (result != COMMON_PEG_INVALID_AST_ID) {
return result;
}
}
}
return COMMON_PEG_INVALID_AST_ID;
}
void common_peg_ast_arena::visit(common_peg_ast_id id, const common_peg_ast_visitor & visitor) const {
if (id == COMMON_PEG_INVALID_AST_ID) {
return;
+3
View File
@@ -106,6 +106,9 @@ class common_peg_ast_arena {
const common_peg_ast_node & get(common_peg_ast_id id) const { return nodes_.at(id); }
common_peg_ast_id find_by_tag(const common_peg_ast_node & parent, const std::string & tag, int max_depth = 3) const;
common_peg_ast_id find_by_rule(const common_peg_ast_node & parent, const std::string & tag, int max_depth = 3) const;
size_t size() const { return nodes_.size(); }
void clear() { nodes_.clear(); }
+178 -15
View File
@@ -2219,10 +2219,10 @@ class MmprojModel(ModelBase):
self.image_size = self.find_vparam(["image_size"])
self.gguf_writer.add_vision_image_size(self.image_size)
self.gguf_writer.add_vision_patch_size(self.find_vparam(["patch_size"]))
self.gguf_writer.add_vision_embedding_length(self.find_vparam(["hidden_size", "vt_hidden_size"]))
self.gguf_writer.add_vision_embedding_length(self.find_vparam(["hidden_size", "width", "vt_hidden_size"]))
self.gguf_writer.add_vision_feed_forward_length(self.find_vparam(["intermediate_size", "vt_intermediate_size"]))
self.gguf_writer.add_vision_block_count(self.find_vparam(self.n_block_keys))
self.gguf_writer.add_vision_head_count(self.find_vparam(["num_attention_heads", "num_heads", "vt_num_attention_heads"]))
self.gguf_writer.add_vision_head_count(self.find_vparam(["num_attention_heads", "num_heads", "heads", "vt_num_attention_heads"]))
# preprocessor config
image_mean = _MISTRAL_COMMON_DATASET_MEAN if self.is_mistral_format else self.preprocessor_config["image_mean"]
@@ -4949,6 +4949,73 @@ class Glm4VVisionModel(Qwen3VLVisionModel):
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("StepVLForConditionalGeneration")
class Step3VLVisionModel(MmprojModel):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
assert self.hparams_vision is not None
if not self.hparams_vision.get("intermediate_size"):
hidden_size = self.hparams_vision.get("hidden_size") or self.hparams_vision.get("width") or 0
assert hidden_size > 0
mlp_ratio = float(self.hparams_vision.get("mlp_ratio", 8960 / 1536))
self.hparams_vision["intermediate_size"] = int(round(hidden_size * mlp_ratio))
self.preprocessor_config.setdefault("image_mean", list(_MISTRAL_COMMON_DATASET_MEAN))
self.preprocessor_config.setdefault("image_std", list(_MISTRAL_COMMON_DATASET_STD))
def set_gguf_parameters(self):
super().set_gguf_parameters()
assert self.hparams_vision is not None
projector_stride = int(self.global_config.get("understand_projector_stride", -1))
hidden_size = int(self.hparams_vision.get("hidden_size", self.hparams_vision.get("width", -1)))
num_layers = int(self.hparams_vision.get("num_hidden_layers", self.hparams_vision.get("layers", -1)))
assert (projector_stride, int(self.hparams_vision.get("image_size", -1)), hidden_size, num_layers) == (2, 728, 1536, 47), (
"current Step3-VL conversion path is only validated for Step3-VL-10B"
)
self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.STEP3VL)
self.gguf_writer.add_vision_attention_layernorm_eps(float(self.hparams_vision.get("layer_norm_eps", 1e-5)))
self.gguf_writer.add_vision_projector_scale_factor(projector_stride ** 2)
# 3024 max resize comes from step3-vl-10b processing_step3.py.
self.gguf_writer.add_vision_preproc_image_size(3024)
def tensor_force_quant(self, name, new_name, bid, n_dims):
if ".position_embd." in new_name:
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name.startswith("model.") or name.startswith("lm_head."):
return
if name.startswith("vision_model.vit_downsampler"):
match = re.match(r"vision_model\.vit_downsampler(\d+)\.(weight|bias)", name)
if match is None:
raise ValueError(f"Unexpected Step3-VL projector tensor {name!r}")
proj_id = int(match.group(1)) - 1
suffix = f".{match.group(2)}"
yield (self.format_tensor_name(gguf.MODEL_TENSOR.V_MMPROJ, proj_id, suffix=suffix), data_torch)
return
if name == "vit_large_projector.weight":
yield (self.format_tensor_name(gguf.MODEL_TENSOR.V_MMPROJ_FC), data_torch)
return
if name.startswith("vision_model."):
if name == "vision_model.positional_embedding":
name += ".weight"
elif name.endswith(".gamma") and ".ls_" in name:
name = name.removesuffix(".gamma") + ".weight"
name = name.replace("attn.in_proj_weight", "attn.in_proj.weight")
name = name.replace("attn.in_proj_bias", "attn.in_proj.bias")
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("Qwen3VLForConditionalGeneration")
class Qwen3VLTextModel(Qwen3Model):
model_arch = gguf.MODEL_ARCH.QWEN3VL
@@ -4969,6 +5036,16 @@ class Qwen3VLTextModel(Qwen3Model):
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("StepVLForConditionalGeneration")
class Step3VLTextModel(Qwen3Model):
model_arch = gguf.MODEL_ARCH.QWEN3
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name.startswith("vision_model.") or name.startswith("model.vision_model.") or name.startswith("vit_large_projector."):
return
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("Qwen3VLMoeForConditionalGeneration")
class Qwen3VLMoeTextModel(Qwen3MoeModel):
model_arch = gguf.MODEL_ARCH.QWEN3VLMOE
@@ -7472,7 +7549,7 @@ class Gemma4Model(Gemma3Model):
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
special_vocab.add_to_gguf(self.gguf_writer)
self.gguf_writer.add_add_space_prefix(False)
self.gguf_writer.add_add_bos_token(False) # already added via the chat template
self.gguf_writer.add_add_bos_token(True)
def set_gguf_parameters(self):
super().set_gguf_parameters()
@@ -11521,13 +11598,50 @@ class LLaDAMoEModel(TextModel):
raise ValueError(f"Unprocessed experts: {experts}")
@ModelBase.register("HunYuanDenseV1ForCausalLM")
@ModelBase.register("HunYuanDenseV1ForCausalLM", "HunYuanVLForConditionalGeneration")
class HunYuanModel(TextModel):
model_arch = gguf.MODEL_ARCH.HUNYUAN_DENSE
def _get_eod_token_id(self) -> int | None:
"""Get the actual end-of-generation token from config (eod_token_id)."""
return self.hparams.get("eod_token_id")
def _get_eot_token_id(self) -> int | None:
"""Get the end-of-turn token from generation_config.json.
This is the first entry in eos_token_id when it's a list."""
gen_cfg_path = self.dir_model / "generation_config.json"
if gen_cfg_path.is_file():
with open(gen_cfg_path, encoding="utf-8") as f:
gen_cfg = json.load(f)
eos = gen_cfg.get("eos_token_id")
if isinstance(eos, list) and len(eos) >= 2:
return eos[0]
return None
def _fix_special_tokens(self):
"""Fix EOS/EOT tokens that are incorrect in upstream configs."""
eod_id = self._get_eod_token_id()
if eod_id is not None:
self.gguf_writer.add_eos_token_id(eod_id)
eot_id = self._get_eot_token_id()
if eot_id is not None:
self.gguf_writer.add_eot_token_id(eot_id)
def set_vocab(self):
if (self.dir_model / "tokenizer.json").is_file():
self._set_vocab_gpt2()
tokens, toktypes, tokpre = self.get_vocab_base()
self.gguf_writer.add_tokenizer_model("gpt2")
self.gguf_writer.add_tokenizer_pre(tokpre)
self.gguf_writer.add_token_list(tokens)
self.gguf_writer.add_token_types(toktypes)
# HunyuanOCR has pad_token_id=-1 in config.json; exclude pad from SpecialVocab
token_types = None
if (self.hparams.get("pad_token_id") or 0) < 0:
token_types = ('bos', 'eos', 'unk', 'sep', 'cls', 'mask')
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True, special_token_types=token_types)
special_vocab.add_to_gguf(self.gguf_writer)
self._fix_special_tokens()
else:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
@@ -11579,13 +11693,18 @@ class HunYuanModel(TextModel):
# FIX for BOS token: Overwrite incorrect id read from config.json
if self.hparams['hidden_size'] == 4096:
self.gguf_writer.add_bos_token_id(127958) # only for 7b dense, fix <|bos|> token
self._fix_special_tokens()
def set_gguf_parameters(self):
# HunyuanOCR has num_experts=1 which is not MoE, prevent parent from writing it
saved_num_experts = self.hparams.pop("num_experts", None)
super().set_gguf_parameters()
if saved_num_experts is not None and saved_num_experts > 1:
self.hparams["num_experts"] = saved_num_experts
hparams = self.hparams
# Rope
if self.rope_parameters.get("rope_type") == "dynamic":
if self.rope_parameters.get("rope_type") in ("dynamic", "xdrope"):
# HunYuan uses NTK Aware Alpha based scaling. Original implementation: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
# 1000 corresponds to a usable context length of 256k (https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/report/Hunyuan_A13B_Technical_Report.pdf)
alpha = self.rope_parameters.get("alpha", 50)
@@ -11595,13 +11714,14 @@ class HunYuanModel(TextModel):
self.gguf_writer.add_rope_freq_base(scaled_base)
self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.NONE)
self.gguf_writer.add_rope_scaling_factor(1)
# There is no consistent way to calculate ctx from alpha, and the config is incorrectly set to 32k
self.gguf_writer.add_rope_scaling_orig_ctx_len(256 * 1024) # 256k context length
self.gguf_writer.add_context_length(256 * 1024) # 256k context length
if self.rope_parameters.get("rope_type") == "dynamic":
# There is no consistent way to calculate ctx from alpha, and the config is incorrectly set to 32k
self.gguf_writer.add_rope_scaling_orig_ctx_len(256 * 1024) # 256k context length
self.gguf_writer.add_context_length(256 * 1024) # 256k context length
# if any of our assumptions about the values are wrong, something has changed and this may need to be updated
assert base == 10000.0 and self.hparams["max_position_embeddings"] in [32 * 1024, 256 * 1024] , \
"HunYuan dynamic RoPE scaling assumptions changed, please update the logic or context length manually"
# if any of our assumptions about the values are wrong, something has changed and this may need to be updated
assert base == 10000.0 and self.hparams["max_position_embeddings"] in [32 * 1024, 256 * 1024] , \
"HunYuan dynamic RoPE scaling assumptions changed, please update the logic or context length manually"
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name == "lm_head.weight":
@@ -11609,9 +11729,48 @@ class HunYuanModel(TextModel):
logger.info("Skipping tied output layer 'lm_head.weight'")
return
# skip vision tensors for HunyuanVL models
if name.startswith("vit."):
return
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("HunYuanVLForConditionalGeneration")
class HunyuanOCRVisionModel(MmprojModel):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
assert self.hparams_vision is not None
# HunyuanOCR uses max_image_size instead of image_size
if "image_size" not in self.hparams_vision:
self.hparams_vision["image_size"] = self.hparams_vision.get("max_image_size", 2048)
def set_gguf_parameters(self):
super().set_gguf_parameters()
assert self.hparams_vision is not None
hparams = self.hparams_vision
self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.HUNYUANOCR)
self.gguf_writer.add_vision_use_gelu(True)
self.gguf_writer.add_vision_attention_layernorm_eps(hparams.get("rms_norm_eps", 1e-5))
self.gguf_writer.add_vision_spatial_merge_size(hparams.get("spatial_merge_size", 2))
self.gguf_writer.add_vision_min_pixels(self.preprocessor_config["min_pixels"])
self.gguf_writer.add_vision_max_pixels(self.preprocessor_config["max_pixels"])
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if not name.startswith("vit."):
return # skip text tensors
# strip CLS token (row 0) from position embeddings so resize_position_embeddings works
if "position_embedding" in name:
data_torch = data_torch[1:] # [n_patches+1, n_embd] -> [n_patches, n_embd]
yield from super().modify_tensors(data_torch, name, bid)
def tensor_force_quant(self, name, new_name, bid, n_dims):
# force conv weights to F32 or F16 to avoid BF16 IM2COL issues on Metal
if ("mm.0." in new_name or "mm.2." in new_name) and new_name.endswith(".weight"):
return gguf.GGMLQuantizationType.F16 if self.ftype == gguf.LlamaFileType.MOSTLY_F16 else gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)
@ModelBase.register("SmolLM3ForCausalLM")
class SmolLM3Model(LlamaModel):
model_arch = gguf.MODEL_ARCH.SMOLLM3
@@ -11736,10 +11895,8 @@ class LFM2Model(TextModel):
model_arch = gguf.MODEL_ARCH.LFM2
def _add_feed_forward_length(self):
ff_dim = self.hparams["block_ff_dim"]
ff_dim = self.find_hparam(["block_ff_dim", "intermediate_size"])
auto_adjust_ff_dim = self.hparams["block_auto_adjust_ff_dim"]
ff_dim = self.hparams["block_ff_dim"]
ffn_dim_multiplier = self.hparams["block_ffn_dim_multiplier"]
multiple_of = self.hparams["block_multiple_of"]
@@ -12914,6 +13071,12 @@ def get_model_architecture(hparams: dict[str, Any], model_type: ModelType) -> st
# For non-hf Mamba and Mamba2 models
arch = hparams["ssm_cfg"].get("layer", "Mamba") + "ForCausalLM"
# Step3-VL keeps text config under text_config but uses a custom top-level architecture.
# For text conversion we route to a dedicated text-only class.
# TODO: refactor this later to avoid adding exception here
if model_type == ModelType.TEXT and arch == "StepVLForConditionalGeneration":
return arch
# if "architectures" is found in the sub-config, use that instead
if model_type == ModelType.TEXT and text_config.get("architectures") is not None:
arch = text_config["architectures"][0]
+1 -1
View File
@@ -741,7 +741,7 @@ cmake --build build --config Release
WebGPU allows cross-platform access to the GPU from supported browsers. We utilize [Emscripten](https://emscripten.org/) to compile ggml's WebGPU backend to WebAssembly. Emscripten does not officially support WebGPU bindings yet, but Dawn currently maintains its own WebGPU bindings called emdawnwebgpu.
Follow the instructions [here](https://dawn.googlesource.com/dawn/+/refs/heads/main/src/emdawnwebgpu/) to download or build the emdawnwebgpu package (Note that it might be safer to build the emdawbwebgpu package locally, so that it stays in sync with the version of Dawn you have installed above). When building using CMake, the path to the emdawnwebgpu port file needs to be set with the flag `EMDAWNWEBGPU_DIR`.
Follow the instructions [here](https://dawn.googlesource.com/dawn/+/refs/heads/main/src/emdawnwebgpu/) to download or build the emdawnwebgpu package (Note that it might be safer to build the emdawnwebgpu package locally, so that it stays in sync with the version of Dawn you have installed above). When building using CMake, the path to the emdawnwebgpu port file needs to be set with the flag `EMDAWNWEBGPU_DIR`.
## IBM Z & LinuxONE
+1
View File
@@ -37,6 +37,7 @@ llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload
> - PaddleOCR-VL: https://github.com/ggml-org/llama.cpp/pull/18825
> - GLM-OCR: https://github.com/ggml-org/llama.cpp/pull/19677
> - Deepseek-OCR: https://github.com/ggml-org/llama.cpp/pull/17400
> - HunyuanOCR: https://github.com/ggml-org/llama.cpp/pull/21395
## Pre-quantized models
+1 -1
View File
@@ -68,7 +68,7 @@ Legend:
| MEAN | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| MUL | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| MUL_MAT | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| MUL_MAT_ID | ❌ | 🟡 | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | | 🟡 | ❌ |
| MUL_MAT_ID | ❌ | 🟡 | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | 🟡 | ❌ |
| NEG | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | ❌ | ❌ | ❌ |
| OPT_STEP_ADAMW | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
+527 -618
View File
File diff suppressed because it is too large Load Diff
+5 -1
View File
@@ -9,6 +9,7 @@
#include <vector>
#include <filesystem>
#include <fstream>
#include <optional>
#include <regex>
static void print_usage(int /*argc*/, char ** argv) {
@@ -222,7 +223,10 @@ int main(int argc, char ** argv) {
llama_backend_init();
llama_numa_init(params.numa);
base_callback_data cb_data(params, params.tensor_filter);
std::optional<base_callback_data> cb_data;
if (!params.save_logits) {
cb_data.emplace(params, params.tensor_filter);
}
auto llama_init = common_init_from_params(params);
+9 -5
View File
@@ -428,7 +428,8 @@ extern "C" {
// GGML_TYPE_IQ4_NL_8_8 = 38,
GGML_TYPE_MXFP4 = 39, // MXFP4 (1 block)
GGML_TYPE_NVFP4 = 40, // NVFP4 (4 blocks, E4M3 scale)
GGML_TYPE_COUNT = 41,
GGML_TYPE_Q1_0 = 41,
GGML_TYPE_COUNT = 42,
};
// precision
@@ -465,6 +466,7 @@ extern "C" {
GGML_FTYPE_MOSTLY_BF16 = 24, // except 1d tensors
GGML_FTYPE_MOSTLY_MXFP4 = 25, // except 1d tensors
GGML_FTYPE_MOSTLY_NVFP4 = 26, // except 1d tensors
GGML_FTYPE_MOSTLY_Q1_0 = 27, // except 1d tensors
};
// available tensor operations:
@@ -900,15 +902,17 @@ extern "C" {
struct ggml_tensor * b,
struct ggml_tensor * ids);
GGML_API struct ggml_tensor * ggml_add1(
GGML_DEPRECATED(GGML_API struct ggml_tensor * ggml_add1(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b);
struct ggml_tensor * b),
"use ggml_add instead");
GGML_API struct ggml_tensor * ggml_add1_inplace(
GGML_DEPRECATED(GGML_API struct ggml_tensor * ggml_add1_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b);
struct ggml_tensor * b),
"use ggml_add_inplace instead");
// dst = a
// view(dst, nb1, nb2, nb3, offset) += b
+11
View File
@@ -93,6 +93,10 @@ typedef sycl::half2 ggml_half2;
// QR = QK / number of values before dequantization
// QI = number of 32 bit integers before dequantization
#define QI1_0 (QK1_0 / 32)
#define QR1_0 1
#define QI4_0 (QK4_0 / (4 * QR4_0))
#define QR4_0 2
@@ -170,6 +174,13 @@ typedef sycl::half2 ggml_half2;
#define GGML_EXTENSION __extension__
#endif // _MSC_VER
#define QK1_0 128
typedef struct {
ggml_half d; // delta
uint8_t qs[QK1_0 / 8]; // bits / quants
} block_q1_0;
static_assert(sizeof(block_q1_0) == sizeof(ggml_half) + QK1_0 / 8, "wrong q1_0 block size/padding");
#define QK4_0 32
typedef struct {
ggml_half d; // delta
+7
View File
@@ -16,6 +16,7 @@
#define ggml_vec_dot_q8_0_q8_0_generic ggml_vec_dot_q8_0_q8_0
#define ggml_vec_dot_mxfp4_q8_0_generic ggml_vec_dot_mxfp4_q8_0
#define ggml_vec_dot_nvfp4_q8_0_generic ggml_vec_dot_nvfp4_q8_0
#define ggml_vec_dot_q1_0_q8_0_generic ggml_vec_dot_q1_0_q8_0
#define ggml_vec_dot_tq1_0_q8_K_generic ggml_vec_dot_tq1_0_q8_K
#define ggml_vec_dot_tq2_0_q8_K_generic ggml_vec_dot_tq2_0_q8_K
#define ggml_vec_dot_q2_K_q8_K_generic ggml_vec_dot_q2_K_q8_K
@@ -82,6 +83,7 @@
#elif defined(__x86_64__) || defined(__i386__) || defined(_M_IX86) || defined(_M_X64)
// quants.c
#define ggml_vec_dot_nvfp4_q8_0_generic ggml_vec_dot_nvfp4_q8_0
#define ggml_vec_dot_q1_0_q8_0_generic ggml_vec_dot_q1_0_q8_0
// repack.cpp
#define ggml_quantize_mat_q8_0_4x4_generic ggml_quantize_mat_q8_0_4x4
#define ggml_quantize_mat_q8_K_4x4_generic ggml_quantize_mat_q8_K_4x4
@@ -112,6 +114,7 @@
// quants.c
#define quantize_row_q8_K_generic quantize_row_q8_K
#define ggml_vec_dot_nvfp4_q8_0_generic ggml_vec_dot_nvfp4_q8_0
#define ggml_vec_dot_q1_0_q8_0_generic ggml_vec_dot_q1_0_q8_0
#define ggml_vec_dot_tq1_0_q8_K_generic ggml_vec_dot_tq1_0_q8_K
#define ggml_vec_dot_tq2_0_q8_K_generic ggml_vec_dot_tq2_0_q8_K
#define ggml_vec_dot_iq1_m_q8_K_generic ggml_vec_dot_iq1_m_q8_K
@@ -160,6 +163,7 @@
#define ggml_vec_dot_iq1_m_q8_K_generic ggml_vec_dot_iq1_m_q8_K
#define ggml_vec_dot_mxfp4_q8_0_generic ggml_vec_dot_mxfp4_q8_0
#define ggml_vec_dot_nvfp4_q8_0_generic ggml_vec_dot_nvfp4_q8_0
#define ggml_vec_dot_q1_0_q8_0_generic ggml_vec_dot_q1_0_q8_0
// repack.cpp
#define ggml_quantize_mat_q8_0_4x4_generic ggml_quantize_mat_q8_0_4x4
#define ggml_quantize_mat_q8_0_4x8_generic ggml_quantize_mat_q8_0_4x8
@@ -200,6 +204,7 @@
#elif defined(__riscv)
// quants.c
#define ggml_vec_dot_nvfp4_q8_0_generic ggml_vec_dot_nvfp4_q8_0
#define ggml_vec_dot_q1_0_q8_0_generic ggml_vec_dot_q1_0_q8_0
// repack.cpp
#define ggml_quantize_mat_q8_0_4x1_generic ggml_quantize_mat_q8_0_4x1
#define ggml_quantize_mat_q8_0_4x4_generic ggml_quantize_mat_q8_0_4x4
@@ -240,6 +245,7 @@
// quants.c
#define quantize_row_q8_K_generic quantize_row_q8_K
#define ggml_vec_dot_nvfp4_q8_0_generic ggml_vec_dot_nvfp4_q8_0
#define ggml_vec_dot_q1_0_q8_0_generic ggml_vec_dot_q1_0_q8_0
#define ggml_vec_dot_tq1_0_q8_K_generic ggml_vec_dot_tq1_0_q8_K
#define ggml_vec_dot_tq2_0_q8_K_generic ggml_vec_dot_tq2_0_q8_K
#define ggml_vec_dot_q2_K_q8_K_generic ggml_vec_dot_q2_K_q8_K
@@ -303,6 +309,7 @@
#define ggml_vec_dot_iq4_xs_q8_K_generic ggml_vec_dot_iq4_xs_q8_K
#define ggml_vec_dot_mxfp4_q8_0_generic ggml_vec_dot_mxfp4_q8_0
#define ggml_vec_dot_nvfp4_q8_0_generic ggml_vec_dot_nvfp4_q8_0
#define ggml_vec_dot_q1_0_q8_0_generic ggml_vec_dot_q1_0_q8_0
// repack.cpp
#define ggml_quantize_mat_q8_0_4x4_generic ggml_quantize_mat_q8_0_4x4
#define ggml_quantize_mat_q8_0_4x8_generic ggml_quantize_mat_q8_0_4x8
+103
View File
@@ -137,6 +137,109 @@ void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, in
//===================================== Dot products =================================
void ggml_vec_dot_q1_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
const int qk = QK1_0; // 128
const int nb = n / qk;
assert(n % qk == 0);
assert(nrc == 1);
UNUSED(nrc);
UNUSED(bx);
UNUSED(by);
UNUSED(bs);
const block_q1_0 * GGML_RESTRICT x = vx;
const block_q8_0 * GGML_RESTRICT y = vy;
float sumf = 0.0f;
#if defined(__ARM_NEON)
float32x4_t sumv = vdupq_n_f32(0.0f);
for (int i = 0; i < nb; i++) {
const float d0 = GGML_CPU_FP16_TO_FP32(x[i].d);
// Process 4 Q8_0 blocks (each has 32 elements)
for (int k = 0; k < 4; k++) {
const block_q8_0 * GGML_RESTRICT yb = &y[i * 4 + k];
const float d1 = GGML_CPU_FP16_TO_FP32(yb->d);
// Get the 4 bytes of bits for this Q8_0 block (32 bits = 4 bytes)
// Bits are at offset k*4 bytes in x[i].qs
const uint8_t * bits = &x[i].qs[k * 4];
// Load 32 int8 values from y
const int8x16_t y0 = vld1q_s8(yb->qs);
const int8x16_t y1 = vld1q_s8(yb->qs + 16);
// Byte 0-1: bits for y0[0..15]
const uint64_t expand0 = table_b2b_0[bits[0]];
const uint64_t expand1 = table_b2b_0[bits[1]];
// Byte 2-3: bits for y1[0..15]
const uint64_t expand2 = table_b2b_0[bits[2]];
const uint64_t expand3 = table_b2b_0[bits[3]];
// Build the sign vectors by reinterpreting the table values
uint8x8_t e0 = vcreate_u8(expand0);
uint8x8_t e1 = vcreate_u8(expand1);
uint8x8_t e2 = vcreate_u8(expand2);
uint8x8_t e3 = vcreate_u8(expand3);
// Shift right by 4 to get 0 or 1
int8x8_t s0 = vreinterpret_s8_u8(vshr_n_u8(e0, 4));
int8x8_t s1 = vreinterpret_s8_u8(vshr_n_u8(e1, 4));
int8x8_t s2 = vreinterpret_s8_u8(vshr_n_u8(e2, 4));
int8x8_t s3 = vreinterpret_s8_u8(vshr_n_u8(e3, 4));
// Convert 0/1 to -1/+1: sign = 2*val - 1
int8x8_t one = vdup_n_s8(1);
s0 = vsub_s8(vadd_s8(s0, s0), one); // 2*s0 - 1
s1 = vsub_s8(vadd_s8(s1, s1), one);
s2 = vsub_s8(vadd_s8(s2, s2), one);
s3 = vsub_s8(vadd_s8(s3, s3), one);
// Combine into 16-element vectors
int8x16_t signs0 = vcombine_s8(s0, s1);
int8x16_t signs1 = vcombine_s8(s2, s3);
// Multiply signs with y values and accumulate
// dot(signs, y) where signs are +1/-1
int32x4_t p0 = ggml_vdotq_s32(vdupq_n_s32(0), signs0, y0);
int32x4_t p1 = ggml_vdotq_s32(p0, signs1, y1);
// Scale by d1 and accumulate
sumv = vmlaq_n_f32(sumv, vcvtq_f32_s32(p1), d0 * d1);
}
}
sumf = vaddvq_f32(sumv);
#else
// Scalar fallback
for (int i = 0; i < nb; i++) {
const float d0 = GGML_FP16_TO_FP32(x[i].d);
// Process 4 Q8_0 blocks
for (int k = 0; k < 4; k++) {
const float d1 = GGML_FP16_TO_FP32(y[i*4 + k].d);
int sumi = 0;
for (int j = 0; j < QK8_0; j++) {
const int bit_index = k * QK8_0 + j;
const int byte_index = bit_index / 8;
const int bit_offset = bit_index % 8;
const int xi = ((x[i].qs[byte_index] >> bit_offset) & 1) ? 1 : -1;
sumi += xi * y[i*4 + k].qs[j];
}
sumf += d0 * d1 * sumi;
}
}
#endif
*s = sumf;
}
void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
const int qk = QK8_0;
const int nb = n / qk;
@@ -2156,4 +2156,3 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
ggml_vec_dot_iq4_xs_q8_K_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}
-1
View File
@@ -2302,4 +2302,3 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
ggml_vec_dot_iq4_xs_q8_K_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}
-1
View File
@@ -1463,4 +1463,3 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
ggml_vec_dot_iq4_xs_q8_K_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}
-1
View File
@@ -1218,4 +1218,3 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
ggml_vec_dot_q6_K_q8_K_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}
+6
View File
@@ -217,6 +217,12 @@ static const struct ggml_type_traits_cpu type_traits_cpu[GGML_TYPE_COUNT] = {
.vec_dot_type = GGML_TYPE_F16,
.nrows = 1,
},
[GGML_TYPE_Q1_0] = {
.from_float = quantize_row_q1_0,
.vec_dot = ggml_vec_dot_q1_0_q8_0,
.vec_dot_type = GGML_TYPE_Q8_0,
.nrows = 1,
},
[GGML_TYPE_Q4_0] = {
.from_float = quantize_row_q4_0,
.vec_dot = ggml_vec_dot_q4_0_q8_0,
+2
View File
@@ -4829,6 +4829,7 @@ void ggml_compute_forward_get_rows(
const ggml_tensor * src0 = dst->src[0];
switch (src0->type) {
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
@@ -5554,6 +5555,7 @@ void ggml_compute_forward_clamp(
ggml_compute_forward_clamp_f16(params, dst);
} break;
case GGML_TYPE_BF16:
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
+49
View File
@@ -22,6 +22,10 @@
#define UNUSED GGML_UNUSED
void quantize_row_q1_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k) {
quantize_row_q1_0_ref(x, y, k);
}
void quantize_row_q4_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k) {
quantize_row_q4_0_ref(x, y, k);
}
@@ -116,6 +120,51 @@ void quantize_row_q8_K_generic(const float * GGML_RESTRICT x, void * GGML_RESTRI
//===================================== Dot products =================================
void ggml_vec_dot_q1_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
const int qk = QK1_0;
const int nb = n / qk;
assert(n % qk == 0);
assert(nrc == 1);
UNUSED(nrc);
UNUSED(bx);
UNUSED(by);
UNUSED(bs);
const block_q1_0 * GGML_RESTRICT x = vx;
const block_q8_0 * GGML_RESTRICT y = vy;
float sumf = 0.0;
for (int i = 0; i < nb; i++) {
const float d0 = GGML_FP16_TO_FP32(x[i].d);
float sumi = 0.0f;
for (int k = 0; k < 4; k++) {
const float d1 = GGML_FP16_TO_FP32(y[i*4 + k].d);
int sumi_block = 0;
for (int j = 0; j < QK8_0; j++) {
const int bit_index = k * QK8_0 + j;
const int byte_index = bit_index / 8;
const int bit_offset = bit_index % 8;
const int xi = ((x[i].qs[byte_index] >> bit_offset) & 1) ? 1 : -1;
sumi_block += xi * y[i*4 + k].qs[j];
}
sumi += d1 * sumi_block;
}
sumf += d0 * sumi;
}
*s = sumf;
}
void ggml_vec_dot_q4_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
const int qk = QK8_0;
const int nb = n / qk;
+3
View File
@@ -12,6 +12,7 @@ extern "C" {
#endif
// Quantization
void quantize_row_q1_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void quantize_row_q4_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void quantize_row_q4_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void quantize_row_q5_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
@@ -36,6 +37,7 @@ void quantize_row_iq4_nl (const float * GGML_RESTRICT x, void * GGML_RESTRICT y,
void quantize_row_iq4_xs (const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
// Dot product
void ggml_vec_dot_q1_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc);
@@ -68,6 +70,7 @@ void ggml_vec_dot_iq3_s_q8_K (int n, float * GGML_RESTRICT s, size_t bs, const
void quantize_row_q8_0_generic(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q8_1_generic(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q8_K_generic(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void ggml_vec_dot_q1_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_q4_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_q4_1_q8_1_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_q5_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc);
+2 -21
View File
@@ -65,7 +65,7 @@
#define GGML_CUDA_CC_VEGA (GGML_CUDA_CC_OFFSET_AMD + 0x900) // Vega56/64, minimum for fp16 dual issue
#define GGML_CUDA_CC_VEGA20 (GGML_CUDA_CC_OFFSET_AMD + 0x906) // MI50/Radeon VII, minimum for dp4a
#define GGML_CUDA_CC_CDNA1 (GGML_CUDA_CC_OFFSET_AMD + 0x908) // MI100, minimum for MFMA, acc registers
#define GGML_CUDA_CC_CDNA2 (GGML_CUDA_CC_OFFSET_AMD + 0x910) // MI210, minimum acc register renameing
#define GGML_CUDA_CC_CDNA2 (GGML_CUDA_CC_OFFSET_AMD + 0x90a) // MI210 (gfx90a), minimum acc register renaming
#define GGML_CUDA_CC_CDNA3 (GGML_CUDA_CC_OFFSET_AMD + 0x942) // MI300
// RDNA removes MFMA, dp4a, xnack, acc registers, wave size is 32
@@ -1157,19 +1157,6 @@ struct ggml_tensor_extra_gpu {
#define USE_CUDA_GRAPH
#endif
struct ggml_cuda_graph_node_properties {
void * node_data;
ggml_op node_op;
enum ggml_type node_type;
int32_t flags;
int64_t ne[GGML_MAX_DIMS];
size_t nb[GGML_MAX_DIMS];
void * src_data[GGML_MAX_SRC];
int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];
};
static_assert(std::is_trivial<ggml_cuda_graph_node_properties>::value, "ggml_cuda_graph_node_properties must be trivial");
struct ggml_cuda_graph {
#ifdef USE_CUDA_GRAPH
~ggml_cuda_graph() {
@@ -1186,13 +1173,7 @@ struct ggml_cuda_graph {
std::vector<cudaGraphNode_t> nodes;
bool disable_due_to_gpu_arch = false;
bool warmup_complete = false;
std::vector<ggml_cuda_graph_node_properties> props;
// these are extra tensors (inputs) that participate in the ggml graph but are not nodes
// they properties also have to match in order to be able to safely reuse a CUDA graph
// ref: https://github.com/ggml-org/llama.cpp/pull/18583
// ref: https://github.com/ggml-org/llama.cpp/pull/19165
std::vector<ggml_cuda_graph_node_properties> extra;
std::vector<ggml_tensor> nodes_copy;
bool is_enabled() const {
static const bool disable_cuda_graphs_due_to_env = (getenv("GGML_CUDA_DISABLE_GRAPHS") != nullptr);
+153 -25
View File
@@ -676,9 +676,96 @@ static __global__ void flash_attn_mask_to_KV_max(
template<int D, int ncols1, int ncols2> // D == head size
__launch_bounds__(D, 1)
static __global__ void flash_attn_stream_k_fixup(
float * __restrict__ dst, const float2 * __restrict__ dst_fixup, const int ne01, const int ne02, const int ne03,
const int ne11, const int ne12, const int nbatch_fa) {
static __global__ void flash_attn_stream_k_fixup_uniform(
float * __restrict__ dst,
const float2 * __restrict__ dst_fixup,
const int ne01, const int ne02,
const int ne12, const int nblocks_stream_k,
const int gqa_ratio,
const int blocks_per_tile,
const uint3 fd_iter_j_z_ne12,
const uint3 fd_iter_j_z,
const uint3 fd_iter_j) {
constexpr int ncols = ncols1*ncols2;
const int tile_idx = blockIdx.x; // One block per output tile.
const int j = blockIdx.y;
const int c = blockIdx.z;
const int jc = j*ncols2 + c;
const int tid = threadIdx.x;
// nblocks_stream_k is a multiple of ntiles_dst (== gridDim.x), so each tile gets the same number of blocks.
const int b_first = tile_idx * blocks_per_tile;
const int b_last = b_first + blocks_per_tile - 1;
const float * dst_fixup_data = ((const float *) dst_fixup) + nblocks_stream_k*(2*2*ncols);
// z_KV == K/V head index, zt_gqa = Q head start index per K/V head, jt = token position start index
const uint2 dm0 = fast_div_modulo(tile_idx, fd_iter_j_z_ne12);
const uint2 dm1 = fast_div_modulo(dm0.y, fd_iter_j_z);
const uint2 dm2 = fast_div_modulo(dm1.y, fd_iter_j);
const int sequence = dm0.x;
const int z_KV = dm1.x;
const int zt_gqa = dm2.x;
const int jt = dm2.y;
const int zt_Q = z_KV*gqa_ratio + zt_gqa*ncols2; // Global Q head start index.
if (jt*ncols1 + j >= ne01 || zt_gqa*ncols2 + c >= gqa_ratio) {
return;
}
dst += sequence*ne02*ne01*D + jt*ne02*(ncols1*D) + zt_Q*D + (j*ne02 + c)*D + tid;
// Load the partial result that needs a fixup
float dst_val = *dst;
float max_val;
float rowsum;
{
const float2 tmp = dst_fixup[b_last*ncols + jc];
max_val = tmp.x;
rowsum = tmp.y;
}
// Combine with all previous blocks in this tile.
for (int bidx = b_last - 1; bidx >= b_first; --bidx) {
const float dst_add = dst_fixup_data[bidx*ncols*D + jc*D + tid];
const float2 tmp = dst_fixup[(nblocks_stream_k + bidx)*ncols + jc];
const float max_val_new = fmaxf(max_val, tmp.x);
const float diff_val = max_val - max_val_new;
const float diff_add = tmp.x - max_val_new;
const float scale_val = diff_val >= SOFTMAX_FTZ_THRESHOLD ? expf(diff_val) : 0.0f;
const float scale_add = diff_add >= SOFTMAX_FTZ_THRESHOLD ? expf(diff_add) : 0.0f;
dst_val = scale_val*dst_val + scale_add*dst_add;
rowsum = scale_val*rowsum + scale_add*tmp.y;
max_val = max_val_new;
}
// Write back final result:
*dst = dst_val / rowsum;
}
// General fixup kernel for the case where the number of blocks per tile is not uniform across tiles
// (blocks_num.x not a multiple of ntiles_dst)
template <int D, int ncols1, int ncols2> // D == head size
__launch_bounds__(D, 1)
static __global__ void flash_attn_stream_k_fixup_general(
float * __restrict__ dst,
const float2 * __restrict__ dst_fixup,
const int ne01, const int ne02,
const int gqa_ratio,
const int total_work,
const uint3 fd_iter_k_j_z_ne12,
const uint3 fd_iter_k_j_z,
const uint3 fd_iter_k_j,
const uint3 fd_iter_k) {
constexpr int ncols = ncols1*ncols2;
const int bidx0 = blockIdx.x;
@@ -689,27 +776,26 @@ static __global__ void flash_attn_stream_k_fixup(
const float * dst_fixup_data = ((const float *) dst_fixup) + gridDim.x*(2*2*ncols);
const int gqa_ratio = ne02 / ne12; // With grouped query attention there are > 1 Q matrices per K, V matrix.
const int iter_k = (ne11 + (nbatch_fa - 1)) / nbatch_fa;
const int iter_j = (ne01 + (ncols1 - 1)) / ncols1;
const int iter_z_gqa = (gqa_ratio + (ncols2 - 1)) / ncols2;
const int kbc0 = int64_t(bidx0 + 0)*(iter_k*iter_j*iter_z_gqa*ne12*ne03) / gridDim.x;
const int kbc0_stop = int64_t(bidx0 + 1)*(iter_k*iter_j*iter_z_gqa*ne12*ne03) / gridDim.x;
const int kbc0 = int64_t(bidx0 + 0)*total_work / gridDim.x;
const int kbc0_stop = int64_t(bidx0 + 1)*total_work / gridDim.x;
const bool did_not_have_any_data = kbc0 == kbc0_stop;
const bool wrote_beginning_of_tile = kbc0 % iter_k == 0;
const bool did_not_write_last = kbc0/iter_k == kbc0_stop/iter_k && kbc0_stop % iter_k != 0;
const bool wrote_beginning_of_tile = fastmodulo(kbc0, fd_iter_k) == 0;
const bool did_not_write_last = fastdiv(kbc0, fd_iter_k) == fastdiv(kbc0_stop, fd_iter_k) && fastmodulo(kbc0_stop, fd_iter_k) != 0;
if (did_not_have_any_data || wrote_beginning_of_tile || did_not_write_last) {
return;
}
// z_KV == K/V head index, zt_gqa = Q head start index per K/V head, jt = token position start index
const int sequence = kbc0 /(iter_k*iter_j*iter_z_gqa*ne12);
const int z_KV = (kbc0 - iter_k*iter_j*iter_z_gqa*ne12 * sequence)/(iter_k*iter_j*iter_z_gqa);
const int zt_gqa = (kbc0 - iter_k*iter_j*iter_z_gqa*ne12 * sequence - iter_k*iter_j*iter_z_gqa * z_KV)/(iter_k*iter_j);
const int jt = (kbc0 - iter_k*iter_j*iter_z_gqa*ne12 * sequence - iter_k*iter_j*iter_z_gqa * z_KV - iter_k*iter_j * zt_gqa) / iter_k;
const uint2 dm0 = fast_div_modulo(kbc0, fd_iter_k_j_z_ne12);
const uint2 dm1 = fast_div_modulo(dm0.y, fd_iter_k_j_z);
const uint2 dm2 = fast_div_modulo(dm1.y, fd_iter_k_j);
const uint2 dm3 = fast_div_modulo(dm2.y, fd_iter_k);
const int sequence = dm0.x;
const int z_KV = dm1.x;
const int zt_gqa = dm2.x;
const int jt = dm3.x;
const int zt_Q = z_KV*gqa_ratio + zt_gqa*ncols2; // Global Q head start index.
@@ -733,10 +819,11 @@ static __global__ void flash_attn_stream_k_fixup(
// Iterate over previous blocks and compute the combined results.
// All CUDA blocks that get here must have a previous block that needs a fixup.
const int tile_kbc0 = fastdiv(kbc0, fd_iter_k);
int bidx = bidx0 - 1;
int kbc_stop = kbc0;
while(true) {
const int kbc = int64_t(bidx)*(iter_k*iter_j*iter_z_gqa*ne12*ne03) / gridDim.x;
const int kbc = int64_t(bidx)*total_work / gridDim.x;
if (kbc == kbc_stop) { // Did not have any data.
bidx--;
kbc_stop = kbc;
@@ -762,7 +849,7 @@ static __global__ void flash_attn_stream_k_fixup(
max_val = max_val_new;
// If this block started in a previous tile we are done and don't need to combine additional partial results.
if (kbc % iter_k == 0 || kbc/iter_k < kbc0/iter_k) {
if (fastmodulo(kbc, fd_iter_k) == 0 || fastdiv(kbc, fd_iter_k) < tile_kbc0) {
break;
}
bidx--;
@@ -976,14 +1063,28 @@ void launch_fattn(
const int tiles_nwaves = (ntiles_dst + max_blocks - 1) / max_blocks;
const int tiles_efficiency_percent = 100 * ntiles_dst / (max_blocks*tiles_nwaves);
const int nblocks_stream_k = std::min(max_blocks, ntiles_KV*ntiles_dst);
const bool use_stream_k = cc >= GGML_CUDA_CC_ADA_LOVELACE || amd_wmma_available(cc) || tiles_efficiency_percent < 75;
blocks_num.x = use_stream_k ? nblocks_stream_k : ntiles_dst;
blocks_num.x = ntiles_dst;
blocks_num.y = 1;
blocks_num.z = 1;
if(use_stream_k) {
const int nblocks_stream_k_raw = std::min(max_blocks, ntiles_KV*ntiles_dst);
// Round down to a multiple of ntiles_dst so that each output tile gets the same number of blocks (avoids fixup).
// Only do this if the occupancy loss from rounding is acceptable.
const int nblocks_stream_k_rounded = (nblocks_stream_k_raw / ntiles_dst) * ntiles_dst;
const int max_efficiency_loss_percent = 5;
const int efficiency_loss_percent = nblocks_stream_k_rounded > 0
? 100 * (nblocks_stream_k_raw - nblocks_stream_k_rounded) / nblocks_stream_k_raw
: 100;
const int nblocks_stream_k = efficiency_loss_percent <= max_efficiency_loss_percent
? nblocks_stream_k_rounded
: nblocks_stream_k_raw;
blocks_num.x = nblocks_stream_k;
}
if (ntiles_dst % blocks_num.x != 0) { // Fixup is only needed if the SMs work on fractional tiles.
dst_tmp_meta.alloc((size_t(blocks_num.x) * ncols * (2 + DV/2)));
}
@@ -1063,13 +1164,40 @@ void launch_fattn(
CUDA_CHECK(cudaGetLastError());
if (stream_k) {
if (ntiles_dst % blocks_num.x != 0) { // Fixup is only needed if the SMs work on fractional tiles.
if ((int)blocks_num.x % ntiles_dst == 0 && (int)blocks_num.x > ntiles_dst) {
// Optimized fixup: nblocks_stream_k is a multiple of ntiles_dst, launch one block per tile.
const int nblocks_sk = (int)blocks_num.x;
const int bpt = nblocks_sk / ntiles_dst;
const uint3 fd0 = init_fastdiv_values(ntiles_x * ntiles_z_gqa * K->ne[2]);
const uint3 fd1 = init_fastdiv_values(ntiles_x * ntiles_z_gqa);
const uint3 fd2 = init_fastdiv_values(ntiles_x);
const dim3 block_dim_combine(DV, 1, 1);
const dim3 blocks_num_combine = {(unsigned)ntiles_dst, ncols1, ncols2};
flash_attn_stream_k_fixup_uniform<DV, ncols1, ncols2>
<<<blocks_num_combine, block_dim_combine, 0, main_stream>>>
((float *) KQV->data, dst_tmp_meta.ptr,
Q->ne[1], Q->ne[2], K->ne[2], nblocks_sk,
gqa_ratio, bpt, fd0, fd1, fd2);
} else if (ntiles_dst % blocks_num.x != 0) {
// General fixup for the cases where nblocks_stream_k < ntiles_dst.
const int total_work = ntiles_KV * ntiles_dst;
const uint3 fd_k_j_z_ne12 = init_fastdiv_values(ntiles_KV * ntiles_x * ntiles_z_gqa * K->ne[2]);
const uint3 fd_k_j_z = init_fastdiv_values(ntiles_KV * ntiles_x * ntiles_z_gqa);
const uint3 fd_k_j = init_fastdiv_values(ntiles_KV * ntiles_x);
const uint3 fd_k = init_fastdiv_values(ntiles_KV);
const dim3 block_dim_combine(DV, 1, 1);
const dim3 blocks_num_combine = {blocks_num.x, ncols1, ncols2};
flash_attn_stream_k_fixup<DV, ncols1, ncols2>
flash_attn_stream_k_fixup_general<DV, ncols1, ncols2>
<<<blocks_num_combine, block_dim_combine, 0, main_stream>>>
((float *) KQV->data, dst_tmp_meta.ptr, Q->ne[1], Q->ne[2], Q->ne[3], K->ne[1], K->ne[2], nbatch_fa);
((float *) KQV->data, dst_tmp_meta.ptr,
Q->ne[1], Q->ne[2], gqa_ratio, total_work,
fd_k_j_z_ne12, fd_k_j_z, fd_k_j, fd_k);
}
} else if (parallel_blocks > 1) {
const dim3 block_dim_combine(DV, 1, 1);
+76 -175
View File
@@ -82,7 +82,6 @@
#include <cstdlib>
#include <string>
#include <vector>
#include <unordered_set>
static_assert(sizeof(half) == sizeof(ggml_fp16_t), "wrong fp16 size");
@@ -2969,74 +2968,6 @@ static bool ggml_cuda_graph_check_compability(ggml_cgraph * cgraph) {
return use_cuda_graph;
}
static void ggml_cuda_graph_node_set_properties(ggml_cuda_graph_node_properties * props, ggml_tensor * node) {
memset(props, 0, sizeof(ggml_cuda_graph_node_properties));
props->node_data = node->data;
props->node_op = node->op;
props->node_type = node->type;
props->flags = node->flags;
for (int i = 0; i < GGML_MAX_DIMS; i++) {
props->ne[i] = node->ne[i];
props->nb[i] = node->nb[i];
}
for (int i = 0; i < GGML_MAX_SRC; i++) {
if (!node->src[i]) {
continue;
}
props->src_data[i] = node->src[i]->data;
}
memcpy(props->op_params, node->op_params, GGML_MAX_OP_PARAMS);
}
static bool ggml_cuda_graph_node_properties_match(ggml_tensor * node, ggml_cuda_graph_node_properties * props) {
if (node->data != props->node_data && node->op != GGML_OP_VIEW) {
return false;
}
if (node->op != props->node_op) {
return false;
}
if (node->type != props->node_type) {
return false;
}
for (int i = 0; i < GGML_MAX_DIMS; i++) {
if (node->ne[i] != props->ne[i]) {
return false;
}
if (node->nb[i] != props->nb[i]) {
return false;
}
}
if (node->op != GGML_OP_VIEW) {
for (int i = 0; i < GGML_MAX_SRC; i++) {
if (!node->src[i]) {
if (props->src_data[i] != nullptr) {
return false;
}
continue;
}
if (node->src[i]->data != props->src_data[i]) {
return false;
}
}
}
if (memcmp(props->op_params, node->op_params, GGML_MAX_OP_PARAMS) != 0) {
return false;
}
if ((node->flags & GGML_TENSOR_FLAG_COMPUTE) != (props->flags & GGML_TENSOR_FLAG_COMPUTE)) {
return false;
}
return true;
}
static const void * ggml_cuda_graph_get_key(ggml_cgraph * cgraph) {
return cgraph->nodes[0];
}
@@ -3048,52 +2979,18 @@ static bool ggml_cuda_graph_update_required(ggml_backend_cuda_context * cuda_ctx
ggml_cuda_graph * graph = cuda_ctx->cuda_graph(graph_key);
// Check if the graph size has changed
if (graph->props.size() != (size_t)cgraph->n_nodes) {
if ((int)graph->nodes_copy.size() != cgraph->n_nodes) {
res = true;
graph->props.resize(cgraph->n_nodes);
graph->nodes_copy.resize(cgraph->n_nodes);
}
// Loop over nodes in GGML graph to determine if CUDA graph update is required
// and store properties to allow this comparison for the next token
std::unordered_set<ggml_tensor *> seen_node;
std::vector<ggml_tensor *> srcs_extra;
for (int i = 0; i < cgraph->n_nodes; i++) {
bool props_match = true;
seen_node.insert(cgraph->nodes[i]);
if (!res) {
props_match = ggml_cuda_graph_node_properties_match(cgraph->nodes[i], &graph->props[i]);
}
if (!props_match) {
res = true;
}
ggml_cuda_graph_node_set_properties(&graph->props[i], cgraph->nodes[i]);
for (int src_idx = 0; src_idx < GGML_MAX_SRC; ++src_idx) {
ggml_tensor * src = cgraph->nodes[i]->src[src_idx];
if (src && seen_node.find(src) == seen_node.end()) {
srcs_extra.push_back(src);
if (memcmp(&graph->nodes_copy[i], cgraph->nodes[i], sizeof(ggml_tensor)) != 0) {
res = true;
}
}
}
if (graph->extra.size() != (size_t) srcs_extra.size()) {
res = true;
graph->extra.resize(srcs_extra.size());
}
for (size_t i = 0; i < srcs_extra.size(); ++i) {
bool props_match = true;
if (!res) {
props_match = ggml_cuda_graph_node_properties_match(srcs_extra[i], &graph->extra[i]);
}
if (!props_match) {
res = true;
}
ggml_cuda_graph_node_set_properties(&graph->extra[i], srcs_extra[i]);
memcpy(&graph->nodes_copy[i], cgraph->nodes[i], sizeof(ggml_tensor));
}
return res;
@@ -3308,6 +3205,71 @@ static bool ggml_cuda_topk_moe_fusion(const struct ggml_cgraph * cgraph, int nod
return true;
}
// returns whether the write (out) nodes overwrite the read nodes in operation
static bool ggml_cuda_check_fusion_memory_ranges(const ggml_cgraph * cgraph,
const int node_idx,
const int node_count,
const int * out_nodes,
const int out_count,
const bool is_topk_moe = false) {
auto nodes_overlap = [&](const ggml_tensor * a, const ggml_tensor * b) {
const int64_t a_start = (int64_t) a->data;
const int64_t a_end = a_start + ggml_backend_buft_get_alloc_size(a->buffer->buft, a);
const int64_t b_start = (int64_t) b->data;
const int64_t b_end = b_start + ggml_backend_buft_get_alloc_size(b->buffer->buft, b);
if ((b_start <= a_start && a_start < b_end) || (a_start <= b_start && b_start < a_end)) {
return true;
}
return false;
};
bool is_ok = true;
// exception for topk-moe, as each row is read entirely before writing
if (ggml_nrows(cgraph->nodes[node_idx]) == 1 && is_topk_moe) {
return true;
}
for (int i = 0; i < out_count; ++i) {
const ggml_tensor * dst = cgraph->nodes[out_nodes[i]];
for (int j = node_idx; j < node_idx + node_count; ++j) {
// Loop over all srcs of all nodes in the fusion. If the src overlaps
// the destination and the src is not an intermediate node that's being
// elided, then disable fusion.
for (int src_idx = 0; src_idx < GGML_MAX_SRC; ++src_idx) {
const ggml_tensor * src = cgraph->nodes[j]->src[src_idx];
if (!src || src->op == GGML_OP_NONE) {
continue;
}
if (nodes_overlap(dst, src)) {
bool found = false;
for (int k = node_idx; k < j; ++k) {
if (cgraph->nodes[k] == src) {
found = true;
break;
}
}
if (!found) {
is_ok = false;
break;
}
}
}
}
}
return is_ok;
}
static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph,
int node_idx,
std::initializer_list<enum ggml_op> ops,
@@ -3337,7 +3299,8 @@ static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph,
const ggml_tensor * glu = cgraph->nodes[node_idx + 4];
if (ggml_cuda_should_fuse_mul_mat(ffn_up, ffn_gate, glu, ffn_up_bias, ffn_gate_bias)) {
return true;
int out_nodes[] = { node_idx + 4 };
return ggml_cuda_check_fusion_memory_ranges(cgraph, node_idx, (int)ops.size(), out_nodes, 1);
}
}
@@ -3348,7 +3311,8 @@ static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph,
const ggml_tensor * glu = cgraph->nodes[node_idx + 2];
if (ggml_cuda_should_fuse_mul_mat(ffn_up, ffn_gate, glu)) {
return true;
int out_nodes[] = { node_idx + 2 };
return ggml_cuda_check_fusion_memory_ranges(cgraph, node_idx, (int)ops.size(), out_nodes, 1);
}
}
@@ -3474,69 +3438,6 @@ static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph,
return false;
}
// returns whether the write (out) nodes overwrite the read nodes in operation
static bool ggml_cuda_check_fusion_memory_ranges(ggml_cgraph * cgraph,
int node_idx,
int node_count,
int * out_nodes,
int out_count) {
auto nodes_overlap = [&](const ggml_tensor * a, const ggml_tensor * b) {
const int64_t a_start = (int64_t) a->data;
const int64_t a_end = a_start + ggml_nbytes(a);
const int64_t b_start = (int64_t) b->data;
const int64_t b_end = b_start + ggml_nbytes(b);
if ((b_start <= a_start && a_start < b_end) || (a_start <= b_start && b_start < a_end)) {
return true;
}
return false;
};
bool is_ok = true;
// for nrows=1, all fusion operations correctly read the src before writing dst or do it elementwise, so we should be ok
if (ggml_nrows(cgraph->nodes[node_idx]) == 1) {
return true;
}
for (int i = 0; i < out_count; ++i) {
const ggml_tensor * dst = cgraph->nodes[out_nodes[i]];
for (int j = node_idx; j < node_idx + node_count; ++j) {
// Loop over all srcs of all nodes in the fusion. If the src overlaps
// the destination and the src is not an intermediate node that's being
// elided, then disable fusion.
for (int src_idx = 0; src_idx < GGML_MAX_SRC; ++src_idx) {
const ggml_tensor * src = cgraph->nodes[j]->src[src_idx];
if (!src || src->op == GGML_OP_NONE) {
continue;
}
if (nodes_overlap(dst, src)) {
bool found = false;
for (int k = node_idx; k < j; ++k) {
if (cgraph->nodes[k] == src) {
found = true;
break;
}
}
if (!found) {
is_ok = false;
break;
}
}
}
}
}
return is_ok;
}
static void ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph * cgraph, const bool use_cuda_graph, const bool cuda_graph_update_required, const void * graph_key) {
bool graph_evaluated_or_captured = false;
@@ -3734,7 +3635,7 @@ static void ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context * cud
if (ggml_can_fuse_subgraph(cgraph, i, ops.size(), ops.data(), out_nodes, 2) &&
ggml_cuda_should_use_topk_moe(node, logits, weights, ids) &&
ggml_cuda_check_fusion_memory_ranges(cgraph, i, ops.size(), out_nodes, 2)) {
ggml_cuda_check_fusion_memory_ranges(cgraph, i, ops.size(), out_nodes, 2, /*is_topk_moe=*/ true)) {
ggml_cuda_op_topk_moe(*cuda_ctx, logits, weights, ids, clamp, scale, bias, args);
i += ops.size() - 1;
continue;
@@ -3750,7 +3651,7 @@ static void ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context * cud
int out_nodes[2] = { i + 1, i + 5 };
if (ggml_can_fuse_subgraph(cgraph, i, ops.size(), ops.data(), out_nodes, 2) &&
ggml_cuda_should_use_topk_moe(softmax, logits, weights, ids) &&
ggml_cuda_check_fusion_memory_ranges(cgraph, i, ops.size(), out_nodes, 2)) {
ggml_cuda_check_fusion_memory_ranges(cgraph, i, ops.size(), out_nodes, 2, /*is_topk_moe=*/ true)) {
ggml_cuda_op_topk_moe(*cuda_ctx, logits, weights, ids, clamp, scale, bias, args);
i += ops.size() - 1;
continue;
+27 -10
View File
@@ -386,17 +386,25 @@ static __device__ __forceinline__ void vec_dot_q4_0_q8_1_dp4a(
#pragma unroll
for (int i0 = 0; i0 < mmq_y; i0 += warp_size) {
const int i = i0 + threadIdx.x;
const int kyqs = QI8_1 * ((k01/2) / (QI8_1/2)) + (k01/2) % (QI8_1/2);
int u[2*VDR_Q4_0_Q8_1_MMQ];
#pragma unroll
for (int l = 0; l < VDR_Q4_0_Q8_1_MMQ; ++l) {
u[2*l+0] = y_qs[j*MMQ_TILE_Y_K + kyqs + l];
u[2*l+1] = y_qs[j*MMQ_TILE_Y_K + kyqs + (l + QI4_0)];
constexpr int max_cpy = ggml_cuda_get_max_cpy_bytes();
constexpr int mcpy_int = max_cpy / sizeof(int);
static_assert(VDR_Q4_0_Q8_1_MMQ == 4, "bad VDR_Q4_0_Q8_1_MMQ");
int tmp0[4], tmp1[4];
#pragma unroll
for (int l0 = 0; l0 < 4 / mcpy_int; ++l0) {
ggml_cuda_memcpy_1<max_cpy>(tmp0 + l0 * mcpy_int, &y_qs[j*MMQ_TILE_Y_K + kyqs + l0 * mcpy_int] );
ggml_cuda_memcpy_1<max_cpy>(tmp1 + l0 * mcpy_int, &y_qs[j*MMQ_TILE_Y_K + kyqs + QI4_0 + l0 * mcpy_int]);
}
u[0]=tmp0[0]; u[2]=tmp0[1]; u[4]=tmp0[2]; u[6]=tmp0[3];
u[1]=tmp1[0]; u[3]=tmp1[1]; u[5]=tmp1[2]; u[7]=tmp1[3];
sum[j0/nwarps*mmq_y/warp_size + i0/warp_size] += vec_dot_q4_0_q8_1_impl<VDR_Q4_0_Q8_1_MMQ>
(&x_qs[i*(MMQ_TILE_NE_K + 1) + k0/QR4_0], u,
x_df[i*(MMQ_TILE_NE_K/QI4_0) + i/QI4_0 + k0/(QR4_0*QI4_0)], y_ds[j*MMQ_TILE_Y_K + k01/QI8_1]);
@@ -489,17 +497,25 @@ static __device__ __forceinline__ void vec_dot_q4_1_q8_1_dp4a(
#pragma unroll
for (int i0 = 0; i0 < mmq_y; i0 += warp_size) {
const int i = i0 + threadIdx.x;
const int kyqs = QI8_1 * ((k01/2) / (QI8_1/2)) + (k01/2) % (QI8_1/2);
int u[2*VDR_Q4_1_Q8_1_MMQ];
#pragma unroll
for (int l = 0; l < VDR_Q4_1_Q8_1_MMQ; ++l) {
u[2*l+0] = y_qs[j*MMQ_TILE_Y_K + kyqs + l];
u[2*l+1] = y_qs[j*MMQ_TILE_Y_K + kyqs + (l + QI4_1)];
constexpr int max_cpy = ggml_cuda_get_max_cpy_bytes();
constexpr int mcpy_int = max_cpy / sizeof(int);
static_assert(VDR_Q4_0_Q8_1_MMQ == 4, "bad VDR_Q4_0_Q8_1_MMQ");
int tmp0[4], tmp1[4];
#pragma unroll
for (int l0 = 0; l0 < 4 / mcpy_int; ++l0) {
ggml_cuda_memcpy_1<max_cpy>(tmp0 + l0 * mcpy_int, &y_qs[j*MMQ_TILE_Y_K + kyqs + l0 * mcpy_int] );
ggml_cuda_memcpy_1<max_cpy>(tmp1 + l0 * mcpy_int, &y_qs[j*MMQ_TILE_Y_K + kyqs + QI4_1 + l0 * mcpy_int]);
}
u[0]=tmp0[0]; u[2]=tmp0[1]; u[4]=tmp0[2]; u[6]=tmp0[3];
u[1]=tmp1[0]; u[3]=tmp1[1]; u[5]=tmp1[2]; u[7]=tmp1[3];
sum[j0/nwarps*mmq_y/warp_size + i0/warp_size] += vec_dot_q4_1_q8_1_impl<VDR_Q4_1_Q8_1_MMQ>
(&x_qs[i*(MMQ_TILE_NE_K + 1) + k0/QR4_1], u,
x_dm[i*(MMQ_TILE_NE_K/QI4_1) + i/QI4_1 + k0/(QR4_1*QI4_1)], y_ds[j*MMQ_TILE_Y_K + k01/QI8_1]);
@@ -4170,3 +4186,4 @@ void ggml_cuda_op_mul_mat_q(
const int64_t src1_padded_row_size, cudaStream_t stream);
bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t n_experts);
+15 -3
View File
@@ -164,6 +164,12 @@ static void quicksort_values_indices_desc(float * values, int32_t * indices, int
if (i < right) quicksort_values_indices_desc(values, indices, i, right);
}
// LUT for ramp initialization of argsort output (first 32 members)
int32_t argosrt_ramp_lut[32] __attribute__((aligned(VLEN))) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
};
static void htp_argsort_f32(unsigned int n, unsigned int i, void * data) {
struct htp_argsort_context * actx = (struct htp_argsort_context *)data;
struct htp_ops_context * octx = actx->octx;
@@ -205,8 +211,12 @@ static void htp_argsort_f32(unsigned int n, unsigned int i, void * data) {
// Padded to 128 bytes.
size_t values_size = hex_round_up(ne00 * sizeof(float), 128);
size_t num_vec_ind_values = hmx_ceil_div(ne00, VLEN/(sizeof(int32_t)));
float * values_buf = (float *) spad;
int32_t * indices_buf = (int32_t *) (spad + values_size);
HVX_Vector * indices_buf_vec = (HVX_Vector *) (spad + values_size);
const HVX_Vector ind_init_vec = *(HVX_Vector *)argosrt_ramp_lut;
const HVX_Vector ind_diff_vec = Q6_V_vsplat_R(32);
for (uint32_t r = start_row; r < end_row; r++) {
uint32_t src_offset = r * nb01;
@@ -218,9 +228,11 @@ static void htp_argsort_f32(unsigned int n, unsigned int i, void * data) {
hex_l2fetch(src_ptr, ne00 * sizeof(float), ne00 * sizeof(float), 1);
hvx_copy_f32_au((uint8_t*)values_buf, src_ptr, ne00);
// Initialize indices
for (uint32_t j = 0; j < ne00; j++) {
indices_buf[j] = j;
// Initialize indices - Start with values 0..31, add 32 for additional vec iterations
HVX_Vector curr_ind_vec = ind_init_vec;
for (uint32_t j_vec = 0; j_vec < num_vec_ind_values; j_vec++) {
indices_buf_vec[j_vec] = curr_ind_vec;
curr_ind_vec = Q6_Vw_vadd_VwVw(curr_ind_vec, ind_diff_vec);
}
// Sort values and mirror swaps to indices
+10
View File
@@ -736,6 +736,11 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_mul_mv(ggml_meta
suffix = ne00 % 4 == 0 ? "_4" : "";
}
} break;
case GGML_TYPE_Q1_0:
{
nsg = N_SG_Q1_0;
nr0 = N_R0_Q1_0;
} break;
case GGML_TYPE_Q4_0:
{
nsg = N_SG_Q4_0;
@@ -948,6 +953,11 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_mul_mv_id(ggml_m
smem = 32*sizeof(float)*nr0;
suffix = ne00 % 4 == 0 ? "_4" : "";
} break;
case GGML_TYPE_Q1_0:
{
nsg = N_SG_Q1_0;
nr0 = N_R0_Q1_0;
} break;
case GGML_TYPE_Q4_0:
{
nsg = N_SG_Q4_0;
+2
View File
@@ -1184,6 +1184,7 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
case GGML_TYPE_F16:
case GGML_TYPE_BF16:
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
@@ -1210,6 +1211,7 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
default:
return false;
}
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
+3
View File
@@ -8,6 +8,9 @@
//
// TODO: for optimal performance, become function of the device and work size
#define N_R0_Q1_0 8
#define N_SG_Q1_0 2
#define N_R0_Q4_0 4
#define N_SG_Q4_0 2
+1
View File
@@ -2047,6 +2047,7 @@ int ggml_metal_op_mul_mat(ggml_metal_op_t ctx, int idx) {
op->src[0]->type == GGML_TYPE_F32 || // TODO: helper function
op->src[0]->type == GGML_TYPE_F16 ||
op->src[0]->type == GGML_TYPE_BF16 ||
op->src[0]->type == GGML_TYPE_Q1_0 ||
op->src[0]->type == GGML_TYPE_Q4_0 ||
op->src[0]->type == GGML_TYPE_Q4_1 ||
op->src[0]->type == GGML_TYPE_Q5_0 ||
+187
View File
@@ -118,6 +118,56 @@ void dequantize_bf16_t4(device const bfloat4 * src, short il, thread type4 & reg
}
#endif
template <typename type4x4>
void dequantize_q1_0(device const block_q1_0 * xb, short il, thread type4x4 & reg) {
device const uint8_t * qs = xb->qs;
const float d = xb->d;
const float neg_d = -d;
const int byte_offset = il * 2; // il*16 bits = il*2 bytes
const uint8_t b0 = qs[byte_offset];
const uint8_t b1 = qs[byte_offset + 1];
float4x4 reg_f;
reg_f[0][0] = select(neg_d, d, bool(b0 & 0x01));
reg_f[0][1] = select(neg_d, d, bool(b0 & 0x02));
reg_f[0][2] = select(neg_d, d, bool(b0 & 0x04));
reg_f[0][3] = select(neg_d, d, bool(b0 & 0x08));
reg_f[1][0] = select(neg_d, d, bool(b0 & 0x10));
reg_f[1][1] = select(neg_d, d, bool(b0 & 0x20));
reg_f[1][2] = select(neg_d, d, bool(b0 & 0x40));
reg_f[1][3] = select(neg_d, d, bool(b0 & 0x80));
reg_f[2][0] = select(neg_d, d, bool(b1 & 0x01));
reg_f[2][1] = select(neg_d, d, bool(b1 & 0x02));
reg_f[2][2] = select(neg_d, d, bool(b1 & 0x04));
reg_f[2][3] = select(neg_d, d, bool(b1 & 0x08));
reg_f[3][0] = select(neg_d, d, bool(b1 & 0x10));
reg_f[3][1] = select(neg_d, d, bool(b1 & 0x20));
reg_f[3][2] = select(neg_d, d, bool(b1 & 0x40));
reg_f[3][3] = select(neg_d, d, bool(b1 & 0x80));
reg = (type4x4) reg_f;
}
template <typename type4>
void dequantize_q1_0_t4(device const block_q1_0 * xb, short il, thread type4 & reg) {
const float d = xb->d;
const float neg_d = -d;
const int base = il * 4;
const uint8_t byte = xb->qs[base / 8];
const int s = base % 8;
float4 reg_f;
reg_f[0] = select(neg_d, d, bool((byte >> (s )) & 1));
reg_f[1] = select(neg_d, d, bool((byte >> (s + 1)) & 1));
reg_f[2] = select(neg_d, d, bool((byte >> (s + 2)) & 1));
reg_f[3] = select(neg_d, d, bool((byte >> (s + 3)) & 1));
reg = (type4) reg_f;
}
template <typename type4x4>
void dequantize_q4_0(device const block_q4_0 * xb, short il, thread type4x4 & reg) {
device const uint16_t * qs = ((device const uint16_t *)xb + 1);
@@ -152,6 +202,23 @@ void dequantize_q4_0_t4(device const block_q4_0 * xb, short il, thread type4 & r
}
}
void quantize_q1_0(device const float * src, device block_q1_0 & dst) {
float sum_abs = 0.0f;
for (int j = 0; j < QK1_0; j++) {
sum_abs += fabs(src[j]);
}
dst.d = sum_abs / QK1_0;
for (int j = 0; j < QK1_0 / 8; j++) {
dst.qs[j] = 0;
}
for (int j = 0; j < QK1_0; j++) {
if (src[j] >= 0.0f) {
dst.qs[j / 8] |= (1 << (j % 8));
}
}
}
void quantize_q4_0(device const float * src, device block_q4_0 & dst) {
#pragma METAL fp math_mode(safe)
float amax = 0.0f; // absolute max
@@ -3116,6 +3183,35 @@ kernel void kernel_group_norm_f32(
}
}
// Q1_0 dot product: dot = d * (2 * Σ(yl[i] where bit=1) - sumy)
inline float block_q_n_dot_y(device const block_q1_0 * qb_curr, float sumy, thread float * yl, int il) {
device const uint8_t * qs = qb_curr->qs + il / 8;
const uint8_t b0 = qs[0];
const uint8_t b1 = qs[1];
float acc = 0.0f;
acc += select(0.0f, yl[ 0], bool(b0 & 0x01));
acc += select(0.0f, yl[ 1], bool(b0 & 0x02));
acc += select(0.0f, yl[ 2], bool(b0 & 0x04));
acc += select(0.0f, yl[ 3], bool(b0 & 0x08));
acc += select(0.0f, yl[ 4], bool(b0 & 0x10));
acc += select(0.0f, yl[ 5], bool(b0 & 0x20));
acc += select(0.0f, yl[ 6], bool(b0 & 0x40));
acc += select(0.0f, yl[ 7], bool(b0 & 0x80));
acc += select(0.0f, yl[ 8], bool(b1 & 0x01));
acc += select(0.0f, yl[ 9], bool(b1 & 0x02));
acc += select(0.0f, yl[10], bool(b1 & 0x04));
acc += select(0.0f, yl[11], bool(b1 & 0x08));
acc += select(0.0f, yl[12], bool(b1 & 0x10));
acc += select(0.0f, yl[13], bool(b1 & 0x20));
acc += select(0.0f, yl[14], bool(b1 & 0x40));
acc += select(0.0f, yl[15], bool(b1 & 0x80));
return qb_curr->d * (2.0f * acc - sumy);
}
// function for calculate inner product between half a q4_0 block and 16 floats (yl), sumy is SUM(yl[i])
// il indicates where the q4 quants begin (0 or QK4_0/4)
// we assume that the yl's have been multiplied with the appropriate scale factor
@@ -3337,6 +3433,85 @@ void mul_vec_q_n_f32_impl(
}
}
template<int nr0, typename args_t>
void kernel_mul_mv_q1_0_f32_impl(
args_t args,
device const char * src0,
device const char * src1,
device char * dst,
threadgroup char * shmem,
uint3 tgpig,
ushort tiisg,
ushort sgitg) {
const short NSG = FC_mul_mv_nsg;
const int nb = args.ne00/QK1_0;
const int r0 = tgpig.x;
const int r1 = tgpig.y;
const int im = tgpig.z;
const int first_row = (r0 * NSG + sgitg) * nr0;
const uint i12 = im%args.ne12;
const uint i13 = im/args.ne12;
const uint64_t offset1 = r1*args.nb11 + (i12)*args.nb12 + (i13)*args.nb13;
device const float * y = (device const float *) (src1 + offset1);
device const block_q1_0 * ax[nr0];
for (int row = 0; row < nr0; ++row) {
const uint64_t offset0 = (first_row + row)*args.nb01 + (i12/args.r2)*args.nb02 + (i13/args.r3)*args.nb03;
ax[row] = (device const block_q1_0 *) ((device char *) src0 + offset0);
}
float yl[16];
float sumf[nr0] = {0.f};
const short ix = (tiisg/8);
const short il = (tiisg%8)*16;
device const float * yb = y + ix*QK1_0 + il;
for (int ib = ix; ib < nb; ib += N_SIMDWIDTH/8) {
float sumy = 0.f;
FOR_UNROLL (short i = 0; i < 16; i++) {
yl[i] = yb[i];
sumy += yb[i];
}
FOR_UNROLL (short row = 0; row < nr0; row++) {
sumf[row] += block_q_n_dot_y(ax[row] + ib, sumy, yl, il);
}
yb += QK1_0 * (N_SIMDWIDTH/8);
}
device float * dst_f32 = (device float *) dst + (uint64_t)im*args.ne0*args.ne1 + (uint64_t)r1*args.ne0;
for (int row = 0; row < nr0; ++row) {
const float tot = simd_sum(sumf[row]);
if (tiisg == 0 && first_row + row < args.ne01) {
dst_f32[first_row + row] = tot;
}
}
}
[[host_name("kernel_mul_mv_q1_0_f32")]]
kernel void kernel_mul_mv_q1_0_f32(
constant ggml_metal_kargs_mul_mv & args,
device const char * src0,
device const char * src1,
device char * dst,
uint3 tgpig[[threadgroup_position_in_grid]],
ushort tiisg[[thread_index_in_simdgroup]],
ushort sgitg[[simdgroup_index_in_threadgroup]]) {
kernel_mul_mv_q1_0_f32_impl<N_R0_Q1_0, constant ggml_metal_kargs_mul_mv &>(args, src0, src1, dst, nullptr, tgpig, tiisg, sgitg);
}
kernel void kernel_mul_mv_q4_0_f32(
constant ggml_metal_kargs_mul_mv & args,
device const char * src0,
@@ -3729,6 +3904,11 @@ template [[host_name("kernel_mul_mv_ext_bf16_f32_r1_4")]] kernel mul_mv_ext_q4
template [[host_name("kernel_mul_mv_ext_bf16_f32_r1_5")]] kernel mul_mv_ext_q4_f32_t kernel_mul_mv_ext_q4_f32_disp<5, bfloat4, 4, dequantize_bf16_t4>;
#endif
template [[host_name("kernel_mul_mv_ext_q1_0_f32_r1_2")]] kernel mul_mv_ext_q4_f32_t kernel_mul_mv_ext_q4_f32_disp<2, block_q1_0, 128, dequantize_q1_0_t4>;
template [[host_name("kernel_mul_mv_ext_q1_0_f32_r1_3")]] kernel mul_mv_ext_q4_f32_t kernel_mul_mv_ext_q4_f32_disp<3, block_q1_0, 128, dequantize_q1_0_t4>;
template [[host_name("kernel_mul_mv_ext_q1_0_f32_r1_4")]] kernel mul_mv_ext_q4_f32_t kernel_mul_mv_ext_q4_f32_disp<4, block_q1_0, 128, dequantize_q1_0_t4>;
template [[host_name("kernel_mul_mv_ext_q1_0_f32_r1_5")]] kernel mul_mv_ext_q4_f32_t kernel_mul_mv_ext_q4_f32_disp<5, block_q1_0, 128, dequantize_q1_0_t4>;
template [[host_name("kernel_mul_mv_ext_q4_0_f32_r1_2")]] kernel mul_mv_ext_q4_f32_t kernel_mul_mv_ext_q4_f32_disp<2, block_q4_0, 32, dequantize_q4_0_t4>;
template [[host_name("kernel_mul_mv_ext_q4_0_f32_r1_3")]] kernel mul_mv_ext_q4_f32_t kernel_mul_mv_ext_q4_f32_disp<3, block_q4_0, 32, dequantize_q4_0_t4>;
template [[host_name("kernel_mul_mv_ext_q4_0_f32_r1_4")]] kernel mul_mv_ext_q4_f32_t kernel_mul_mv_ext_q4_f32_disp<4, block_q4_0, 32, dequantize_q4_0_t4>;
@@ -7133,6 +7313,7 @@ kernel void kernel_cpy_f32_q(
typedef decltype(kernel_cpy_f32_q<QK8_0, block_q8_0, quantize_q8_0>) cpy_f_q_t;
template [[host_name("kernel_cpy_f32_q8_0")]] kernel cpy_f_q_t kernel_cpy_f32_q<QK8_0, block_q8_0, quantize_q8_0>;
template [[host_name("kernel_cpy_f32_q1_0")]] kernel cpy_f_q_t kernel_cpy_f32_q<QK1_0, block_q1_0, quantize_q1_0>;
template [[host_name("kernel_cpy_f32_q4_0")]] kernel cpy_f_q_t kernel_cpy_f32_q<QK4_0, block_q4_0, quantize_q4_0>;
template [[host_name("kernel_cpy_f32_q4_1")]] kernel cpy_f_q_t kernel_cpy_f32_q<QK4_1, block_q4_1, quantize_q4_1>;
template [[host_name("kernel_cpy_f32_q5_0")]] kernel cpy_f_q_t kernel_cpy_f32_q<QK5_0, block_q5_0, quantize_q5_0>;
@@ -7173,12 +7354,14 @@ kernel void kernel_cpy_q_f32(
typedef decltype(kernel_cpy_q_f32<float4x4, block_q4_0, 2, dequantize_q4_0>) cpy_q_f_t;
template [[host_name("kernel_cpy_q1_0_f32")]] kernel cpy_q_f_t kernel_cpy_q_f32<float4x4, block_q1_0, 8, dequantize_q1_0>;
template [[host_name("kernel_cpy_q4_0_f32")]] kernel cpy_q_f_t kernel_cpy_q_f32<float4x4, block_q4_0, 2, dequantize_q4_0>;
template [[host_name("kernel_cpy_q4_1_f32")]] kernel cpy_q_f_t kernel_cpy_q_f32<float4x4, block_q4_1, 2, dequantize_q4_1>;
template [[host_name("kernel_cpy_q5_0_f32")]] kernel cpy_q_f_t kernel_cpy_q_f32<float4x4, block_q5_0, 2, dequantize_q5_0>;
template [[host_name("kernel_cpy_q5_1_f32")]] kernel cpy_q_f_t kernel_cpy_q_f32<float4x4, block_q5_1, 2, dequantize_q5_1>;
template [[host_name("kernel_cpy_q8_0_f32")]] kernel cpy_q_f_t kernel_cpy_q_f32<float4x4, block_q8_0, 2, dequantize_q8_0>;
template [[host_name("kernel_cpy_q1_0_f16")]] kernel cpy_q_f_t kernel_cpy_q_f32<half4x4, block_q1_0, 8, dequantize_q1_0>;
template [[host_name("kernel_cpy_q4_0_f16")]] kernel cpy_q_f_t kernel_cpy_q_f32<half4x4, block_q4_0, 2, dequantize_q4_0>;
template [[host_name("kernel_cpy_q4_1_f16")]] kernel cpy_q_f_t kernel_cpy_q_f32<half4x4, block_q4_1, 2, dequantize_q4_1>;
template [[host_name("kernel_cpy_q5_0_f16")]] kernel cpy_q_f_t kernel_cpy_q_f32<half4x4, block_q5_0, 2, dequantize_q5_0>;
@@ -9776,6 +9959,7 @@ template [[host_name("kernel_get_rows_bf16")]] kernel get_rows_f_t kernel_get_ro
typedef decltype(kernel_get_rows_q<block_q4_0, 2, dequantize_q4_0>) get_rows_q_t;
template [[host_name("kernel_get_rows_q1_0")]] kernel get_rows_q_t kernel_get_rows_q<block_q1_0, 8, dequantize_q1_0>;
template [[host_name("kernel_get_rows_q4_0")]] kernel get_rows_q_t kernel_get_rows_q<block_q4_0, 2, dequantize_q4_0>;
template [[host_name("kernel_get_rows_q4_1")]] kernel get_rows_q_t kernel_get_rows_q<block_q4_1, 2, dequantize_q4_1>;
template [[host_name("kernel_get_rows_q5_0")]] kernel get_rows_q_t kernel_get_rows_q<block_q5_0, 2, dequantize_q5_0>;
@@ -9838,6 +10022,7 @@ template [[host_name("kernel_mul_mm_f16_f32")]] kernel mul_mm_t kernel_mul_m
#if defined(GGML_METAL_HAS_BF16)
template [[host_name("kernel_mul_mm_bf16_f32")]] kernel mul_mm_t kernel_mul_mm<bfloat, bfloat4x4, simdgroup_bfloat8x8, bfloat, bfloat2x4, simdgroup_bfloat8x8, bfloat4x4, 1, dequantize_bf16, bfloat, bfloat4x4, float, float2x4>;
#endif
template [[host_name("kernel_mul_mm_q1_0_f32")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, half, half2x4, simdgroup_half8x8, block_q1_0, 8, dequantize_q1_0, float, float4x4, float, float2x4>;
template [[host_name("kernel_mul_mm_q4_0_f32")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, half, half2x4, simdgroup_half8x8, block_q4_0, 2, dequantize_q4_0, float, float4x4, float, float2x4>;
template [[host_name("kernel_mul_mm_q4_1_f32")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, half, half2x4, simdgroup_half8x8, block_q4_1, 2, dequantize_q4_1, float, float4x4, float, float2x4>;
template [[host_name("kernel_mul_mm_q5_0_f32")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, half, half2x4, simdgroup_half8x8, block_q5_0, 2, dequantize_q5_0, float, float4x4, float, float2x4>;
@@ -9861,6 +10046,7 @@ template [[host_name("kernel_mul_mm_iq4_xs_f32")]] kernel mul_mm_t kernel_mul_m
template [[host_name("kernel_mul_mm_f32_f16")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, half, half2x4, simdgroup_half8x8, float4x4, 1, dequantize_f32, float, float4x4, half, half2x4>;
template [[host_name("kernel_mul_mm_f16_f16")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, half, half2x4, simdgroup_half8x8, half4x4, 1, dequantize_f16, half, half4x4, half, half2x4>;
template [[host_name("kernel_mul_mm_q1_0_f16")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, half, half2x4, simdgroup_half8x8, block_q1_0, 8, dequantize_q1_0, float, float4x4, half, half2x4>;
template [[host_name("kernel_mul_mm_q4_0_f16")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, half, half2x4, simdgroup_half8x8, block_q4_0, 2, dequantize_q4_0, float, float4x4, half, half2x4>;
template [[host_name("kernel_mul_mm_q4_1_f16")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, half, half2x4, simdgroup_half8x8, block_q4_1, 2, dequantize_q4_1, float, float4x4, half, half2x4>;
template [[host_name("kernel_mul_mm_q5_0_f16")]] kernel mul_mm_t kernel_mul_mm<half, half4x4, simdgroup_half8x8, half, half2x4, simdgroup_half8x8, block_q5_0, 2, dequantize_q5_0, float, float4x4, half, half2x4>;
@@ -10070,6 +10256,7 @@ template [[host_name("kernel_mul_mv_id_bf16_f32_4")]] kernel kernel_mul_mv_id_4
template [[host_name("kernel_mul_mv_id_q8_0_f32")]] kernel kernel_mul_mv_id_t kernel_mul_mv_id<mmv_fn<kernel_mul_mv_q8_0_f32_impl<N_R0_Q8_0>>>;
template [[host_name("kernel_mul_mv_id_q1_0_f32")]] kernel kernel_mul_mv_id_t kernel_mul_mv_id<mmv_fn<kernel_mul_mv_q1_0_f32_impl<N_R0_Q1_0>>>;
template [[host_name("kernel_mul_mv_id_q4_0_f32")]] kernel kernel_mul_mv_id_t kernel_mul_mv_id<mmv_fn<mul_vec_q_n_f32_impl<block_q4_0, N_R0_Q4_0>>>;
template [[host_name("kernel_mul_mv_id_q4_1_f32")]] kernel kernel_mul_mv_id_t kernel_mul_mv_id<mmv_fn<mul_vec_q_n_f32_impl<block_q4_1, N_R0_Q4_1>>>;
template [[host_name("kernel_mul_mv_id_q5_0_f32")]] kernel kernel_mul_mv_id_t kernel_mul_mv_id<mmv_fn<mul_vec_q_n_f32_impl<block_q5_0, N_R0_Q5_0>>>;
+1
View File
@@ -589,6 +589,7 @@ void ggml_opt_free(ggml_opt_context_t opt_ctx) {
ggml_backend_buffer_free(opt_ctx->buf_cpu);
ggml_free(opt_ctx->ctx_static);
ggml_free(opt_ctx->ctx_cpu);
ggml_free(opt_ctx->ctx_copy);
delete opt_ctx;
}
+75
View File
@@ -32,6 +32,41 @@ static inline int best_index_int8(int n, const int8_t * val, float x) {
return x - val[mu-1] < val[mu] - x ? mu-1 : mu;
}
// reference implementation for deterministic creation of model files
void quantize_row_q1_0_ref(const float * GGML_RESTRICT x, block_q1_0 * GGML_RESTRICT y, int64_t k) {
static const int qk = QK1_0;
assert(k % qk == 0);
const int nb = k / qk;
for (int i = 0; i < nb; i++) {
float sum_abs = 0.0f;
for (int j = 0; j < qk; j++) {
sum_abs += fabsf(x[i*qk + j]);
}
const float d = sum_abs / qk;
y[i].d = GGML_FP32_TO_FP16(d);
// Clear all bits first
for (int j = 0; j < qk / 8; ++j) {
y[i].qs[j] = 0;
}
// Just store sign of each weight directly (no normalization)
for (int j = 0; j < qk; ++j) {
const int bit_index = j;
const int byte_index = bit_index / 8;
const int bit_offset = bit_index % 8;
if (x[i*qk + j] >= 0.0f) {
y[i].qs[byte_index] |= (1 << bit_offset);
}
}
}
}
// reference implementation for deterministic creation of model files
void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k) {
static const int qk = QK4_0;
@@ -339,6 +374,26 @@ void quantize_row_nvfp4_ref(const float * GGML_RESTRICT x, block_nvfp4 * GGML_RE
}
}
void dequantize_row_q1_0(const block_q1_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k) {
static const int qk = QK1_0;
assert(k % qk == 0);
const int nb = k / qk;
for (int i = 0; i < nb; i++) {
const float d = GGML_FP16_TO_FP32(x[i].d);
const float neg_d = -d;
for (int j = 0; j < qk; ++j) {
const int byte_index = j / 8;
const int bit_offset = j % 8;
const uint8_t bit = (x[i].qs[byte_index] >> bit_offset) & 1;
y[i*qk + j] = bit ? d : neg_d;
}
}
}
void dequantize_row_q4_0(const block_q4_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k) {
static const int qk = QK4_0;
@@ -1978,6 +2033,22 @@ static void quantize_row_q4_0_impl(const float * GGML_RESTRICT x, block_q4_0 * G
}
}
size_t quantize_q1_0(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrow, int64_t n_per_row, const float * quant_weights) {
if (!quant_weights) {
quantize_row_q1_0_ref(src, dst, (int64_t)nrow*n_per_row);
return nrow * ggml_row_size(GGML_TYPE_Q1_0, n_per_row);
}
size_t row_size = ggml_row_size(GGML_TYPE_Q1_0, n_per_row);
char * qrow = (char *)dst;
for (int64_t row = 0; row < nrow; ++row) {
quantize_row_q1_0_ref(src, (block_q1_0*)qrow, n_per_row);
src += n_per_row;
qrow += row_size;
}
return nrow * row_size;
}
size_t quantize_q4_0(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrow, int64_t n_per_row, const float * quant_weights) {
if (!quant_weights) {
quantize_row_q4_0_ref(src, dst, (int64_t)nrow*n_per_row);
@@ -5286,6 +5357,10 @@ bool ggml_validate_row_data(enum ggml_type type, const void * data, size_t nbyte
}
}
} break;
case GGML_TYPE_Q1_0:
{
VALIDATE_ROW_DATA_D_F16_IMPL(block_q1_0, data, nb);
} break;
case GGML_TYPE_Q4_0:
{
VALIDATE_ROW_DATA_D_F16_IMPL(block_q4_0, data, nb);
+3
View File
@@ -14,6 +14,7 @@ extern "C" {
// NOTE: these functions are defined as GGML_API because they used by the CPU backend
// Quantization
GGML_API void quantize_row_q1_0_ref(const float * GGML_RESTRICT x, block_q1_0 * GGML_RESTRICT y, int64_t k);
GGML_API void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k);
GGML_API void quantize_row_q4_1_ref(const float * GGML_RESTRICT x, block_q4_1 * GGML_RESTRICT y, int64_t k);
GGML_API void quantize_row_q5_0_ref(const float * GGML_RESTRICT x, block_q5_0 * GGML_RESTRICT y, int64_t k);
@@ -41,6 +42,7 @@ GGML_API void quantize_row_iq3_s_ref (const float * GGML_RESTRICT x, block_iq3_
GGML_API void quantize_row_iq2_s_ref (const float * GGML_RESTRICT x, block_iq2_s * GGML_RESTRICT y, int64_t k);
// Dequantization
GGML_API void dequantize_row_q1_0(const block_q1_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_q4_0(const block_q4_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_q4_1(const block_q4_1 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_q5_0(const block_q5_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
@@ -90,6 +92,7 @@ GGML_API size_t quantize_q3_K(const float * GGML_RESTRICT src, void * GGML_RESTR
GGML_API size_t quantize_q4_K(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_q5_K(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_q6_K(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_q1_0(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_q4_0(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_q4_1(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_q5_0(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst, int64_t nrows, int64_t n_per_row, const float * imatrix);
+16
View File
@@ -143,6 +143,22 @@ static __dpct_inline__ void dequantize_q5_1(const void *vx, const int64_t ib,
#endif // GGML_SYCL_F16
}
static __dpct_inline__ void dequantize_q8_0_reorder(const void *d_ptr, const int64_t ib, const void *qs,
const int iqs, dfloat2 &v) {
const dfloat d = (const dfloat)*((const sycl::half*)d_ptr + ib);
v.x() = ((const int8_t *)qs)[iqs + 0];
v.y() = ((const int8_t *)qs)[iqs + 1];
#ifdef GGML_SYCL_F16
v.s0() *= d;
v.s1() *= d;
#else
v.x() *= d;
v.y() *= d;
#endif // GGML_SYCL_F16
}
static __dpct_inline__ void dequantize_q8_0(const void *vx, const int64_t ib,
const int iqs, dfloat2 &v) {
const block_q8_0 * x = (const block_q8_0 *) vx;
+103 -1
View File
@@ -972,6 +972,103 @@ static void dequantize_mul_mat_vec_q5_1_sycl(const void *vx, const dfloat *y,
}
}
static void dequantize_mul_mat_vec_q8_0_sycl_reorder(const void *vx, const dfloat *y,
float *dst, const int ncols,
const int nrows,
dpct::queue_ptr stream) {
GGML_ASSERT(ncols % GGML_SYCL_DMMV_X == 0);
const int block_num_y = (nrows + GGML_SYCL_MMV_Y - 1) / GGML_SYCL_MMV_Y;
const sycl::range<3> block_nums(1, 1, block_num_y);
const sycl::range<3> block_dims(1, GGML_SYCL_MMV_Y, WARP_SIZE);
{
dpct::has_capability_or_fail(stream->get_device(),
{sycl::aspect::fp16});
stream->parallel_for(
sycl::nd_range<3>(block_nums * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
// Q8_0 reorder layout: [all qs (ncols*nrows bytes)][all d values]
// Cannot reuse dequantize_mul_mat_vec_reorder template because it has
// Q4_0-specific constants hardcoded (d_ptr offset and qs stride).
const int row = item_ct1.get_group(2) * item_ct1.get_local_range(1) +
item_ct1.get_local_id(1);
if (row >= nrows) return;
const int tid = item_ct1.get_local_id(2);
const int iter_stride = 8*2*GGML_SYCL_DMMV_X;
const int vals_per_iter = iter_stride / WARP_SIZE;
const int ncols_left = ncols % (QK8_0*WARP_SIZE);
const int ncols_align = ncols - ncols_left;
#ifdef GGML_SYCL_F16
sycl::half2 tmp = {0.0f, 0.0f};
#else
float tmp = 0.0f;
#endif
const char *d_ptr = (const char*)vx + ncols*nrows; // d after all qs
int i = 0;
for (i = 0; i < ncols_align; i += iter_stride) {
const int col = i + vals_per_iter*tid;
const int ib = (row*ncols + col)/QK8_0;
const int iqs = col % QK8_0;
#pragma unroll
for (int j = 0; j < vals_per_iter; j += 2) {
dfloat2 v;
dequantize_q8_0_reorder((const void *)d_ptr, ib, (const void *)vx,
ib * QK8_0 + iqs + j, v);
#ifdef GGML_SYCL_F16
dfloat2 t1{y[col + j + 0], y[col + j + 1]};
tmp += v * t1;
#else
tmp += v.x() * y[col + j + 0];
tmp += v.y() * y[col + j + 1];
#endif
}
}
// handle remaining columns
for (; i < ncols; i += iter_stride) {
if (tid >= ncols_left/QK8_0) continue;
const int col = i + vals_per_iter*tid;
const int ib = (row*ncols + col)/QK8_0;
const int iqs = col % QK8_0;
#pragma unroll
for (int j = 0; j < vals_per_iter; j += 2) {
dfloat2 v;
dequantize_q8_0_reorder((const void *)d_ptr, ib, (const void *)vx,
ib * QK8_0 + iqs + j, v);
#ifdef GGML_SYCL_F16
dfloat2 t1{y[col + j + 0], y[col + j + 1]};
tmp += v * t1;
#else
tmp += v.x() * y[col + j + 0];
tmp += v.y() * y[col + j + 1];
#endif
}
}
// reduce
const int mask_start = ncols > GGML_SYCL_DMMV_X ? WARP_SIZE >> 1 : WARP_SIZE >> 2;
for (int mask = mask_start; mask > 0; mask >>= 1) {
tmp += dpct::permute_sub_group_by_xor(item_ct1.get_sub_group(), tmp, mask);
}
if (tid == 0) {
#ifdef GGML_SYCL_F16
dst[row] = tmp.x() + tmp.y();
#else
dst[row] = tmp;
#endif
}
});
}
}
static void dequantize_mul_mat_vec_q8_0_sycl(const void *vx, const dfloat *y,
float *dst, const int ncols,
const int nrows,
@@ -1122,7 +1219,12 @@ void ggml_sycl_op_dequantize_mul_mat_vec(
dequantize_mul_mat_vec_q5_1_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q8_0:
dequantize_mul_mat_vec_q8_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
if ((ggml_tensor_extra_gpu *) dst->src[0]->extra &&
((ggml_tensor_extra_gpu *) dst->src[0]->extra)->optimized_feature.reorder) {
dequantize_mul_mat_vec_q8_0_sycl_reorder(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
} else {
dequantize_mul_mat_vec_q8_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
}
break;
case GGML_TYPE_Q2_K:
dequantize_mul_mat_vec_q2_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
+10
View File
@@ -1252,6 +1252,16 @@ static void launch_fattn_tile_switch_ncols1(ggml_backend_sycl_context & ctx, ggm
return;
}
{
constexpr int cols_per_block = ncols2*2;
const int nwarps = ggml_sycl_fattn_tile_get_nthreads (DKQ, DV, cols_per_block, cc) / warp_size;
const int nbatch_fa = ggml_sycl_fattn_tile_get_nbatch_fa(DKQ, DV, cols_per_block, cc);
launch_fattn<DV, cols_per_block/ncols2, ncols2,
flash_attn_tile<DKQ, DV, cols_per_block / ncols2, ncols2, use_logit_softcap, warp_size>, warp_size>
(ctx, dst, nwarps, nbytes_shared, nbatch_fa, true, true, false);
return;
}
GGML_ABORT("fatal error");
}
+41 -1
View File
@@ -411,7 +411,7 @@ ggml_backend_sycl_buffer_init_tensor(ggml_backend_buffer_t buffer,
assert(tensor->view_src->buffer->buft == buffer->buft);
return GGML_STATUS_SUCCESS;
}
if ((tensor->type == GGML_TYPE_Q4_0 || tensor->type == GGML_TYPE_Q4_K || tensor->type == GGML_TYPE_Q6_K) &&
if ((tensor->type == GGML_TYPE_Q4_0 || tensor->type == GGML_TYPE_Q8_0 || tensor->type == GGML_TYPE_Q4_K || tensor->type == GGML_TYPE_Q6_K) &&
!g_ggml_sycl_disable_optimize) {
ggml_tensor_extra_gpu * extra = new ggml_tensor_extra_gpu{};
tensor->extra = extra;
@@ -3254,6 +3254,7 @@ inline bool ggml_sycl_supports_mmq(enum ggml_type type) {
inline bool ggml_sycl_supports_reorder_mul_mat_sycl(enum ggml_type type) {
switch (type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
return true;
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q6_K:
@@ -3266,6 +3267,7 @@ inline bool ggml_sycl_supports_reorder_mul_mat_sycl(enum ggml_type type) {
inline bool ggml_sycl_supports_reorder_dmmv(enum ggml_type type) {
switch (type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
return true;
default:
return false;
@@ -3275,6 +3277,7 @@ inline bool ggml_sycl_supports_reorder_dmmv(enum ggml_type type) {
inline bool ggml_sycl_supports_reorder_mmvq(enum ggml_type type) {
switch (type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q6_K:
return true;
@@ -3364,6 +3367,40 @@ static void reorder_qw_q4_0(uint8_t * data_device, const int ncols, const int nr
sycl_ext_free(stream, tmp_buf);
}
static void reorder_qw_q8_0(uint8_t * data_device, const int ncols, const int nrows, size_t size, size_t offset,
dpct::queue_ptr stream) {
uint8_t * tmp_buf = static_cast<uint8_t *>(sycl_ext_malloc_device(stream, size));
sycl::event copy_event;
SYCL_CHECK(CHECK_TRY_ERROR(copy_event = stream->memcpy(tmp_buf, data_device, size)));
if (!g_ggml_sycl_use_async_mem_op) {
copy_event.wait();
}
GGML_ASSERT((size % sizeof(block_q8_0) == 0));
GGML_ASSERT((offset % sizeof(block_q8_0) == 0));
int offset_blks = offset / sizeof(block_q8_0);
auto qs_ptr = data_device + offset_blks * QK8_0;
auto d_ptr = (sycl::half*)(qs_ptr + ncols * nrows) + offset_blks;
auto reorder_event = stream->parallel_for(
size / sizeof(block_q8_0),
[=](auto i) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
const block_q8_0* x = (const block_q8_0*)tmp_buf;
const int ib = i;
for (int j = 0; j < QK8_0; j++)
{
*((int8_t*)qs_ptr + ib * QK8_0 + j) = x[ib].qs[j];
}
*(d_ptr + ib) = x[ib].d;
});
if (!g_ggml_sycl_use_async_mem_op) {
reorder_event.wait_and_throw();
}
sycl_ext_free(stream, tmp_buf);
}
static void reorder_qw_q4_k(uint8_t * data_device, size_t size, size_t offset, dpct::queue_ptr stream) {
GGML_ASSERT(size % sizeof(block_q4_K) == 0);
GGML_ASSERT(offset % sizeof(block_q4_K) == 0);
@@ -3460,6 +3497,9 @@ static void reorder_qw(const ggml_tensor * src0, dpct::queue_ptr stream) {
case GGML_TYPE_Q4_0:
reorder_qw_q4_0(data_device, ncols, nrows, size, 0, stream);
break;
case GGML_TYPE_Q8_0:
reorder_qw_q8_0(data_device, ncols, nrows, size, 0, stream);
break;
case GGML_TYPE_Q4_K:
reorder_qw_q4_k(data_device, size, 0, stream);
break;
+26 -1
View File
@@ -679,6 +679,25 @@ static void mul_mat_vec_q5_1_q8_1_sycl(const void *vx, const void *vy,
}
}
static void reorder_mul_mat_vec_q8_0_q8_1_sycl(const void * vx, const void * vy, float * dst, const int ncols,
const int nrows, dpct::queue_ptr stream) {
GGML_ASSERT(ncols % QK8_0 == 0);
const int block_num_y = ceil_div(nrows, GGML_SYCL_MMV_Y);
constexpr size_t num_subgroups = 16;
GGML_ASSERT(block_num_y % num_subgroups == 0);
const sycl::range<3> global_size(1, GGML_SYCL_MMV_Y, (block_num_y * WARP_SIZE));
const sycl::range<3> workgroup_size(1, GGML_SYCL_MMV_Y, num_subgroups * WARP_SIZE);
stream->submit([&](sycl::handler & cgh) {
cgh.parallel_for(sycl::nd_range<3>(global_size, workgroup_size),
[=](sycl::nd_item<3> nd_item) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
mul_mat_vec_q_reorder<reorder_vec_dot_q_sycl<GGML_TYPE_Q8_0>>(vx, vy, dst, ncols, nrows,
nd_item);
});
});
}
static void mul_mat_vec_q8_0_q8_1_sycl(const void *vx, const void *vy,
float *dst, const int ncols,
const int nrows,
@@ -1101,7 +1120,13 @@ void ggml_sycl_op_mul_mat_vec_q(ggml_backend_sycl_context & ctx, const ggml_tens
mul_mat_vec_q5_1_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff, stream);
break;
case GGML_TYPE_Q8_0:
mul_mat_vec_q8_0_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff, stream);
if ((ggml_tensor_extra_gpu *) dst->src[0]->extra &&
((ggml_tensor_extra_gpu *) dst->src[0]->extra)->optimized_feature.reorder) {
GGML_SYCL_DEBUG("Calling reorder_mul_mat_vec_q8_0_q8_1_sycl\n");
reorder_mul_mat_vec_q8_0_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff, stream);
} else {
mul_mat_vec_q8_0_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff, stream);
}
break;
case GGML_TYPE_Q2_K:
mul_mat_vec_q2_K_q8_1_sycl(src0_dd_i, src1_ddq_i_bs, dst_dd_i_bs, ne00, row_diff, stream);
+21
View File
@@ -105,6 +105,27 @@ template <> struct block_q_t<GGML_TYPE_Q6_K> {
static constexpr int block_to_q8_1_ratio() { return traits::qk / QK8_1; }
};
template <> struct block_q_t<GGML_TYPE_Q8_0> {
struct traits {
static constexpr uint32_t qk = QK8_0; // 32
static constexpr uint32_t qi = QI8_0; // 8
static constexpr uint32_t qr = QR8_0; // 1
static constexpr uint32_t vdr_mmvq = 4;
};
// Q8_0 reorder layout: [qs0|qs1|...|qsN][d0|d1|...|dN]
// Each block has 32 int8 weights (32 bytes) followed by all scales
static constexpr std::pair<int, int> get_block_offset(const int block_index, const int /* nblocks */) {
return { block_index * QK8_0, 0 };
}
static constexpr std::pair<int, int> get_d_offset(int nrows, int ncols, const int block_index) {
return { (ncols * nrows) + block_index * sizeof(ggml_half), 0 };
}
static constexpr int block_to_q8_1_ratio() { return traits::qk / QK8_1; } // 1
};
} // namespace ggml_sycl_reordered
#endif // GGML_SYCL_QUANTS_HPP
+40
View File
@@ -351,6 +351,46 @@ template <> struct reorder_vec_dot_q_sycl<GGML_TYPE_Q4_0> {
};
};
template <> struct reorder_vec_dot_q_sycl<GGML_TYPE_Q8_0> {
static constexpr ggml_type gtype = GGML_TYPE_Q8_0;
using q8_0_block = ggml_sycl_reordered::block_q_t<GGML_TYPE_Q8_0>;
using q8_0_traits = typename q8_0_block::traits;
__dpct_inline__ float vec_dot_q8_0_q8_1_impl(const int * v, const int * u, const float & d8_0, const sycl::half2 & ds8) {
int sumi = 0;
#pragma unroll
for (size_t i = 0; i < q8_0_traits::vdr_mmvq; ++i) {
// Q8_0 values are signed int8, no nibble extraction needed
// Direct dp4a: each int packs 4 int8 values
sumi = dpct::dp4a(v[i], u[i], sumi);
}
const sycl::float2 ds8f = ds8.convert<float, sycl::rounding_mode::automatic>();
// Q8_0 has no bias term (values are signed), so just scale
return d8_0 * sumi * ds8f.x();
}
__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr,
const sycl::half2 * q8_1_ds, const int & iqs) {
const int8_t * bq8_0 = static_cast<const int8_t *>(vbq) + ibx_offset.first;
const ggml_half d = *(reinterpret_cast<const ggml_half *>(static_cast<const uint8_t *>(vbq) + d_offset.first));
int v[q8_0_traits::vdr_mmvq];
int u[q8_0_traits::vdr_mmvq];
#pragma unroll
for (size_t i = 0; i < q8_0_traits::vdr_mmvq; ++i) {
v[i] = get_int_from_int8(bq8_0, iqs + i);
u[i] = get_int_from_int8_aligned(q8_1_quant_ptr, iqs + i);
}
return vec_dot_q8_0_q8_1_impl(v, u, d, *q8_1_ds);
};
};
static inline float vec_dot_q4_K_q8_1_common(const int * __restrict__ q4, const uint16_t * __restrict__ scales,
const ggml_half2 & dm, const block_q8_1 * __restrict__ bq8_1,
const int & iqs) {
+16 -8
View File
@@ -3447,11 +3447,19 @@ static void ggml_vk_load_shaders(vk_device& device) {
CREATE_FA(GGML_TYPE_F16, f16, FA_SCALAR, )
CREATE_FA(GGML_TYPE_Q4_0, q4_0, FA_SCALAR, )
CREATE_FA(GGML_TYPE_Q8_0, q8_0, FA_SCALAR, )
CREATE_FA(GGML_TYPE_Q4_1, q4_1, FA_SCALAR, )
CREATE_FA(GGML_TYPE_Q5_0, q5_0, FA_SCALAR, )
CREATE_FA(GGML_TYPE_Q5_1, q5_1, FA_SCALAR, )
CREATE_FA(GGML_TYPE_IQ4_NL, iq4_nl, FA_SCALAR, )
} else {
CREATE_FA(GGML_TYPE_F32, f32, FA_SCALAR, _fp32)
CREATE_FA(GGML_TYPE_F16, f16, FA_SCALAR, _fp32)
CREATE_FA(GGML_TYPE_Q4_0, q4_0, FA_SCALAR, _fp32)
CREATE_FA(GGML_TYPE_Q8_0, q8_0, FA_SCALAR, _fp32)
CREATE_FA(GGML_TYPE_Q4_1, q4_1, FA_SCALAR, _fp32)
CREATE_FA(GGML_TYPE_Q5_0, q5_0, FA_SCALAR, _fp32)
CREATE_FA(GGML_TYPE_Q5_1, q5_1, FA_SCALAR, _fp32)
CREATE_FA(GGML_TYPE_IQ4_NL, iq4_nl, FA_SCALAR, _fp32)
}
#if defined(VK_KHR_cooperative_matrix) && defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
if (device->coopmat1_fa_support) {
@@ -3459,6 +3467,10 @@ static void ggml_vk_load_shaders(vk_device& device) {
CREATE_FA(GGML_TYPE_F16, f16, FA_COOPMAT1, _cm1)
CREATE_FA(GGML_TYPE_Q4_0, q4_0, FA_COOPMAT1, _cm1)
CREATE_FA(GGML_TYPE_Q8_0, q8_0, FA_COOPMAT1, _cm1)
CREATE_FA(GGML_TYPE_Q4_1, q4_1, FA_COOPMAT1, _cm1)
CREATE_FA(GGML_TYPE_Q5_0, q5_0, FA_COOPMAT1, _cm1)
CREATE_FA(GGML_TYPE_Q5_1, q5_1, FA_COOPMAT1, _cm1)
CREATE_FA(GGML_TYPE_IQ4_NL, iq4_nl, FA_COOPMAT1, _cm1)
}
#endif
#if defined(VK_NV_cooperative_matrix2) && defined(GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
@@ -15331,11 +15343,12 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
case GGML_TYPE_F32:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
// supported in scalar and coopmat2 paths
break;
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
case GGML_TYPE_Q5_1:
case GGML_TYPE_IQ4_NL:
// supported in scalar and coopmat2 paths
break;
// K dequants currently disabled because D dimension is rounded up to 256 and runs inefficiently
//case GGML_TYPE_Q2_K:
//case GGML_TYPE_Q3_K:
@@ -15350,12 +15363,7 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
//case GGML_TYPE_IQ3_XXS:
//case GGML_TYPE_IQ3_S:
//case GGML_TYPE_IQ4_XS:
case GGML_TYPE_IQ4_NL:
// currently supported only in coopmat2 path
if (!coopmat2) {
return false;
}
break;
default:
return false;
}
@@ -110,6 +110,97 @@ FLOAT_TYPEV4 dequantize4(uint ib, uint iqs, uint a_offset, uint binding_idx) {
#if defined(DATA_A_Q4_0)
#define BLOCK_BYTE_SIZE 18
#elif defined(DATA_A_Q4_1)
#define BLOCK_BYTE_SIZE 20
#endif
#if defined(DATA_A_Q4_0) || defined(DATA_A_Q4_1)
FLOAT_TYPEV4 dequantize4(uint ib, uint iqs, uint a_offset, uint binding_idx) {
if (binding_idx == BINDING_IDX_K) {
uint vui_lo = uint(k_packed.k_data_packed16[a_offset + ib].qs[(iqs & 0xF) / 2 + 0]);
uint vui_hi = uint(k_packed.k_data_packed16[a_offset + ib].qs[(iqs & 0xF) / 2 + 1]);
uint shift = (iqs & 0x10) >> 2;
vui_lo >>= shift;
vui_hi >>= shift;
FLOAT_TYPEV4 nibbles = FLOAT_TYPEV4(vui_lo & 0xF, (vui_lo >> 8) & 0xF, vui_hi & 0xF, (vui_hi >> 8) & 0xF);
#ifdef DATA_A_Q4_1
return FLOAT_TYPE(k_packed.k_data_packed16[a_offset + ib].d) * nibbles + FLOAT_TYPE(k_packed.k_data_packed16[a_offset + ib].m);
#else
return FLOAT_TYPE(k_packed.k_data_packed16[a_offset + ib].d) * (nibbles - FLOAT_TYPE(8.0f));
#endif
} else {
uint vui_lo = uint(v_packed.v_data_packed16[a_offset + ib].qs[(iqs & 0xF) / 2 + 0]);
uint vui_hi = uint(v_packed.v_data_packed16[a_offset + ib].qs[(iqs & 0xF) / 2 + 1]);
uint shift = (iqs & 0x10) >> 2;
vui_lo >>= shift;
vui_hi >>= shift;
FLOAT_TYPEV4 nibbles = FLOAT_TYPEV4(vui_lo & 0xF, (vui_lo >> 8) & 0xF, vui_hi & 0xF, (vui_hi >> 8) & 0xF);
#ifdef DATA_A_Q4_1
return FLOAT_TYPE(v_packed.v_data_packed16[a_offset + ib].d) * nibbles + FLOAT_TYPE(v_packed.v_data_packed16[a_offset + ib].m);
#else
return FLOAT_TYPE(v_packed.v_data_packed16[a_offset + ib].d) * (nibbles - FLOAT_TYPE(8.0f));
#endif
}
}
#endif
#if defined(DATA_A_Q5_0)
#define BLOCK_BYTE_SIZE 22
#elif defined(DATA_A_Q5_1)
#define BLOCK_BYTE_SIZE 24
#endif
#if defined(DATA_A_Q5_0) || defined(DATA_A_Q5_1)
FLOAT_TYPEV4 dequantize4(uint ib, uint iqs, uint a_offset, uint binding_idx) {
if (binding_idx == BINDING_IDX_K) {
uint vui_lo = uint(k_packed.k_data_packed16[a_offset + ib].qs[(iqs & 0xF) / 2 + 0]);
uint vui_hi = uint(k_packed.k_data_packed16[a_offset + ib].qs[(iqs & 0xF) / 2 + 1]);
uint shift = (iqs & 0x10) >> 2;
vui_lo >>= shift;
vui_hi >>= shift;
#ifdef DATA_A_Q5_1
uint qh = k_packed.k_data_packed16[a_offset + ib].qh;
#else
uint qh = uint(k_packed.k_data_packed16[a_offset + ib].qh[0]) | (uint(k_packed.k_data_packed16[a_offset + ib].qh[1]) << 16);
#endif
FLOAT_TYPEV4 hb = FLOAT_TYPEV4((qh >> iqs) & 1, (qh >> (iqs + 1)) & 1, (qh >> (iqs + 2)) & 1, (qh >> (iqs + 3)) & 1) * FLOAT_TYPE(16.0f);
FLOAT_TYPEV4 nibbles = FLOAT_TYPEV4(vui_lo & 0xF, (vui_lo >> 8) & 0xF, vui_hi & 0xF, (vui_hi >> 8) & 0xF);
#ifdef DATA_A_Q5_1
return FLOAT_TYPE(k_packed.k_data_packed16[a_offset + ib].d) * (nibbles + hb) + FLOAT_TYPE(k_packed.k_data_packed16[a_offset + ib].m);
#else
return FLOAT_TYPE(k_packed.k_data_packed16[a_offset + ib].d) * (nibbles + hb - FLOAT_TYPE(16.0f));
#endif
} else {
uint vui_lo = uint(v_packed.v_data_packed16[a_offset + ib].qs[(iqs & 0xF) / 2 + 0]);
uint vui_hi = uint(v_packed.v_data_packed16[a_offset + ib].qs[(iqs & 0xF) / 2 + 1]);
uint shift = (iqs & 0x10) >> 2;
vui_lo >>= shift;
vui_hi >>= shift;
#ifdef DATA_A_Q5_1
uint qh = v_packed.v_data_packed16[a_offset + ib].qh;
#else
uint qh = uint(v_packed.v_data_packed16[a_offset + ib].qh[0]) | (uint(v_packed.v_data_packed16[a_offset + ib].qh[1]) << 16);
#endif
FLOAT_TYPEV4 hb = FLOAT_TYPEV4((qh >> iqs) & 1, (qh >> (iqs + 1)) & 1, (qh >> (iqs + 2)) & 1, (qh >> (iqs + 3)) & 1) * FLOAT_TYPE(16.0f);
FLOAT_TYPEV4 nibbles = FLOAT_TYPEV4(vui_lo & 0xF, (vui_lo >> 8) & 0xF, vui_hi & 0xF, (vui_hi >> 8) & 0xF);
#ifdef DATA_A_Q5_1
return FLOAT_TYPE(v_packed.v_data_packed16[a_offset + ib].d) * (nibbles + hb) + FLOAT_TYPE(v_packed.v_data_packed16[a_offset + ib].m);
#else
return FLOAT_TYPE(v_packed.v_data_packed16[a_offset + ib].d) * (nibbles + hb - FLOAT_TYPE(16.0f));
#endif
}
}
#endif
#if defined(DATA_A_IQ4_NL)
#define BLOCK_BYTE_SIZE 18
FLOAT_TYPEV4 dequantize4(uint ib, uint iqs, uint a_offset, uint binding_idx) {
if (binding_idx == BINDING_IDX_K) {
@@ -119,7 +210,11 @@ FLOAT_TYPEV4 dequantize4(uint ib, uint iqs, uint a_offset, uint binding_idx) {
vui_lo >>= shift;
vui_hi >>= shift;
return FLOAT_TYPE(k_packed.k_data_packed16[a_offset + ib].d) * (FLOAT_TYPEV4(vui_lo & 0xF, (vui_lo >> 8) & 0xF, vui_hi & 0xF, (vui_hi >> 8) & 0xF) - FLOAT_TYPE(8.0f));
return FLOAT_TYPE(k_packed.k_data_packed16[a_offset + ib].d) * FLOAT_TYPEV4(
kvalues_iq4nl[vui_lo & 0xF],
kvalues_iq4nl[(vui_lo >> 8) & 0xF],
kvalues_iq4nl[vui_hi & 0xF],
kvalues_iq4nl[(vui_hi >> 8) & 0xF]);
} else {
uint vui_lo = uint(v_packed.v_data_packed16[a_offset + ib].qs[(iqs & 0xF) / 2 + 0]);
uint vui_hi = uint(v_packed.v_data_packed16[a_offset + ib].qs[(iqs & 0xF) / 2 + 1]);
@@ -127,11 +222,14 @@ FLOAT_TYPEV4 dequantize4(uint ib, uint iqs, uint a_offset, uint binding_idx) {
vui_lo >>= shift;
vui_hi >>= shift;
return FLOAT_TYPE(v_packed.v_data_packed16[a_offset + ib].d) * (FLOAT_TYPEV4(vui_lo & 0xF, (vui_lo >> 8) & 0xF, vui_hi & 0xF, (vui_hi >> 8) & 0xF) - FLOAT_TYPE(8.0f));
return FLOAT_TYPE(v_packed.v_data_packed16[a_offset + ib].d) * FLOAT_TYPEV4(
kvalues_iq4nl[vui_lo & 0xF],
kvalues_iq4nl[(vui_lo >> 8) & 0xF],
kvalues_iq4nl[vui_hi & 0xF],
kvalues_iq4nl[(vui_hi >> 8) & 0xF]);
}
}
#endif
#if defined(DATA_A_Q8_0)
#define BLOCK_BYTE_SIZE 34
FLOAT_TYPEV4 dequantize4(uint ib, uint iqs, uint a_offset, uint binding_idx) {
@@ -137,6 +137,7 @@ void execute_command(std::vector<std::string>& command, std::string& stdout_str,
pid_t pid = fork();
if (pid < 0) {
std::cerr << strerror(errno) << "\n";
throw std::runtime_error("Failed to fork process");
}
@@ -655,7 +656,7 @@ void process_shaders() {
if (tname == "f16") {
string_to_spv("flash_attn_f32_f16_" + tname, "flash_attn_cm1.comp",
merge_maps(fa_base_dict, {{"Q_TYPE", "float"}, {"D_TYPE", "float"}, {"D_TYPEV4", "vec4"}, {"COOPMAT", "1"}}), fp16, true, false, f16acc);
} else if (tname == "q4_0" || tname == "q8_0" || tname == "f32") {
} else if (tname == "q4_0" || tname == "q4_1" || tname == "q5_0" || tname == "q5_1" || tname == "iq4_nl" || tname == "q8_0" || tname == "f32") {
std::string data_a_key = "DATA_A_" + to_uppercase(tname);
string_to_spv("flash_attn_f32_f16_" + tname, "flash_attn_cm1.comp",
merge_maps(fa_base_dict, {{data_a_key, "1"}, {"Q_TYPE", "float"}, {"D_TYPE", "float"}, {"D_TYPEV4", "vec4"}, {"BLOCK_SIZE", "QUANT_K_"+to_uppercase(tname)}, {"COOPMAT", "1"}}), fp16, true, false, f16acc);
@@ -666,7 +667,7 @@ void process_shaders() {
if (tname == "f16") {
string_to_spv("flash_attn_f32_f16_" + tname, "flash_attn.comp",
merge_maps(fa_base_dict, {{"Q_TYPE", "float"}, {"D_TYPE", "float"}, {"D_TYPEV4", "vec4"}}), fp16, false, false, f16acc);
} else if (tname == "q4_0" || tname == "q8_0" || tname == "f32") {
} else if (tname == "q4_0" || tname == "q4_1" || tname == "q5_0" || tname == "q5_1" || tname == "iq4_nl" || tname == "q8_0" || tname == "f32") {
std::string data_a_key = "DATA_A_" + to_uppercase(tname);
string_to_spv("flash_attn_f32_f16_" + tname, "flash_attn.comp",
merge_maps(fa_base_dict, {{data_a_key, "1"}, {"Q_TYPE", "float"}, {"D_TYPE", "float"}, {"D_TYPEV4", "vec4"}, {"BLOCK_SIZE", "QUANT_K_"+to_uppercase(tname) }}), fp16, false, false, f16acc);
+157 -20
View File
@@ -437,12 +437,18 @@ inline uint32_t ggml_webgpu_flash_attn_pick_vec_ne(const ggml_webgpu_flash_attn_
// Head-dim specializations used by the tuned vec f16 path.
switch (key.head_dim_qk) {
case 64: return 2u;
case 96: return 4u;
case 128: return 1u;
case 192: return 2u;
case 576: return 2u;
default: return 1u;
case 64:
return 2u;
case 96:
return 4u;
case 128:
return 1u;
case 192:
return 2u;
case 576:
return 2u;
default:
return 1u;
}
}
@@ -513,9 +519,9 @@ struct ggml_webgpu_flash_attn_blk_shader_lib_context {
};
inline ggml_webgpu_processed_shader ggml_webgpu_preprocess_flash_attn_blk_shader(
pre_wgsl::Preprocessor & preprocessor,
const char * shader_src,
const ggml_webgpu_flash_attn_blk_shader_lib_context & context) {
pre_wgsl::Preprocessor & preprocessor,
const char * shader_src,
const ggml_webgpu_flash_attn_blk_shader_lib_context & context) {
std::vector<std::string> defines;
std::string variant = "flash_attn_vec_blk";
@@ -652,6 +658,26 @@ struct ggml_webgpu_mul_mat_shader_decisions {
uint32_t mul_mat_wg_size;
};
/** MUL_MAT_ID **/
struct ggml_webgpu_mul_mat_id_pipeline_key {
ggml_type src0_type;
ggml_type src1_type;
bool operator==(const ggml_webgpu_mul_mat_id_pipeline_key & other) const {
return src0_type == other.src0_type && src1_type == other.src1_type;
}
};
struct ggml_webgpu_mul_mat_id_pipeline_key_hash {
size_t operator()(const ggml_webgpu_mul_mat_id_pipeline_key & key) const {
size_t seed = 0;
ggml_webgpu_hash_combine(seed, key.src0_type);
ggml_webgpu_hash_combine(seed, key.src1_type);
return seed;
}
};
/** Cpy **/
struct ggml_webgpu_cpy_pipeline_key {
@@ -791,7 +817,10 @@ class ggml_webgpu_shader_lib {
std::unordered_map<ggml_webgpu_mul_mat_vec_pipeline_key, webgpu_pipeline, ggml_webgpu_mul_mat_vec_pipeline_key_hash>
mul_mat_vec_pipelines; // fast mat-vec (n==1)
std::unordered_map<ggml_webgpu_mul_mat_pipeline_key, webgpu_pipeline, ggml_webgpu_mul_mat_pipeline_key_hash>
mul_mat_fast_pipelines; // fast mat-mat (reg-tile or subgroup)
mul_mat_fast_pipelines; // fast mat-mat (reg-tile or subgroup)
std::unordered_map<int, webgpu_pipeline> mul_mat_id_gather_pipelines; // key is fixed
std::unordered_map<ggml_webgpu_mul_mat_id_pipeline_key, webgpu_pipeline, ggml_webgpu_mul_mat_id_pipeline_key_hash>
mul_mat_id_pipelines; // src0_type/src1_type
std::unordered_map<ggml_webgpu_set_rows_pipeline_key, webgpu_pipeline, ggml_webgpu_set_rows_pipeline_key_hash>
set_rows_pipelines;
@@ -1592,6 +1621,115 @@ class ggml_webgpu_shader_lib {
return mul_mat_legacy_pipelines[key];
}
webgpu_pipeline get_mul_mat_id_gather_pipeline(const ggml_webgpu_shader_lib_context & context) {
auto it = mul_mat_id_gather_pipelines.find(1);
if (it != mul_mat_id_gather_pipelines.end()) {
return it->second;
}
std::vector<std::string> defines;
defines.push_back(std::string("WG_SIZE=") + std::to_string(context.max_wg_size));
auto processed = preprocessor.preprocess(wgsl_mul_mat_id_gather, defines);
auto decisions = std::make_shared<ggml_webgpu_generic_shader_decisions>();
decisions->wg_size = context.max_wg_size;
webgpu_pipeline pipeline = ggml_webgpu_create_pipeline(device, processed, "mul_mat_id_gather");
pipeline.context = decisions;
mul_mat_id_gather_pipelines[1] = pipeline;
return pipeline;
}
webgpu_pipeline get_mul_mat_id_pipeline(const ggml_webgpu_shader_lib_context & context) {
ggml_webgpu_mul_mat_id_pipeline_key key = {
.src0_type = context.src0->type,
.src1_type = context.src1->type,
};
auto it = mul_mat_id_pipelines.find(key);
if (it != mul_mat_id_pipelines.end()) {
return it->second;
}
std::vector<std::string> defines;
std::string variant = "mul_mat_id";
defines.push_back("MUL_MAT_ID");
// src1 type
switch (context.src1->type) {
case GGML_TYPE_F32:
defines.push_back("SRC1_INNER_TYPE=f32");
break;
case GGML_TYPE_F16:
defines.push_back("SRC1_INNER_TYPE=f16");
break;
default:
GGML_ABORT("Unsupported src1 type for mul_mat fast shader");
}
// src0 type
const struct ggml_type_traits * src0_traits = ggml_get_type_traits(context.src0->type);
const char * src0_name = src0_traits->type_name;
switch (context.src0->type) {
case GGML_TYPE_F32:
defines.push_back("SRC0_INNER_TYPE=f32");
defines.push_back("FLOAT");
defines.push_back("INIT_SRC0_SHMEM_FLOAT");
defines.push_back("INIT_SRC1_SHMEM_FLOAT");
variant += "_f32";
break;
case GGML_TYPE_F16:
defines.push_back("SRC0_INNER_TYPE=f16");
defines.push_back("FLOAT");
defines.push_back("INIT_SRC0_SHMEM_FLOAT");
defines.push_back("INIT_SRC1_SHMEM_FLOAT");
variant += "_f16";
break;
default:
{
std::string type_upper = src0_name;
std::transform(type_upper.begin(), type_upper.end(), type_upper.begin(), ::toupper);
defines.push_back("BYTE_HELPERS");
defines.push_back("INIT_SRC0_SHMEM_" + type_upper);
defines.push_back("INIT_SRC1_SHMEM_FLOAT");
defines.push_back("U32_DEQUANT_HELPERS");
defines.push_back("SRC0_INNER_TYPE=u32");
variant += std::string("_") + src0_name;
break;
}
}
defines.push_back("SCALAR");
// Tiles
defines.push_back("TILE_M=" + std::to_string(WEBGPU_MUL_MAT_TILE_M) + "u");
defines.push_back("TILE_N=" + std::to_string(WEBGPU_MUL_MAT_TILE_N) + "u");
defines.push_back("TILE_K=" + std::to_string(WEBGPU_MUL_MAT_TILE_K) + "u");
defines.push_back("WORKGROUP_SIZE_M=" + std::to_string(WEBGPU_MUL_MAT_WG_SIZE_M) + "u");
defines.push_back("WORKGROUP_SIZE_N=" + std::to_string(WEBGPU_MUL_MAT_WG_SIZE_N) + "u");
// variant suffix for src1 type
variant += std::string("_") + (context.src1->type == GGML_TYPE_F32 ? "f32" : "f16");
auto processed = preprocessor.preprocess(wgsl_mul_mat_id, defines);
auto decisions = std::make_shared<ggml_webgpu_mul_mat_shader_decisions>();
decisions->tile_k = WEBGPU_MUL_MAT_TILE_K;
decisions->tile_m = WEBGPU_MUL_MAT_TILE_M;
decisions->tile_n = WEBGPU_MUL_MAT_TILE_N;
decisions->wg_size_m = WEBGPU_MUL_MAT_WG_SIZE_M;
decisions->wg_size_n = WEBGPU_MUL_MAT_WG_SIZE_N;
decisions->wg_size = WEBGPU_MUL_MAT_WG_SIZE_M * WEBGPU_MUL_MAT_WG_SIZE_N;
webgpu_pipeline pipeline = ggml_webgpu_create_pipeline(device, processed, variant);
pipeline.context = decisions;
mul_mat_id_pipelines[key] = pipeline;
return mul_mat_id_pipelines[key];
}
webgpu_pipeline get_unary_pipeline(const ggml_webgpu_shader_lib_context & context) {
const bool is_unary = context.dst->op == GGML_OP_UNARY;
const int op = is_unary ? (int) ggml_get_unary_op(context.dst) : context.dst->op;
@@ -1857,9 +1995,8 @@ class ggml_webgpu_shader_lib {
defines.push_back(std::string("SG_MAT_K=") + std::to_string(context.sg_mat_k));
uint32_t q_tile = context.sg_mat_m;
uint32_t kv_tile =
std::min(ggml_webgpu_flash_attn_max_kv_tile(context),
context.sg_mat_n * GGML_WEBGPU_FLASH_ATTN_PREFERRED_KV_SG_TILES);
uint32_t kv_tile = std::min(ggml_webgpu_flash_attn_max_kv_tile(context),
context.sg_mat_n * GGML_WEBGPU_FLASH_ATTN_PREFERRED_KV_SG_TILES);
if (context.key.use_vec) {
q_tile = 1;
kv_tile = std::max(context.sg_mat_n, std::min(32u, ggml_webgpu_flash_attn_max_kv_tile(context)));
@@ -1885,14 +2022,14 @@ class ggml_webgpu_shader_lib {
}
defines.push_back(std::string("WG_SIZE=") + std::to_string(wg_size));
const char * shader_src = context.key.use_vec ? wgsl_flash_attn_vec_split : wgsl_flash_attn;
const char * shader_src = context.key.use_vec ? wgsl_flash_attn_vec_split : wgsl_flash_attn;
webgpu_pipeline pipeline =
ggml_webgpu_create_pipeline(device, preprocessor.preprocess(shader_src, defines), variant);
auto decisions = std::make_shared<ggml_webgpu_flash_attn_shader_decisions>();
decisions->q_tile = q_tile;
decisions->kv_tile = kv_tile;
decisions->wg_size = wg_size;
pipeline.context = decisions;
auto decisions = std::make_shared<ggml_webgpu_flash_attn_shader_decisions>();
decisions->q_tile = q_tile;
decisions->kv_tile = kv_tile;
decisions->wg_size = wg_size;
pipeline.context = decisions;
flash_attn_pipelines[context.key] = pipeline;
return flash_attn_pipelines[context.key];
}
@@ -1905,7 +2042,7 @@ class ggml_webgpu_shader_lib {
ggml_webgpu_processed_shader processed =
ggml_webgpu_preprocess_flash_attn_blk_shader(preprocessor, wgsl_flash_attn_vec_blk, context);
webgpu_pipeline pipeline = ggml_webgpu_create_pipeline(device, processed.wgsl, processed.variant);
webgpu_pipeline pipeline = ggml_webgpu_create_pipeline(device, processed.wgsl, processed.variant);
flash_attn_blk_pipelines[context.key] = pipeline;
return flash_attn_blk_pipelines[context.key];
}
File diff suppressed because it is too large Load Diff
@@ -42,6 +42,7 @@ fn init_shmem_src0(thread_id: u32, batch_offset: u32, offset_m: u32, k_outer: u3
}
#endif // INIT_SRC0_SHMEM_FLOAT
#ifndef MUL_MAT_ID
#ifdef INIT_SRC1_SHMEM_FLOAT
fn init_shmem_src1(thread_id: u32, batch_offset: u32, offset_n: u32, k_outer: u32) {
for (var elem_idx = thread_id * VEC_SIZE; elem_idx < TILE_SRC1_SHMEM; elem_idx += TOTAL_WORKGROUP_SIZE * VEC_SIZE) {
@@ -58,6 +59,7 @@ fn init_shmem_src1(thread_id: u32, batch_offset: u32, offset_n: u32, k_outer: u3
}
}
#endif // INIT_SRC1_SHMEM_FLOAT
#endif
#ifdef INIT_SRC0_SHMEM_Q4_0
const BLOCK_SIZE = 32u;
@@ -0,0 +1,193 @@
enable f16;
#include "common_decls.tmpl"
#include "mul_mat_decls.tmpl"
#ifdef VEC
fn store_val(acc: array<array<f16, TILE_M>, TILE_N>, tn: u32, tm: u32) -> vec4<f32> {
return vec4<f32>(f32(acc[tn][tm]), f32(acc[tn][tm + 1]), f32(acc[tn][tm + 2]), f32(acc[tn][tm + 3]));
}
#endif
#ifdef SCALAR
fn store_val(acc: array<array<f16, TILE_M>, TILE_N>, tn: u32, tm: u32) -> f32 {
return f32(acc[tn][tm]);
}
#endif
struct MulMatIdParams {
offset_src0: u32,
offset_src1: u32,
offset_dst: u32,
k: u32,
m: u32,
n_expert: u32,
n_expert_used: u32,
n_tokens: u32,
b_ne1: u32,
stride_01: u32,
stride_11: u32,
stride_02: u32,
stride_12: u32,
};
@group(0) @binding(0) var<storage, read_write> src0: array<SRC0_TYPE>; // [cols, rows, n_expert]
@group(0) @binding(1) var<storage, read_write> src1: array<SRC1_TYPE>; // [cols, b_ne1, n_tokens]
@group(0) @binding(2) var<storage, read_write> dst: array<DST_TYPE>; // [rows, n_expert_used, n_tokens]
@group(0) @binding(3) var<storage, read_write> global_gathered_expert_used: array<u32>; // [n_expert][n_tokens]
@group(0) @binding(4) var<storage, read_write> global_gathered_tokens: array<u32>; // [n_expert][n_tokens]
@group(0) @binding(5) var<storage, read_write> gathered_count_ids: array<u32>; // [n_expert]
@group(0) @binding(6) var<uniform> params: MulMatIdParams;
fn get_local_n(thread_id: u32) -> u32 {
return thread_id / WORKGROUP_SIZE_M;
}
fn get_local_m(thread_id: u32) -> u32 {
return thread_id % WORKGROUP_SIZE_M;
}
const TOTAL_WORKGROUP_SIZE = WORKGROUP_SIZE_M * WORKGROUP_SIZE_N;
const TILE_SRC0_SHMEM = TILE_K * WORKGROUP_SIZE_M * TILE_M;
const TILE_SRC1_SHMEM = TILE_K * WORKGROUP_SIZE_N * TILE_N;
var<workgroup> shmem: array<f16, TILE_SRC0_SHMEM + TILE_SRC1_SHMEM>;
var<workgroup> gathered_expert_used: array<u32, TILE_N * WORKGROUP_SIZE_N>;
var<workgroup> gathered_tokens: array<u32, TILE_N * WORKGROUP_SIZE_N>;
#ifdef INIT_SRC1_SHMEM_FLOAT
fn init_shmem_id_src1(thread_id: u32, offset_src1: u32, rest_token_n: u32, k_outer: u32) {
for (var elem_idx = thread_id * VEC_SIZE; elem_idx < TILE_SRC1_SHMEM; elem_idx += TOTAL_WORKGROUP_SIZE * VEC_SIZE) {
let tile_n = elem_idx / TILE_K;
let tile_k = elem_idx % TILE_K;
if (tile_n < rest_token_n) {
let global_src10 = k_outer + tile_k;
let expert_used_idx = gathered_expert_used[tile_n] % params.b_ne1;
let token_idx = gathered_tokens[tile_n];
let src1_idx = offset_src1 + token_idx * params.stride_12 + expert_used_idx * params.stride_11 + global_src10;
let src1_val = select(
SRC1_TYPE(0.0),
src1[src1_idx/VEC_SIZE],
global_src10 < params.k);
store_shmem(SHMEM_TYPE(src1_val), TILE_SRC0_SHMEM + elem_idx);
} else {
store_shmem(SHMEM_TYPE(0.0), TILE_SRC0_SHMEM + elem_idx);
}
}
}
#endif // INIT_SRC1_SHMEM_FLOAT
@compute @workgroup_size(TOTAL_WORKGROUP_SIZE)
fn main(@builtin(workgroup_id) wg_id: vec3<u32>,
@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(num_workgroups) num_wg: vec3<u32>) {
let thread_id = local_id.x;
let local_m = get_local_m(thread_id);
let local_n = get_local_n(thread_id);
var expert_idx:u32 = 0xFFFFFFFFu;
var wg_in_batch:u32 = 0;
var wg_sum:u32 = 0;
let wg_m_count = (params.m + WORKGROUP_SIZE_M * TILE_M - 1u) / (WORKGROUP_SIZE_M * TILE_M);
let wg_linear = wg_id.y * num_wg.x + wg_id.x;
for (var i = 0u;i < params.n_expert;i += 1) {
let wg_n_count = (gathered_count_ids[i] + WORKGROUP_SIZE_N * TILE_N - 1u) / (WORKGROUP_SIZE_N * TILE_N);
let wg_per_matrix = wg_m_count * wg_n_count;
if (wg_sum <= wg_linear && wg_linear < wg_sum + wg_per_matrix) {
expert_idx = i;
wg_in_batch = wg_linear - wg_sum;
break;
}
wg_sum += wg_per_matrix;
}
let is_valid = expert_idx != 0xFFFFFFFFu;
var wg_m: u32 = 0;
var wg_n: u32 = 0;
var offset_wg_m: u32 = 0;
var offset_wg_n: u32 = 0;
var rest_token_n: u32 = 0;
var src0_batch_offset: u32 = 0;
wg_m = wg_in_batch % wg_m_count;
wg_n = wg_in_batch / wg_m_count;
offset_wg_m = wg_m * WORKGROUP_SIZE_M * TILE_M;
offset_wg_n = wg_n * WORKGROUP_SIZE_N * TILE_N;
if (is_valid) {
rest_token_n = gathered_count_ids[expert_idx] - offset_wg_n;
let global_gathered_base = expert_idx * params.n_tokens + offset_wg_n;
for (var i = thread_id; i < TILE_N * WORKGROUP_SIZE_N && offset_wg_n + i < gathered_count_ids[expert_idx]; i += TOTAL_WORKGROUP_SIZE) {
gathered_expert_used[i] = global_gathered_expert_used[global_gathered_base + i];
gathered_tokens[i] = global_gathered_tokens[global_gathered_base + i];
}
src0_batch_offset = params.offset_src0 + expert_idx * params.stride_02;
}
workgroupBarrier();
let output_row_base = offset_wg_m + local_m * TILE_M;
let output_col_base = offset_wg_n + local_n * TILE_N;
let dst2_stride = params.m * params.n_expert_used;
let dst1_stride = params.m;
var acc: array<array<f16, TILE_M>, TILE_N>;
for (var k_outer = 0u; k_outer < params.k; k_outer += TILE_K) {
if (is_valid) {
init_shmem_src0(thread_id, src0_batch_offset, offset_wg_m, k_outer);
init_shmem_id_src1(thread_id, params.offset_src1, rest_token_n, k_outer);
}
workgroupBarrier();
if (is_valid) {
let k_end = min(TILE_K, params.k - k_outer);
for (var k_inner = 0u; k_inner < k_end; k_inner++) {
var src0_tile: array<f16, TILE_M>;
for (var tm = 0u; tm < TILE_M; tm++) {
let src0_m = local_m * TILE_M + tm;
let src0_idx = k_inner + src0_m * TILE_K;
src0_tile[tm] = shmem[src0_idx];
}
for (var tn = 0u; tn < TILE_N; tn++) {
let src1_n = local_n * TILE_N + tn;
let src1_idx = src1_n * TILE_K + k_inner;
let src1_val = shmem[TILE_SRC0_SHMEM + src1_idx];
for (var tm = 0u; tm < TILE_M; tm++) {
acc[tn][tm] += src0_tile[tm] * src1_val;
}
}
}
}
workgroupBarrier();
}
if (is_valid) {
for (var tn = 0u; tn < TILE_N; tn++) {
let n_idx = output_col_base + tn;
if (n_idx < gathered_count_ids[expert_idx]) {
let dst1_idx = gathered_expert_used[n_idx - offset_wg_n];
let dst2_idx = gathered_tokens[n_idx - offset_wg_n];
let dst12_offset = params.offset_dst + dst2_idx * dst2_stride + dst1_idx * dst1_stride;
for (var tm = 0u; tm < TILE_M; tm += VEC_SIZE) {
let global_row = output_row_base + tm;
if (global_row < params.m) {
let dst_idx = dst12_offset + global_row;
dst[dst_idx/VEC_SIZE] = store_val(acc, tn, tm);
}
}
}
}
}
}
@@ -0,0 +1,55 @@
enable f16;
struct MulMatIdGatherParams {
offset_ids: u32,
n_expert: u32,
n_expert_used: u32,
n_tokens: u32,
stride_ids_1: u32,
};
@group(0) @binding(0) var<storage, read_write> ids: array<i32>; // [n_expert_used, n_tokens]
@group(0) @binding(1) var<storage, read_write> global_gathered_expert_used: array<u32>; // [n_expert][n_tokens]
@group(0) @binding(2) var<storage, read_write> global_gathered_tokens: array<u32>; // [n_expert][n_tokens]
@group(0) @binding(3) var<storage, read_write> gathered_count_ids: array<u32>; // [n_expert]
@group(0) @binding(4) var<uniform> params: MulMatIdGatherParams;
var<workgroup> count:atomic<u32>;
@compute @workgroup_size(WG_SIZE)
fn main(@builtin(workgroup_id) wg_id: vec3<u32>,
@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(num_workgroups) num_wg: vec3<u32>) {
let thread_id = local_id.x;
let own_expert = wg_id.y * num_wg.x + wg_id.x; // the expert assigned to this workgroup
if (own_expert < params.n_expert) {
if (thread_id == 0u) {
atomicStore(&count, 0);
}
workgroupBarrier();
for (var i = thread_id;i < params.n_expert_used * params.n_tokens;i += WG_SIZE) {
let row = i / params.n_expert_used;
let col = i % params.n_expert_used;
let expert = u32(ids[params.offset_ids + row * params.stride_ids_1 + col]);
if (own_expert == expert) {
let pos = atomicAdd(&count, 1u);
let gathered_id = own_expert * params.n_tokens + pos;
global_gathered_expert_used[gathered_id] = col;
global_gathered_tokens[gathered_id] = row;
}
}
workgroupBarrier();
if (thread_id == 0u) {
gathered_count_ids[own_expert] = atomicLoad(&count);
}
}
}
+10
View File
@@ -651,6 +651,14 @@ static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT] = {
.to_float = (ggml_to_float_t) ggml_fp16_to_fp32_row,
.from_float_ref = (ggml_from_float_t) ggml_fp32_to_fp16_row,
},
[GGML_TYPE_Q1_0] = {
.type_name = "q1_0",
.blck_size = QK1_0,
.type_size = sizeof(block_q1_0),
.is_quantized = true,
.to_float = (ggml_to_float_t) dequantize_row_q1_0,
.from_float_ref = (ggml_from_float_t) quantize_row_q1_0_ref,
},
[GGML_TYPE_Q4_0] = {
.type_name = "q4_0",
.blck_size = QK4_0,
@@ -1384,6 +1392,7 @@ enum ggml_type ggml_ftype_to_ggml_type(enum ggml_ftype ftype) {
case GGML_FTYPE_MOSTLY_BF16: wtype = GGML_TYPE_BF16; break;
case GGML_FTYPE_MOSTLY_Q4_0: wtype = GGML_TYPE_Q4_0; break;
case GGML_FTYPE_MOSTLY_Q4_1: wtype = GGML_TYPE_Q4_1; break;
case GGML_FTYPE_MOSTLY_Q1_0: wtype = GGML_TYPE_Q1_0; break;
case GGML_FTYPE_MOSTLY_Q5_0: wtype = GGML_TYPE_Q5_0; break;
case GGML_FTYPE_MOSTLY_Q5_1: wtype = GGML_TYPE_Q5_1; break;
case GGML_FTYPE_MOSTLY_Q8_0: wtype = GGML_TYPE_Q8_0; break;
@@ -7652,6 +7661,7 @@ size_t ggml_quantize_chunk(
size_t result = 0;
switch (type) {
case GGML_TYPE_Q1_0: result = quantize_q1_0(src + start, (char *) dst + start_row * row_size, nrows, n_per_row, imatrix); break;
case GGML_TYPE_Q4_0: result = quantize_q4_0(src + start, (char *) dst + start_row * row_size, nrows, n_per_row, imatrix); break;
case GGML_TYPE_Q4_1: result = quantize_q4_1(src + start, (char *) dst + start_row * row_size, nrows, n_per_row, imatrix); break;
case GGML_TYPE_Q5_0: result = quantize_q5_0(src + start, (char *) dst + start_row * row_size, nrows, n_per_row, imatrix); break;
+17
View File
@@ -506,6 +506,7 @@ class VISION_PROJECTOR_TYPE(IntEnum):
GEMMA3N = auto()
GEMMA3 = auto()
QWEN3VL = auto()
STEP3VL = auto()
COGVLM = auto()
@@ -734,6 +735,7 @@ class MODEL_TENSOR(IntEnum):
V_LAYER_OUT_SCALE = auto()
V_PRE_NORM = auto()
V_POST_NORM = auto()
V_MM_PRE_NORM = auto() # hunyuanocr
V_MM_POST_NORM = auto()
V_MM_INP_NORM = auto()
V_MM_INP_PROJ = auto() # gemma3
@@ -769,6 +771,8 @@ class MODEL_TENSOR(IntEnum):
V_MM_GATE = auto() # cogvlm
V_TOK_BOI = auto() # cogvlm
V_TOK_EOI = auto() # cogvlm
V_TOK_IMG_BEGIN = auto() # hunyuanocr
V_TOK_IMG_END = auto() # hunyuanocr
V_STD_BIAS = auto() # gemma4
V_STD_SCALE = auto() # gemma4
V_SAM_POS_EMBD = auto() # Deepseek-OCR
@@ -984,6 +988,8 @@ VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
VISION_PROJECTOR_TYPE.GLM_EDGE: "adapter",
VISION_PROJECTOR_TYPE.MERGER: "qwen2vl_merger",
VISION_PROJECTOR_TYPE.GEMMA3: "gemma3",
VISION_PROJECTOR_TYPE.QWEN3VL: "qwen3vl_merger",
VISION_PROJECTOR_TYPE.STEP3VL: "step3vl",
}
TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
@@ -1246,6 +1252,9 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
MODEL_TENSOR.V_MM_GATE: "mm.gate",
MODEL_TENSOR.V_TOK_BOI: "v.boi",
MODEL_TENSOR.V_TOK_EOI: "v.eoi",
MODEL_TENSOR.V_MM_PRE_NORM: "mm.pre_norm",
MODEL_TENSOR.V_TOK_IMG_BEGIN: "mm.image_begin",
MODEL_TENSOR.V_TOK_IMG_END: "mm.image_end",
MODEL_TENSOR.V_STD_BIAS: "v.std_bias", # gemma4
MODEL_TENSOR.V_STD_SCALE: "v.std_scale", # gemma4
# DeepSeek-OCR SAM
@@ -1393,6 +1402,9 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.V_MM_GATE,
MODEL_TENSOR.V_TOK_BOI,
MODEL_TENSOR.V_TOK_EOI,
MODEL_TENSOR.V_MM_PRE_NORM,
MODEL_TENSOR.V_TOK_IMG_BEGIN,
MODEL_TENSOR.V_TOK_IMG_END,
MODEL_TENSOR.V_STD_BIAS,
MODEL_TENSOR.V_STD_SCALE,
MODEL_TENSOR.V_SAM_POS_EMBD,
@@ -3987,6 +3999,7 @@ class GGMLQuantizationType(IntEnum):
TQ2_0 = 35
MXFP4 = 39
NVFP4 = 40
Q1_0 = 41
class ExpertGatingFuncType(IntEnum):
@@ -4040,6 +4053,7 @@ class LlamaFileType(IntEnum):
MOSTLY_TQ2_0 = 37 # except 1d tensors
MOSTLY_MXFP4_MOE = 38 # except 1d tensors
MOSTLY_NVFP4 = 39 # except 1d tensors
MOSTLY_Q1_0 = 40 # except 1d tensors
GUESSED = 1024 # not specified in the model file
@@ -4094,6 +4108,7 @@ class VisionProjectorType:
QWEN2VL = "qwen2vl_merger"
QWEN25VL = "qwen2.5vl_merger"
QWEN3VL = "qwen3vl_merger"
STEP3VL = "step3vl"
ULTRAVOX = "ultravox"
INTERNVL = "internvl"
QWEN2A = "qwen2a" # audio
@@ -4113,6 +4128,7 @@ class VisionProjectorType:
GLM4V = "glm4v"
YOUTUVL = "youtuvl"
NEMOTRON_V2_VL = "nemotron_v2_vl"
HUNYUANOCR = "hunyuanocr"
# Items here are (block size, type size)
@@ -4151,6 +4167,7 @@ GGML_QUANT_SIZES: dict[GGMLQuantizationType, tuple[int, int]] = {
GGMLQuantizationType.TQ2_0: (256, 2 + 64),
GGMLQuantizationType.MXFP4: (32, 1 + 16),
GGMLQuantizationType.NVFP4: (64, 4 + 32),
GGMLQuantizationType.Q1_0: (128, 2 + 16),
}
+39 -1
View File
@@ -1359,6 +1359,7 @@ class TensorNameMap:
"visual.merger.mlp.{bid}", # qwen2vl
"mlp_AR.linear_{bid}", # PaddleOCR-VL
"merger.mlp.{bid}",
"vit.perceive.proj.{bid}", # HunyuanOCR (proj.0 = conv1, proj.2 = conv2)
),
MODEL_TENSOR.V_MMPROJ_FC: (
@@ -1366,6 +1367,7 @@ class TensorNameMap:
"model.vision.linear_proj.linear_proj", # cogvlm
"model.projector.layers", # Deepseek-OCR
"visual.merger.proj", # glm4v
"vit.perceive.mlp", # HunyuanOCR
),
MODEL_TENSOR.V_MMPROJ_MLP: (
@@ -1393,6 +1395,7 @@ class TensorNameMap:
"model.vision_tower.embeddings.patch_embeddings.projection", # Intern-S1
"vpm.embeddings.patch_embedding",
"model.vision_model.embeddings.patch_embedding", # SmolVLM
"vit.embeddings.patch_embedding", # HunyuanOCR
"vision_tower.patch_conv", # pixtral-hf
"vision_encoder.patch_conv", # pixtral
"vision_model.patch_embedding.linear", # llama 4
@@ -1403,6 +1406,7 @@ class TensorNameMap:
"siglip2.vision_model.embeddings.patch_embedding",
"vision_model.radio_model.model.patch_generator.embedder", # Nemotron Nano v2 VL
"model.vision_tower.patch_embedder.input_proj", # gemma4
"vision_model.conv1", # Step3-VL
),
MODEL_TENSOR.V_ENC_EMBD_NORM: (
@@ -1414,6 +1418,7 @@ class TensorNameMap:
"model.vision_tower.embeddings.position_embeddings", # Intern-S1
"vpm.embeddings.position_embedding",
"model.vision_model.embeddings.position_embedding", # SmolVLM
"vit.embeddings.position_embedding", # HunyuanOCR
"vision_model.positional_embedding_vlm", # llama 4
"vision_tower.patch_embed.pos_emb", # kimi-vl
"visual.pos_embed", # qwen3vl
@@ -1421,22 +1426,26 @@ class TensorNameMap:
"visual.embeddings.position_embedding", # glm4v
"vision_model.radio_model.model.patch_generator.pos_embed", # Nemotron Nano v2 VL
"model.vision_tower.patch_embedder.position_embedding_table", # gemma4
"vision_model.positional_embedding", # Step3-VL
),
MODEL_TENSOR.V_ENC_EMBD_IMGNL: (
"model.image_newline", # Deepseek-OCR
"vit.perceive.image_newline", # HunyuanOCR
),
MODEL_TENSOR.V_ENC_EMBD_VSEP: (
"model.view_seperator", # Deepseek-OCR
"vit.perceive.image_sep", # HunyuanOCR
),
MODEL_TENSOR.V_ENC_ATTN_QKV: (
"visual.blocks.{bid}.attn.qkv", # qwen3vl
"model.vision.transformer.layers.{bid}.attention.query_key_value", # cogvlm
"model.vision_model.transformer.layers.{bid}.self_attn.qkv_proj", # Deepseek-OCR CLIP
"vision_tower.encoder.blocks.{bid}.wqkv" # Kimi-K2.5
"vision_tower.encoder.blocks.{bid}.wqkv", # Kimi-K2.5
"vision_model.radio_model.model.blocks.{bid}.attn.qkv", # Nemotron Nano v2 VL
"vision_model.transformer.resblocks.{bid}.attn.in_proj", # Step3-VL
),
MODEL_TENSOR.V_ENC_ATTN_Q: (
@@ -1444,6 +1453,7 @@ class TensorNameMap:
"model.vision_tower.encoder.layer.{bid}.attention.q_proj", # Intern-S1
"vpm.encoder.layers.{bid}.self_attn.q_proj",
"model.vision_model.encoder.layers.{bid}.self_attn.q_proj", # SmolVLM
"vit.layers.{bid}.self_attn.q_proj", # HunyuanOCR
"vision_model.model.layers.{bid}.self_attn.q_proj", # llama4
"vision_tower.transformer.layers.{bid}.attention.q_proj", # pixtral-hf
"vision_encoder.transformer.layers.{bid}.attention.wq", # pixtral
@@ -1466,6 +1476,7 @@ class TensorNameMap:
"model.vision_tower.encoder.layer.{bid}.attention.k_proj", # Intern-S1
"vpm.encoder.layers.{bid}.self_attn.k_proj",
"model.vision_model.encoder.layers.{bid}.self_attn.k_proj", # SmolVLM
"vit.layers.{bid}.self_attn.k_proj", # HunyuanOCR
"vision_model.model.layers.{bid}.self_attn.k_proj", # llama4
"vision_tower.transformer.layers.{bid}.attention.k_proj", # pixtral-hf
"vision_encoder.transformer.layers.{bid}.attention.wk", # pixtral
@@ -1488,6 +1499,7 @@ class TensorNameMap:
"model.vision_tower.encoder.layer.{bid}.attention.v_proj", # Intern-S1
"vpm.encoder.layers.{bid}.self_attn.v_proj",
"model.vision_model.encoder.layers.{bid}.self_attn.v_proj", # SmolVLM
"vit.layers.{bid}.self_attn.v_proj", # HunyuanOCR
"vision_model.model.layers.{bid}.self_attn.v_proj", # llama4
"vision_tower.transformer.layers.{bid}.attention.v_proj", # pixtral-hf
"vision_encoder.transformer.layers.{bid}.attention.wv", # pixtral
@@ -1504,6 +1516,7 @@ class TensorNameMap:
"model.vision_tower.encoder.layer.{bid}.layernorm_before", # Intern-S1
"vpm.encoder.layers.{bid}.layer_norm1",
"model.vision_model.encoder.layers.{bid}.layer_norm1", # SmolVLM
"vit.layers.{bid}.input_layernorm", # HunyuanOCR
"vision_tower.transformer.layers.{bid}.attention_norm", # pixtral-hf
"vision_encoder.transformer.layers.{bid}.attention_norm", # pixtral
"vision_model.model.layers.{bid}.input_layernorm", # llama4, gemma4
@@ -1513,6 +1526,7 @@ class TensorNameMap:
"model.vision_model.transformer.layers.{bid}.layer_norm1", # Deepseek-OCR CLIP
"siglip2.vision_model.encoder.layers.{bid}.layer_norm1",
"vision_model.radio_model.model.blocks.{bid}.norm1", # Nemotron Nano v2 VL
"vision_model.transformer.resblocks.{bid}.ln_1", # Step3-VL
),
MODEL_TENSOR.V_ENC_ATTN_O: (
@@ -1521,6 +1535,7 @@ class TensorNameMap:
"model.vision_tower.encoder.layer.{bid}.attention.projection_layer", # Intern-S1
"vpm.encoder.layers.{bid}.self_attn.out_proj",
"model.vision_model.encoder.layers.{bid}.self_attn.out_proj", # SmolVLM
"vit.layers.{bid}.self_attn.o_proj", # HunyuanOCR
"model.vision_model.encoder.layers.{bid}.self_attn.projection_layer", # Janus Pro
"vision_model.model.layers.{bid}.self_attn.o_proj", # llama4
"vision_tower.transformer.layers.{bid}.attention.o_proj", # pixtral-hf
@@ -1532,6 +1547,7 @@ class TensorNameMap:
"siglip2.vision_model.encoder.layers.{bid}.self_attn.out_proj", # youtuvl
"vision_model.radio_model.model.blocks.{bid}.attn.proj", # Nemotron Nano v2 VL
"vision_model.model.layers.{bid}.self_attn.o_proj.linear", # gemma4
"vision_model.transformer.resblocks.{bid}.attn.out_proj", # Step3-VL
),
MODEL_TENSOR.V_ENC_POST_ATTN_NORM: (
@@ -1540,6 +1556,7 @@ class TensorNameMap:
"model.vision_tower.encoder.layer.{bid}.layernorm_after", # Intern-S1
"vpm.encoder.layers.{bid}.layer_norm2",
"model.vision_model.encoder.layers.{bid}.layer_norm2", # SmolVLM
"vit.layers.{bid}.post_attention_layernorm", # HunyuanOCR
"vision_model.model.layers.{bid}.post_attention_layernorm", # llama4
"vision_tower.transformer.layers.{bid}.ffn_norm", # pixtral-hf
"vision_encoder.transformer.layers.{bid}.ffn_norm", # pixtral
@@ -1550,6 +1567,7 @@ class TensorNameMap:
"siglip2.vision_model.encoder.layers.{bid}.layer_norm2",
"vision_model.radio_model.model.blocks.{bid}.norm2", # Nemotron Nano v2 VL
"vision_model.model.layers.{bid}.pre_feedforward_layernorm", # gemma4
"vision_model.transformer.resblocks.{bid}.ln_2", # Step3-VL
),
MODEL_TENSOR.V_ENC_FFN_UP: (
@@ -1557,6 +1575,7 @@ class TensorNameMap:
"model.vision_tower.encoder.layer.{bid}.mlp.fc1", # Intern-S1
"vpm.encoder.layers.{bid}.mlp.fc1",
"model.vision_model.encoder.layers.{bid}.mlp.fc1", # SmolVLM, gemma3
"vit.layers.{bid}.mlp.dense_h_to_4h", # HunyuanOCR
"vision_tower.transformer.layers.{bid}.feed_forward.up_proj", # pixtral-hf
"vision_encoder.transformer.layers.{bid}.feed_forward.w3", # pixtral
"vision_model.model.layers.{bid}.mlp.fc1", # llama4
@@ -1569,6 +1588,7 @@ class TensorNameMap:
"siglip2.vision_model.encoder.layers.{bid}.mlp.fc1",
"vision_model.radio_model.model.blocks.{bid}.mlp.fc1", # Nemotron Nano v2 VL
"vision_model.model.layers.{bid}.mlp.up_proj", # gemma4
"vision_model.transformer.resblocks.{bid}.mlp.c_fc", # Step3-VL
),
MODEL_TENSOR.V_ENC_FFN_GATE: (
@@ -1583,6 +1603,7 @@ class TensorNameMap:
"model.vision_tower.encoder.layer.{bid}.mlp.fc2", # Intern-S1
"vpm.encoder.layers.{bid}.mlp.fc2",
"model.vision_model.encoder.layers.{bid}.mlp.fc2", # SmolVLM, gemma3
"vit.layers.{bid}.mlp.dense_4h_to_h", # HunyuanOCR
"vision_tower.transformer.layers.{bid}.feed_forward.down_proj", # pixtral-hf
"vision_encoder.transformer.layers.{bid}.feed_forward.w2", # pixtral
"vision_model.model.layers.{bid}.mlp.fc2", # llama4
@@ -1595,6 +1616,7 @@ class TensorNameMap:
"siglip2.vision_model.encoder.layers.{bid}.mlp.fc2",
"vision_model.radio_model.model.blocks.{bid}.mlp.fc2", # Nemotron Nano v2 VL
"vision_model.model.layers.{bid}.mlp.down_proj", # gemma4
"vision_model.transformer.resblocks.{bid}.mlp.c_proj", # Step3-VL
),
MODEL_TENSOR.V_ENC_ATTN_POST_NORM: (
@@ -1608,11 +1630,13 @@ class TensorNameMap:
MODEL_TENSOR.V_LAYER_SCALE_1: (
"vision_tower.vision_model.encoder.layers.{bid}.ls1", # InternVL
"model.vision_tower.encoder.layer.{bid}.lambda_1", # Intern-S1
"vision_model.transformer.resblocks.{bid}.ls_1", # Step3-VL
),
MODEL_TENSOR.V_LAYER_SCALE_2: (
"vision_tower.vision_model.encoder.layers.{bid}.ls2", # InternVL
"model.vision_tower.encoder.layer.{bid}.lambda_2", # Intern-S1
"vision_model.transformer.resblocks.{bid}.ls_2", # Step3-VL
),
MODEL_TENSOR.V_LAYER_OUT_SCALE: (
@@ -1625,6 +1649,7 @@ class TensorNameMap:
"vision_encoder.ln_pre", # pixtral
"vision_model.layernorm_pre", # llama4
"model.vision_model.pre_layrnorm", # Deepseek-OCR CLIP
"vision_model.ln_pre", # Step3-VL
),
MODEL_TENSOR.V_POST_NORM: (
@@ -1639,6 +1664,7 @@ class TensorNameMap:
MODEL_TENSOR.V_MM_POST_NORM: (
"visual.merger.post_projection_norm", # glm4v
"vit.perceive.after_rms", # HunyuanOCR
),
MODEL_TENSOR.V_MM_INP_PROJ: (
@@ -1806,6 +1832,18 @@ class TensorNameMap:
"model.vision.eoi", # cogvlm
),
MODEL_TENSOR.V_MM_PRE_NORM: (
"vit.perceive.before_rms", # HunyuanOCR
),
MODEL_TENSOR.V_TOK_IMG_BEGIN: (
"vit.perceive.image_begin", # HunyuanOCR
),
MODEL_TENSOR.V_TOK_IMG_END: (
"vit.perceive.image_end", # HunyuanOCR
),
MODEL_TENSOR.V_STD_BIAS: (
"model.vision_tower.std_bias", # gemma4
),
+1
View File
@@ -154,6 +154,7 @@ extern "C" {
LLAMA_FTYPE_MOSTLY_TQ2_0 = 37, // except 1d tensors
LLAMA_FTYPE_MOSTLY_MXFP4_MOE = 38, // except 1d tensors
LLAMA_FTYPE_MOSTLY_NVFP4 = 39, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q1_0 = 40, // except 1d tensors
LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
};
@@ -0,0 +1,282 @@
{%- macro format_parameters(properties, required) -%}
{%- set standard_keys = ['description', 'type', 'properties', 'required', 'nullable'] -%}
{%- set ns = namespace(found_first=false) -%}
{%- for key, value in properties | dictsort -%}
{%- set add_comma = false -%}
{%- if key not in standard_keys -%}
{%- if ns.found_first %},{% endif -%}
{%- set ns.found_first = true -%}
{{ key }}:{
{%- if value['description'] -%}
description:<|"|>{{ value['description'] }}<|"|>
{%- set add_comma = true -%}
{%- endif -%}
{%- if value['nullable'] %}
{%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
nullable:true
{%- endif -%}
{%- if value['type'] | upper == 'STRING' -%}
{%- if value['enum'] -%}
{%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
enum:{{ format_argument(value['enum']) }}
{%- endif -%}
{%- elif value['type'] | upper == 'OBJECT' -%}
,properties:{
{%- if value['properties'] is defined and value['properties'] is mapping -%}
{{- format_parameters(value['properties'], value['required'] | default([])) -}}
{%- elif value is mapping -%}
{{- format_parameters(value, value['required'] | default([])) -}}
{%- endif -%}
}
{%- if value['required'] -%}
,required:[
{%- for item in value['required'] | default([]) -%}
<|"|>{{- item -}}<|"|>
{%- if not loop.last %},{% endif -%}
{%- endfor -%}
]
{%- endif -%}
{%- elif value['type'] | upper == 'ARRAY' -%}
{%- if value['items'] is mapping and value['items'] -%}
,items:{
{%- set ns_items = namespace(found_first=false) -%}
{%- for item_key, item_value in value['items'] | dictsort -%}
{%- if item_value is not none -%}
{%- if ns_items.found_first %},{% endif -%}
{%- set ns_items.found_first = true -%}
{%- if item_key == 'properties' -%}
properties:{
{%- if item_value is mapping -%}
{{- format_parameters(item_value, value['items']['required'] | default([])) -}}
{%- endif -%}
}
{%- elif item_key == 'required' -%}
required:[
{%- for req_item in item_value -%}
<|"|>{{- req_item -}}<|"|>
{%- if not loop.last %},{% endif -%}
{%- endfor -%}
]
{%- elif item_key == 'type' -%}
{%- if item_value is string -%}
type:{{ format_argument(item_value | upper) }}
{%- else -%}
type:{{ format_argument(item_value | map('upper') | list) }}
{%- endif -%}
{%- else -%}
{{ item_key }}:{{ format_argument(item_value) }}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
}
{%- endif -%}
{%- endif -%}
{%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
type:<|"|>{{ value['type'] | upper }}<|"|>}
{%- endif -%}
{%- endfor -%}
{%- endmacro -%}
{%- macro format_function_declaration(tool_data) -%}
declaration:{{- tool_data['function']['name'] -}}{description:<|"|>{{- tool_data['function']['description'] -}}<|"|>
{%- set params = tool_data['function']['parameters'] -%}
{%- if params -%}
,parameters:{
{%- if params['properties'] -%}
properties:{ {{- format_parameters(params['properties'], params['required']) -}} },
{%- endif -%}
{%- if params['required'] -%}
required:[
{%- for item in params['required'] -%}
<|"|>{{- item -}}<|"|>
{{- ',' if not loop.last -}}
{%- endfor -%}
],
{%- endif -%}
{%- if params['type'] -%}
type:<|"|>{{- params['type'] | upper -}}<|"|>}
{%- endif -%}
{%- endif -%}
{%- if 'response' in tool_data['function'] -%}
{%- set response_declaration = tool_data['function']['response'] -%}
,response:{
{%- if response_declaration['description'] -%}
description:<|"|>{{- response_declaration['description'] -}}<|"|>,
{%- endif -%}
{%- if response_declaration['type'] | upper == 'OBJECT' -%}
type:<|"|>{{- response_declaration['type'] | upper -}}<|"|>}
{%- endif -%}
{%- endif -%}
}
{%- endmacro -%}
{%- macro format_argument(argument, escape_keys=True) -%}
{%- if argument is string -%}
{{- '<|"|>' + argument + '<|"|>' -}}
{%- elif argument is boolean -%}
{{- 'true' if argument else 'false' -}}
{%- elif argument is mapping -%}
{{- '{' -}}
{%- set ns = namespace(found_first=false) -%}
{%- for key, value in argument | dictsort -%}
{%- if ns.found_first %},{% endif -%}
{%- set ns.found_first = true -%}
{%- if escape_keys -%}
{{- '<|"|>' + key + '<|"|>' -}}
{%- else -%}
{{- key -}}
{%- endif -%}
:{{- format_argument(value, escape_keys=escape_keys) -}}
{%- endfor -%}
{{- '}' -}}
{%- elif argument is sequence -%}
{{- '[' -}}
{%- for item in argument -%}
{{- format_argument(item, escape_keys=escape_keys) -}}
{%- if not loop.last %},{% endif -%}
{%- endfor -%}
{{- ']' -}}
{%- else -%}
{{- argument -}}
{%- endif -%}
{%- endmacro -%}
{%- macro strip_thinking(text) -%}
{%- set ns = namespace(result='') -%}
{%- for part in text.split('<channel|>') -%}
{%- if '<|channel>' in part -%}
{%- set ns.result = ns.result + part.split('<|channel>')[0] -%}
{%- else -%}
{%- set ns.result = ns.result + part -%}
{%- endif -%}
{%- endfor -%}
{{- ns.result | trim -}}
{%- endmacro -%}
{%- set ns = namespace(prev_message_type=None, last_user_message=-1) -%}
{%- set loop_messages = messages -%}
{{ bos_token }}
{#- Handle System/Tool Definitions Block -#}
{%- if (enable_thinking is defined and enable_thinking) or tools or messages[0]['role'] in ['system', 'developer'] -%}
{{- '<|turn>system\n' -}}
{#- Inject Thinking token at the very top of the FIRST system turn -#}
{%- if enable_thinking is defined and enable_thinking -%}
{{- '<|think|>' -}}
{%- set ns.prev_message_type = 'think' -%}
{%- endif -%}
{%- if messages[0]['role'] in ['system', 'developer'] -%}
{{- messages[0]['content'] | trim -}}
{%- set loop_messages = messages[1:] -%}
{%- endif -%}
{%- if tools -%}
{%- for tool in tools %}
{{- '<|tool>' -}}
{{- format_function_declaration(tool) | trim -}}
{{- '<tool|>' -}}
{%- endfor %}
{%- set ns.prev_message_type = 'tool' -%}
{%- endif -%}
{{- '<turn|>\n' -}}
{%- endif %}
{#- Find last user message -#}
{%- for message in loop_messages -%}
{%- if message['role'] == 'user' -%}
{%- set ns.last_user_message = loop.index0 -%}
{%- endif -%}
{%- endfor -%}
{#- Loop through messages -#}
{%- for message in loop_messages -%}
{%- set role = 'model' if message['role'] == 'assistant' else message['role'] -%}
{%- if not (ns.prev_message_type == 'tool_response' and message['tool_calls']) -%}
{{- '<|turn>' + role + '\n' }}
{%- endif -%}
{%- set ns.prev_message_type = None -%}
{%- if message['tool_calls'] -%}
{#- Preserve reasoning between tool calls for model turns that come after the last user turn -#}
{%- if message['reasoning_content'] and loop.index0 > ns.last_user_message -%}
{{- '<|channel>thought\n' -}}
{{- message['reasoning_content'] -}}
{{- '<channel|>' -}}
{%- endif -%}
{%- for tool_call in message['tool_calls'] -%}
{%- set function = tool_call['function'] -%}
{{- '<|tool_call>call:' + function['name'] + '{' -}}
{%- if function['arguments'] is mapping -%}
{%- set ns_args = namespace(found_first=false) -%}
{%- for key, value in function['arguments'] | dictsort -%}
{%- if ns_args.found_first %},{% endif -%}
{%- set ns_args.found_first = true -%}
{{- key -}}:{{- format_argument(value, escape_keys=False) -}}
{%- endfor -%}
{%- elif function['arguments'] is string -%}
{{- function['arguments'] -}}
{%- endif -%}
{{- '}<tool_call|>' -}}
{%- endfor -%}
{%- set ns.prev_message_type = 'tool_call' -%}
{%- endif -%}
{%- if message['tool_responses'] -%}
{#- Tool Response handling -#}
{%- for tool_response in message['tool_responses'] -%}
{{- '<|tool_response>' -}}
{%- if tool_response['response'] is mapping -%}
{{- 'response:' + tool_response['name'] | default('unknown') + '{' -}}
{%- for key, value in tool_response['response'] | dictsort -%}
{{- key -}}:{{- format_argument(value, escape_keys=False) -}}
{%- if not loop.last %},{% endif -%}
{%- endfor -%}
{{- '}' -}}
{%- else -%}
{{- 'response:' + tool_response['name'] | default('unknown') + '{value:' + format_argument(tool_response['response'], escape_keys=False) + '}' -}}
{%- endif -%}
{{- '<tool_response|>' -}}
{%- endfor -%}
{%- set ns.prev_message_type = 'tool_response' -%}
{%- endif -%}
{%- if message['content'] is string -%}
{%- if role == 'model' -%}
{{- strip_thinking(message['content']) -}}
{%- else -%}
{{- message['content'] | trim -}}
{%- endif -%}
{%- elif message['content'] is sequence -%}
{%- for item in message['content'] -%}
{%- if item['type'] == 'text' -%}
{%- if role == 'model' -%}
{{- strip_thinking(item['text']) -}}
{%- else -%}
{{- item['text'] | trim -}}
{%- endif -%}
{%- elif item['type'] == 'image' -%}
{{- '\n\n<|image|>\n\n' -}}
{%- set ns.prev_message_type = 'image' -%}
{%- elif item['type'] == 'audio' -%}
{{- '<|audio|>' -}}
{%- set ns.prev_message_type = 'audio' -%}
{%- elif item['type'] == 'video' -%}
{{- '\n\n<|video|>\n\n' -}}
{%- set ns.prev_message_type = 'video' -%}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{%- if not (message['tool_responses'] and not message['content']) -%}
{{- '<turn|>\n' -}}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{%- if ns.prev_message_type != 'tool_response' -%}
{{- '<|turn>model\n' -}}
{%- endif -%}
{%- if not enable_thinking | default(false) -%}
{{- '<|channel>thought\n<channel|>' -}}
{%- endif -%}
{%- endif -%}
+5 -2
View File
@@ -29,7 +29,8 @@ LLAMA_BENCH_DB_FIELDS = [
"cpu_mask", "cpu_strict", "poll", "type_k", "type_v", "n_gpu_layers",
"split_mode", "main_gpu", "no_kv_offload", "flash_attn", "tensor_split", "tensor_buft_overrides",
"use_mmap", "embeddings", "no_op_offload", "n_prompt", "n_gen", "n_depth",
"test_time", "avg_ns", "stddev_ns", "avg_ts", "stddev_ts", "n_cpu_moe"
"test_time", "avg_ns", "stddev_ns", "avg_ts", "stddev_ts", "n_cpu_moe",
"fit_target", "fit_min_ctx"
]
LLAMA_BENCH_DB_TYPES = [
@@ -39,6 +40,7 @@ LLAMA_BENCH_DB_TYPES = [
"TEXT", "INTEGER", "INTEGER", "INTEGER", "TEXT", "TEXT",
"INTEGER", "INTEGER", "INTEGER", "INTEGER", "INTEGER", "INTEGER",
"TEXT", "INTEGER", "INTEGER", "REAL", "REAL", "INTEGER",
"INTEGER", "INTEGER"
]
# All test-backend-ops SQL fields
@@ -61,7 +63,8 @@ assert len(TEST_BACKEND_OPS_DB_FIELDS) == len(TEST_BACKEND_OPS_DB_TYPES)
LLAMA_BENCH_KEY_PROPERTIES = [
"cpu_info", "gpu_info", "backends", "n_gpu_layers", "n_cpu_moe", "tensor_buft_overrides", "model_filename", "model_type",
"n_batch", "n_ubatch", "embeddings", "cpu_mask", "cpu_strict", "poll", "n_threads", "type_k", "type_v",
"use_mmap", "no_kv_offload", "split_mode", "main_gpu", "tensor_split", "flash_attn", "n_prompt", "n_gen", "n_depth"
"use_mmap", "no_kv_offload", "split_mode", "main_gpu", "tensor_split", "flash_attn", "n_prompt", "n_gen", "n_depth",
"fit_target", "fit_min_ctx"
]
# Properties by which to differentiate results per commit for test-backend-ops:
+17 -2041
View File
File diff suppressed because it is too large Load Diff
-2
View File
@@ -585,8 +585,6 @@ struct LLM_TN_IMPL {
const int bid;
const int xid;
const std::set<llm_tensor> model_tensors;
LLM_TN_IMPL(llm_arch arch, llm_tensor tensor, const char * suffix, int bid, int xid);
std::string str() const;
+19
View File
@@ -73,6 +73,7 @@ static const std::map<std::string, llm_chat_template> LLM_CHAT_TEMPLATES = {
{ "hunyuan-moe", LLM_CHAT_TEMPLATE_HUNYUAN_MOE },
{ "gpt-oss", LLM_CHAT_TEMPLATE_OPENAI_MOE },
{ "hunyuan-dense", LLM_CHAT_TEMPLATE_HUNYUAN_DENSE },
{ "hunyuan-ocr", LLM_CHAT_TEMPLATE_HUNYUAN_OCR },
{ "kimi-k2", LLM_CHAT_TEMPLATE_KIMI_K2 },
{ "seed_oss", LLM_CHAT_TEMPLATE_SEED_OSS },
{ "grok-2", LLM_CHAT_TEMPLATE_GROK_2 },
@@ -216,6 +217,8 @@ llm_chat_template llm_chat_detect_template(const std::string & tmpl) {
return LLM_CHAT_TEMPLATE_HUNYUAN_MOE;
} else if (tmpl_contains("<|start|>") && tmpl_contains("<|channel|>")) {
return LLM_CHAT_TEMPLATE_OPENAI_MOE;
} else if (tmpl_contains("<hy_Assistant>") && tmpl_contains("<hy_begin▁of▁sentence>")) {
return LLM_CHAT_TEMPLATE_HUNYUAN_OCR;
} else if (tmpl_contains("<hy_Assistant>") && tmpl_contains("<hy_place▁holder▁no▁3>")) {
return LLM_CHAT_TEMPLATE_HUNYUAN_DENSE;
} else if (tmpl_contains("<|im_assistant|>assistant<|im_middle|>")) {
@@ -822,6 +825,22 @@ int32_t llm_chat_apply_template(
ss << "<hy_User>" << chat[i]->content << "<hy_Assistant>";
}
}
} else if (tmpl == LLM_CHAT_TEMPLATE_HUNYUAN_OCR) {
// tencent/HunyuanOCR
ss << "<hy_begin▁of▁sentence>";
for (size_t i = 0; i < chat.size(); i++) {
std::string role(chat[i]->role);
if (i == 0 && role == "system") {
ss << chat[i]->content << "<hy_place▁holder▁no▁3>";
continue;
}
if (role == "user") {
ss << chat[i]->content << "<hy_User>";
} else if (role == "assistant") {
ss << chat[i]->content << "<hy_Assistant>";
}
}
} else if (tmpl == LLM_CHAT_TEMPLATE_KIMI_K2) {
// moonshotai/Kimi-K2-Instruct
for (auto message : chat) {
+1
View File
@@ -53,6 +53,7 @@ enum llm_chat_template {
LLM_CHAT_TEMPLATE_HUNYUAN_MOE,
LLM_CHAT_TEMPLATE_OPENAI_MOE,
LLM_CHAT_TEMPLATE_HUNYUAN_DENSE,
LLM_CHAT_TEMPLATE_HUNYUAN_OCR,
LLM_CHAT_TEMPLATE_KIMI_K2,
LLM_CHAT_TEMPLATE_SEED_OSS,
LLM_CHAT_TEMPLATE_GROK_2,
+2 -2
View File
@@ -2942,7 +2942,7 @@ llama_context * llama_init_from_model(
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_DISABLED;
}
if (params.flash_attn_type == LLAMA_FLASH_ATTN_TYPE_AUTO && ggml_is_quantized(params.type_k)) {
if (params.flash_attn_type != LLAMA_FLASH_ATTN_TYPE_DISABLED && ggml_is_quantized(params.type_k)) {
const uint32_t blck_size = ggml_blck_size(params.type_k);
for (uint32_t il = 0; il < model->hparams.n_layer; ++il) {
if (model->hparams.n_embd_head_k(il) % blck_size != 0) {
@@ -2953,7 +2953,7 @@ llama_context * llama_init_from_model(
}
}
if (params.flash_attn_type == LLAMA_FLASH_ATTN_TYPE_AUTO && ggml_is_quantized(params.type_v)) {
if (params.flash_attn_type != LLAMA_FLASH_ATTN_TYPE_DISABLED && ggml_is_quantized(params.type_v)) {
const uint32_t blck_size = ggml_blck_size(params.type_v);
for (uint32_t il = 0; il < model->hparams.n_layer; ++il) {
if (model->hparams.n_embd_head_v(il) % blck_size != 0) {
+31 -9
View File
@@ -511,6 +511,14 @@ void llm_graph_input_attn_kv_iswa::set_input(const llama_ubatch * ubatch) {
if (self_v_rot) {
mctx->get_base()->set_input_v_rot(self_v_rot);
}
if (self_k_rot_swa) {
mctx->get_swa()->set_input_k_rot(self_k_rot_swa);
}
if (self_v_rot_swa) {
mctx->get_swa()->set_input_v_rot(self_v_rot_swa);
}
}
bool llm_graph_input_attn_kv_iswa::can_reuse(const llm_graph_params & params) {
@@ -681,6 +689,14 @@ void llm_graph_input_mem_hybrid_iswa::set_input(const llama_ubatch * ubatch) {
attn_ctx->get_base()->set_input_v_rot(inp_attn->self_v_rot);
}
if (inp_attn->self_k_rot_swa) {
attn_ctx->get_swa()->set_input_k_rot(inp_attn->self_k_rot_swa);
}
if (inp_attn->self_v_rot_swa) {
attn_ctx->get_swa()->set_input_v_rot(inp_attn->self_v_rot_swa);
}
const int64_t n_rs = mctx->get_recr()->get_n_rs();
if (inp_rs->s_copy) {
@@ -2233,15 +2249,20 @@ ggml_tensor * llm_graph_context::build_attn(
ggml_tensor * v_mla,
float kq_scale,
int il) const {
if (inp->self_k_rot) {
q_cur = ggml_mul_mat_aux(ctx0, q_cur, inp->self_k_rot);
const bool is_swa = hparams.is_swa(il);
auto * k_rot = is_swa ? inp->self_k_rot_swa : inp->self_k_rot;
auto * v_rot = is_swa ? inp->self_v_rot_swa : inp->self_v_rot;
if (k_rot) {
q_cur = ggml_mul_mat_aux(ctx0, q_cur, k_rot);
if (k_cur) {
k_cur = ggml_mul_mat_aux(ctx0, k_cur, inp->self_k_rot);
k_cur = ggml_mul_mat_aux(ctx0, k_cur, k_rot);
}
}
if (inp->self_v_rot) {
if (v_rot) {
if (v_cur) {
v_cur = ggml_mul_mat_aux(ctx0, v_cur, inp->self_v_rot);
v_cur = ggml_mul_mat_aux(ctx0, v_cur, v_rot);
}
}
@@ -2259,8 +2280,6 @@ ggml_tensor * llm_graph_context::build_attn(
const auto * mctx_iswa = inp->mctx;
const bool is_swa = hparams.is_swa(il);
const auto * mctx_cur = is_swa ? mctx_iswa->get_swa() : mctx_iswa->get_base();
// optionally store to KV cache
@@ -2285,8 +2304,8 @@ ggml_tensor * llm_graph_context::build_attn(
ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
cb(cur, "kqv_out", il);
if (inp->self_v_rot) {
cur = ggml_mul_mat_aux(ctx0, cur, inp->self_v_rot);
if (v_rot) {
cur = ggml_mul_mat_aux(ctx0, cur, v_rot);
}
if (wo) {
@@ -2388,6 +2407,9 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
inp->self_k_rot = mctx_cur->get_base()->build_input_k_rot(ctx0);
inp->self_v_rot = mctx_cur->get_base()->build_input_v_rot(ctx0);
inp->self_k_rot_swa = mctx_cur->get_swa()->build_input_k_rot(ctx0);
inp->self_v_rot_swa = mctx_cur->get_swa()->build_input_v_rot(ctx0);
return (llm_graph_input_attn_kv_iswa *) res->add_input(std::move(inp));
}
+4 -2
View File
@@ -308,7 +308,7 @@ public:
ggml_tensor * self_kq_mask = nullptr; // F32 [n_kv, n_batch/n_stream, 1, n_stream]
ggml_tensor * self_kq_mask_cnv = nullptr; // [n_kv, n_batch/n_stream, 1, n_stream]
// note: assumes v_rot^ == I
// note: assumes v_rot^2 == I
ggml_tensor * self_k_rot = nullptr;
ggml_tensor * self_v_rot = nullptr;
@@ -388,10 +388,12 @@ public:
ggml_tensor * self_kq_mask_swa = nullptr; // F32 [n_kv, n_batch/n_stream, 1, n_stream]
ggml_tensor * self_kq_mask_swa_cnv = nullptr; // [n_kv, n_batch/n_stream, 1, n_stream]
// note: using same rotation matrices for both base and swa cache
ggml_tensor * self_k_rot = nullptr;
ggml_tensor * self_v_rot = nullptr;
ggml_tensor * self_k_rot_swa = nullptr;
ggml_tensor * self_v_rot_swa = nullptr;
const llama_hparams hparams;
const llama_cparams cparams;
+1 -1
View File
@@ -128,7 +128,7 @@ static std::string gguf_data_to_str(enum gguf_type type, const void * data, int
case GGUF_TYPE_INT64: return std::to_string(((const int64_t *)data)[i]);
case GGUF_TYPE_FLOAT32: return std::to_string(((const float *)data)[i]);
case GGUF_TYPE_FLOAT64: return std::to_string(((const double *)data)[i]);
case GGUF_TYPE_BOOL: return ((const bool *)data)[i] ? "true" : "false";
case GGUF_TYPE_BOOL: return ((const int8_t *)data)[i] != 0 ? "true" : "false";
default: return format("unknown type %d", type);
}
}
+18 -6
View File
@@ -169,6 +169,18 @@ llama_kv_cache::llama_kv_cache(
continue;
}
if (n_embd_head_k_all == 0) {
n_embd_head_k_all = (int32_t) hparams.n_embd_head_k(il);
} else if (n_embd_head_k_all > 0 && n_embd_head_k_all != (int32_t) hparams.n_embd_head_k(il)) {
n_embd_head_k_all = -1;
}
if (n_embd_head_v_all == 0) {
n_embd_head_v_all = (int32_t) hparams.n_embd_head_v(il);
} else if (n_embd_head_v_all > 0 && n_embd_head_v_all != (int32_t) hparams.n_embd_head_v(il)) {
n_embd_head_v_all = -1;
}
// [TAG_V_CACHE_VARIABLE]
const uint32_t n_embd_k_gqa = hparams.n_embd_k_gqa(il);
const uint32_t n_embd_v_gqa = !v_trans ? hparams.n_embd_v_gqa(il) : hparams.n_embd_v_gqa_max();
@@ -276,23 +288,23 @@ llama_kv_cache::llama_kv_cache(
attn_rot_k =
!attn_rot_disable &&
n_embd_head_k_all > 0 &&
ggml_is_quantized(type_k) &&
!hparams.is_n_embd_k_gqa_variable() &&
hparams.n_embd_head_k() % 64 == 0;
attn_rot_v =
!attn_rot_disable &&
n_embd_head_v_all > 0 &&
ggml_is_quantized(type_v) &&
!hparams.is_n_embd_v_gqa_variable() &&
hparams.n_embd_head_v() % 64 == 0;
LLAMA_LOG_INFO("%s: attn_rot_k = %d\n", __func__, attn_rot_k);
LLAMA_LOG_INFO("%s: attn_rot_v = %d\n", __func__, attn_rot_v);
LLAMA_LOG_INFO("%s: attn_rot_k = %d, n_embd_head_k_all = %d\n", __func__, attn_rot_k, n_embd_head_k_all);
LLAMA_LOG_INFO("%s: attn_rot_v = %d, n_embd_head_k_all = %d\n", __func__, attn_rot_v, n_embd_head_v_all);
// pre-compute the haramard matrices and keep them in host memory
// TODO: in the future, we can make copies in the backend buffers to avoid host -> device transfers
if (attn_rot_k || attn_rot_v) {
for (int64_t n = 64; n <= std::max(hparams.n_embd_head_k(), hparams.n_embd_head_v()); n *= 2) {
for (int64_t n = 64; n <= std::max(n_embd_head_k_all, n_embd_head_v_all); n *= 2) {
attn_rot_hadamard[n] = std::vector<float>(n*n);
ggml_init_params params = {
@@ -1308,7 +1320,7 @@ ggml_tensor * llama_kv_cache::build_input_k_rot(ggml_context * ctx) const {
// ref: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4141323088
do {
nrot *= 2;
} while (hparams.n_embd_head_k() % nrot == 0);
} while (n_embd_head_k_all % nrot == 0);
nrot /= 2;
res = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nrot, nrot);
+5
View File
@@ -239,6 +239,11 @@ private:
bool attn_rot_k = false;
bool attn_rot_v = false;
// if all layers participating in the cache have constant head size, the value is stored here
// otherwise the value is -1
int32_t n_embd_head_k_all = 0;
int32_t n_embd_head_v_all = 0;
// pre-computed hadamard martrices
std::unordered_map<int64_t, std::vector<float>> attn_rot_hadamard;
+5 -2
View File
@@ -36,6 +36,7 @@ static std::string llama_model_ftype_name(llama_ftype ftype) {
case LLAMA_FTYPE_ALL_F32: return "all F32";
case LLAMA_FTYPE_MOSTLY_F16: return "F16";
case LLAMA_FTYPE_MOSTLY_BF16: return "BF16";
case LLAMA_FTYPE_MOSTLY_Q1_0: return "Q1_0";
case LLAMA_FTYPE_MOSTLY_Q4_0: return "Q4_0";
case LLAMA_FTYPE_MOSTLY_Q4_1: return "Q4_1";
case LLAMA_FTYPE_MOSTLY_Q5_0: return "Q5_0";
@@ -374,8 +375,9 @@ namespace GGUFMeta {
}
} else {
if (arr_info.gt == GGUF_TYPE_BOOL) {
std::transform((const bool *)arr_info.data, (const bool *)arr_info.data + arr_info.length, result.begin(), [](bool x) {
return static_cast<T>(x);
const int8_t * values = (const int8_t *) arr_info.data;
std::transform(values, values + arr_info.length, result.begin(), [](int8_t x) {
return static_cast<T>(x != 0);
});
} else {
std::copy((const T*)arr_info.data, (const T *)arr_info.data + arr_info.length, result.begin());
@@ -757,6 +759,7 @@ llama_model_loader::llama_model_loader(
case GGML_TYPE_IQ4_XS: ftype = LLAMA_FTYPE_MOSTLY_IQ4_XS; break;
case GGML_TYPE_IQ3_S: ftype = LLAMA_FTYPE_MOSTLY_IQ3_S; break;
case GGML_TYPE_NVFP4: ftype = LLAMA_FTYPE_MOSTLY_NVFP4; break;
case GGML_TYPE_Q1_0: ftype = LLAMA_FTYPE_MOSTLY_Q1_0; break;
default:
{
LLAMA_LOG_WARN("%s: unknown type %s\n", __func__, ggml_type_name(type_max));
+11 -9
View File
@@ -1279,6 +1279,7 @@ void llama_model::load_hparams(llama_model_loader & ml) {
ml.get_key(LLM_KV_EMBEDDING_LENGTH_PER_LAYER, hparams.n_embd_per_layer);
ml.get_key(LLM_KV_ATTENTION_KEY_LENGTH_SWA, hparams.n_embd_head_k_swa);
ml.get_key(LLM_KV_ATTENTION_VALUE_LENGTH_SWA, hparams.n_embd_head_v_swa);
ml.get_key(LLM_KV_FINAL_LOGIT_SOFTCAPPING, hparams.f_final_logit_softcapping, false);
switch (hparams.n_layer) {
case 35: type = LLM_TYPE_E2B; break;
@@ -4210,13 +4211,14 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
output = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, TENSOR_DUPLICATED);
}
tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);
tok_embd_per_layer = create_tensor(tn(LLM_TENSOR_PER_LAYER_TOKEN_EMBD, "weight"), {n_embd_altup * n_layer, n_vocab}, 0);
tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);
altup_proj = create_tensor(tn(LLM_TENSOR_ALTUP_PROJ, "weight"), {n_embd, n_embd, n_altup - 1}, 0);
altup_unembd_proj = create_tensor(tn(LLM_TENSOR_ALTUP_UNEMBD_PROJ, "weight"), {n_embd, n_embd, n_altup - 1}, 0);
per_layer_model_proj = create_tensor(tn(LLM_TENSOR_PER_LAYER_MODEL_PROJ, "weight"), {n_embd, n_embd_altup * n_layer}, 0);
per_layer_proj_norm = create_tensor(tn(LLM_TENSOR_PER_LAYER_PROJ_NORM, "weight"), {n_embd_altup}, 0);
altup_proj = create_tensor(tn(LLM_TENSOR_ALTUP_PROJ, "weight"), {n_embd, n_embd, n_altup - 1}, 0);
altup_unembd_proj = create_tensor(tn(LLM_TENSOR_ALTUP_UNEMBD_PROJ, "weight"), {n_embd, n_embd, n_altup - 1}, 0);
per_layer_tok_embd = create_tensor(tn(LLM_TENSOR_PER_LAYER_TOKEN_EMBD, "weight"), {n_embd_altup * n_layer, n_vocab}, 0);
per_layer_model_proj = create_tensor(tn(LLM_TENSOR_PER_LAYER_MODEL_PROJ, "weight", 0), {n_embd, n_embd_altup * n_layer}, 0);
per_layer_proj_norm = create_tensor(tn(LLM_TENSOR_PER_LAYER_PROJ_NORM, "weight", 0), {n_embd_altup}, 0);
output_norm = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd}, 0);
@@ -4275,9 +4277,9 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);
if (n_embd_per_layer > 0) {
tok_embd_per_layer = create_tensor(tn(LLM_TENSOR_PER_LAYER_TOKEN_EMBD, "weight"), {n_embd_per_layer * n_layer, n_vocab}, 0);
per_layer_model_proj = create_tensor(tn(LLM_TENSOR_PER_LAYER_MODEL_PROJ, "weight"), {n_embd, n_embd_per_layer * n_layer}, 0);
per_layer_proj_norm = create_tensor(tn(LLM_TENSOR_PER_LAYER_PROJ_NORM, "weight"), {n_embd_per_layer}, 0);
per_layer_tok_embd = create_tensor(tn(LLM_TENSOR_PER_LAYER_TOKEN_EMBD, "weight"), {n_embd_per_layer * n_layer, n_vocab}, 0);
per_layer_model_proj = create_tensor(tn(LLM_TENSOR_PER_LAYER_MODEL_PROJ, "weight", 0), {n_embd, n_embd_per_layer * n_layer}, 0);
per_layer_proj_norm = create_tensor(tn(LLM_TENSOR_PER_LAYER_PROJ_NORM, "weight", 0), {n_embd_per_layer}, 0);
}
output_norm = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd}, 0);
+1 -1
View File
@@ -534,9 +534,9 @@ struct llama_model {
struct ggml_tensor * conv1d_b = nullptr;
// gemma3n altup
struct ggml_tensor * tok_embd_per_layer = nullptr;
struct ggml_tensor * altup_proj = nullptr;
struct ggml_tensor * altup_unembd_proj = nullptr;
struct ggml_tensor * per_layer_tok_embd = nullptr;
struct ggml_tensor * per_layer_model_proj = nullptr;
struct ggml_tensor * per_layer_proj_norm = nullptr;
+1
View File
@@ -799,6 +799,7 @@ ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
case LLAMA_FTYPE_MOSTLY_F16: return GGML_TYPE_F16;
case LLAMA_FTYPE_MOSTLY_BF16: return GGML_TYPE_BF16;
case LLAMA_FTYPE_ALL_F32: return GGML_TYPE_F32;
case LLAMA_FTYPE_MOSTLY_Q1_0: return GGML_TYPE_Q1_0;
case LLAMA_FTYPE_MOSTLY_MXFP4_MOE: return GGML_TYPE_MXFP4;
+45 -2
View File
@@ -2325,6 +2325,14 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
if (ml.get_key(LLM_KV_TOKENIZER_ADD_SEP, temp, false)) {
add_sep = temp;
}
// workaround for Gemma 4
// ref: https://github.com/ggml-org/llama.cpp/pull/21500
if (pre_type == LLAMA_VOCAB_PRE_TYPE_GEMMA4 && !add_bos) {
add_bos = true;
LLAMA_LOG_WARN("%s: override '%s' to 'true' for Gemma4\n", __func__, kv(LLM_KV_TOKENIZER_ADD_BOS).c_str());
}
}
// auto-detect special tokens by text
@@ -2550,7 +2558,9 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
|| t.first == "[EOS]" // Kimi-K2
|| t.first == "<|end_of_text|>"
|| t.first == "<end_of_utterance>" // smoldocling
|| t.first == "<turn|>" // gemma4
|| t.first == "<eos>" // gemma4
|| t.first == "<turn|>" // gemma4
|| t.first == "<|tool_response>" // gemma4
|| t.first == "<end▁of▁sentence>" // deepseek-ocr
) {
special_eog_ids.insert(t.second);
@@ -2636,6 +2646,33 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
LLAMA_LOG_WARN("%s: special_eog_ids contains both '<|return|>' and '<|call|>', or '<|calls|>' and '<|flush|>' tokens, removing '<|end|>' token from EOG list\n", __func__);
}
}
// workaround for gemma4 and paddleocr: do not include </s> as an eog token
{
bool has_tool_response = false;
bool has_s = false;
llama_token s_id = LLAMA_TOKEN_NULL;
for (auto tid : special_eog_ids) {
const auto & text = id_to_token[tid].text;
if (text == "<|tool_response>") {
has_tool_response = true;
} else if (text == "</s>") {
has_s = true;
s_id = tid;
}
}
if (has_tool_response && has_s) {
special_eog_ids.erase(s_id);
auto & attr = id_to_token[s_id].attr;
attr = LLAMA_TOKEN_ATTR_NORMAL;
LLAMA_LOG_WARN("%s: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list\n", __func__);
}
}
}
// build special tokens cache
@@ -2804,7 +2841,9 @@ uint8_t llama_vocab::impl::token_to_byte(llama_token id) const {
return strtol(buf.c_str(), NULL, 16);
}
case LLAMA_VOCAB_TYPE_BPE: {
GGML_ABORT("fatal error");
// Gemma4 uses BPE with SPM-style byte fallback tokens (<0xXX>)
auto buf = token_data.text.substr(3, 2);
return strtol(buf.c_str(), NULL, 16);
}
case LLAMA_VOCAB_TYPE_WPM: {
GGML_ABORT("fatal error");
@@ -3285,6 +3324,10 @@ int32_t llama_vocab::impl::token_to_piece(llama_token token, char * buf, int32_t
std::string result = llama_decode_text(token_text);
return _try_copy(result.data(), result.size());
}
if (attr & LLAMA_TOKEN_ATTR_BYTE) {
char byte = (char) token_to_byte(token);
return _try_copy((char*) &byte, 1);
}
break;
}
case LLAMA_VOCAB_TYPE_RWKV: {
+36 -32
View File
@@ -1,5 +1,12 @@
#include "models.h"
// get 2D slice view from a 3D tensor, the idx corresponds to the 3rd dim
static ggml_tensor * ggml_view_2d_slice(ggml_context * ctx0, ggml_tensor * x, int idx) {
GGML_ASSERT(idx < (int) x->ne[2]);
return ggml_view_2d(ctx0, x, x->ne[0], x->ne[1], ggml_row_size(x->type, x->ne[0]),
idx * x->ne[0] * x->ne[1] * ggml_element_size(x));
}
llm_build_gemma3n_iswa::llm_build_gemma3n_iswa(const llama_model & model, const llm_graph_params & params) :
llm_graph_context(params),
model(model),
@@ -22,8 +29,11 @@ llm_build_gemma3n_iswa::llm_build_gemma3n_iswa(const llama_model & model, const
// TODO: is causal == true correct? might need some changes
auto * inp_attn = build_attn_inp_kv_iswa();
// inp_per_layer shape: [n_embd_altup, n_tokens, n_layer]
ggml_tensor * inp_per_layer = project_per_layer_inputs(inpL, get_per_layer_inputs());
ggml_tensor * inp_per_layer = build_inp_per_layer();
ggml_build_forward_expand(gf, inp_per_layer);
// inp_per_layer now has shape: [n_embd_altup, n_tokens, n_layer]
inp_per_layer = project_per_layer_inputs(inpL, inp_per_layer);
// inpL now has only 1 altup, project it to the rest of the altups
// these "added" altups will be concat to the last dim of inpL
@@ -37,8 +47,7 @@ llm_build_gemma3n_iswa::llm_build_gemma3n_iswa(const llama_model & model, const
inpL = ggml_concat(ctx0, inpL, altup_added, 2); // shape: [n_embd, n_tokens, n_altup]
cb(inpL, "inp_stacked", -1);
}
// inpL now has shape: [n_embd, n_tokens, n_altup]
// inp_per_layer now has shape: [n_embd_altup, n_tokens, n_layer]
// inpL now has shape: [n_embd, n_tokens, n_altup]
for (int il = 0; il < n_layer; ++il) {
// this block is made to be closely resemble Gemma3p5DecoderLayer on python code
@@ -49,8 +58,8 @@ llm_build_gemma3n_iswa::llm_build_gemma3n_iswa(const llama_model & model, const
ggml_tensor * predictions = altup_predict(cur, il); // [n_embd, n_tokens, n_altup]
// predicted value will go through self-attention and laurel
ggml_tensor * active_prediction = view_2d_slice(predictions, i_altup_act); // [n_embd, n_tokens]
cur = active_prediction;
ggml_tensor * active_prediction = ggml_view_2d_slice(ctx0, predictions, i_altup_act); // [n_embd, n_tokens]
cur = active_prediction;
cb(cur, "active_prediction", il);
// norm
@@ -151,12 +160,13 @@ llm_build_gemma3n_iswa::llm_build_gemma3n_iswa(const llama_model & model, const
ggml_tensor * first_prediction; // [n_embd, n_tokens]
{
first_prediction = view_2d_slice(corrected, i_altup_act); // [n_embd, n_tokens]
first_prediction = ggml_view_2d_slice(ctx0, corrected, i_altup_act); // [n_embd, n_tokens]
first_prediction = ggml_mul(ctx0, first_prediction, model.layers[il].altup_correct_scale);
first_prediction = build_lora_mm(model.layers[il].per_layer_inp_gate, first_prediction);
first_prediction = ggml_gelu(ctx0, first_prediction); // [n_embd_altup, n_tokens]
cb(first_prediction, "first_prediction_gated", il);
ggml_tensor * inp_this_layer = view_2d_slice(inp_per_layer, il); // [n_embd_altup, n_tokens]
ggml_tensor * inp_this_layer = ggml_view_2d_slice(ctx0, inp_per_layer, il); // [n_embd_altup, n_tokens]
first_prediction = ggml_mul(ctx0, first_prediction, inp_this_layer); // [n_embd_altup, n_tokens]
cb(first_prediction, "first_prediction_scaled", il);
@@ -167,7 +177,7 @@ llm_build_gemma3n_iswa::llm_build_gemma3n_iswa(const llama_model & model, const
}
// equivalent to python code: corrected_predictions[1:] += first_prediction
{
ggml_tensor * slice_first = view_2d_slice(corrected, 0);
ggml_tensor * slice_first = ggml_view_2d_slice(ctx0, corrected, 0);
ggml_tensor * slice_rest = ggml_view_3d(
ctx0, corrected, n_embd, n_tokens, n_altup - 1, ggml_row_size(corrected->type, n_embd),
ggml_row_size(corrected->type, n_embd * n_tokens), n_embd * n_tokens * ggml_element_size(corrected));
@@ -185,7 +195,7 @@ llm_build_gemma3n_iswa::llm_build_gemma3n_iswa(const llama_model & model, const
// cur now has multiple altup(s), we want to merge them back to 1 altup
{
ggml_tensor * target_magnitude = calc_magnitude(view_2d_slice(cur, i_altup_act)); // [n_embd, n_tokens]
ggml_tensor * target_magnitude = calc_magnitude(ggml_view_2d_slice(ctx0, cur, i_altup_act)); // [n_embd, n_tokens]
// do a view to skip the first slice (active altup)
ggml_tensor * alt_slice =
ggml_view_3d(ctx0, cur, n_embd, n_tokens, n_altup - 1, ggml_row_size(cur->type, n_embd),
@@ -197,9 +207,9 @@ llm_build_gemma3n_iswa::llm_build_gemma3n_iswa(const llama_model & model, const
cb(altup_unembd, "altup_unembd", -1);
// equivalent to torch.mean(hidden_states, dim=0)
cur = view_2d_slice(cur, 0); // [n_embd, n_tokens]
cur = ggml_view_2d_slice(ctx0, cur, 0); // [n_embd, n_tokens]
for (int i = 0; i < n_altup - 1; ++i) {
cur = ggml_add(ctx0, cur, view_2d_slice(altup_unembd, i));
cur = ggml_add(ctx0, cur, ggml_view_2d_slice(ctx0, altup_unembd, i));
}
cur = ggml_scale(ctx0, cur, 1.0f / float(n_altup)); // [n_embd, n_tokens]
cb(cur, "unembd_merged", -1);
@@ -235,23 +245,16 @@ ggml_tensor * llm_build_gemma3n_iswa::calc_magnitude(ggml_tensor * x) {
return ggml_sqrt(ctx0, ggml_sum_rows(ctx0, ggml_sqr(ctx0, x)));
}
// get 2D slice view from a 3D tensor, the idx corresponds to the 3rd dim
ggml_tensor * llm_build_gemma3n_iswa::view_2d_slice(ggml_tensor * x, int idx) {
GGML_ASSERT(idx < (int) x->ne[2]);
return ggml_view_2d(ctx0, x, x->ne[0], x->ne[1], ggml_row_size(x->type, x->ne[0]),
idx * x->ne[0] * x->ne[1] * ggml_element_size(x));
}
// equivalent to get_per_layer_inputs() in python code
// output shape: [n_embd_altup, n_layer, n_tokens]
ggml_tensor * llm_build_gemma3n_iswa::get_per_layer_inputs() {
ggml_tensor * llm_build_gemma3n_iswa::build_inp_per_layer() {
auto inp = std::make_unique<llm_graph_input_embd>(n_embd);
ggml_tensor * inp_per_layer;
if (ubatch.token) {
inp->tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, ubatch.n_tokens);
ggml_set_input(inp->tokens);
res->t_inp_tokens = inp->tokens;
inp_per_layer = ggml_get_rows(ctx0, model.tok_embd_per_layer, inp->tokens);
inp_per_layer = ggml_get_rows(ctx0, model.per_layer_tok_embd, inp->tokens);
inp_per_layer = ggml_reshape_3d(ctx0, inp_per_layer, n_embd_altup, n_layer, n_tokens);
inp_per_layer = ggml_scale(ctx0, inp_per_layer, sqrtf((float) n_embd_altup));
cb(inp_per_layer, "inp_per_layer_selected", -1);
@@ -259,10 +262,10 @@ ggml_tensor * llm_build_gemma3n_iswa::get_per_layer_inputs() {
} else {
// Vision embedding path: use padding token (ID=0) embedding
// TODO: verify if this is the correct behavior in transformers implementation
const int64_t embd_size = model.tok_embd_per_layer->ne[0]; // n_embd_altup * n_layer
const int64_t embd_size = model.per_layer_tok_embd->ne[0]; // n_embd_altup * n_layer
// Extract and dequantize padding token embedding (row 0)
ggml_tensor * padding = ggml_view_1d(ctx0, model.tok_embd_per_layer, embd_size, 0);
ggml_tensor * padding = ggml_view_1d(ctx0, model.per_layer_tok_embd, embd_size, 0);
inp_per_layer = ggml_cast(ctx0, padding, GGML_TYPE_F32);
// Reshape to [n_embd_altup, n_layer, 1]
@@ -275,18 +278,19 @@ ggml_tensor * llm_build_gemma3n_iswa::get_per_layer_inputs() {
// equivalent to project_per_layer_inputs() in python code
// this calculates the per-layer inputs, so the final tensor shape will have n_layer as the last dim
// output shape: [n_embd_altup, n_tokens, n_layer]
ggml_tensor * llm_build_gemma3n_iswa::project_per_layer_inputs(ggml_tensor * inputs_embeds, ggml_tensor * inp_per_layer) {
ggml_tensor * llm_build_gemma3n_iswa::project_per_layer_inputs(ggml_tensor * inp_batch, ggml_tensor * inp_per_layer) {
const float per_layer_projection_scale = 1.0f / sqrtf((float) n_embd);
const float per_layer_input_scale = 1.0f / sqrtf(2.0f);
ggml_tensor * per_layer_proj = ggml_mul_mat(ctx0, model.per_layer_model_proj, inputs_embeds);
per_layer_proj = ggml_scale(ctx0, per_layer_proj, per_layer_projection_scale);
per_layer_proj = ggml_reshape_3d(ctx0, per_layer_proj, n_embd_altup, n_layer, n_tokens);
per_layer_proj = build_norm(per_layer_proj, model.per_layer_proj_norm, NULL, LLM_NORM_RMS,
-1); // [n_embd_altup, n_layer, n_tokens]
ggml_tensor * per_layer_proj;
per_layer_proj = ggml_mul_mat (ctx0, model.per_layer_model_proj, inp_batch);
per_layer_proj = ggml_scale (ctx0, per_layer_proj, per_layer_projection_scale);
per_layer_proj = ggml_reshape_3d(ctx0, per_layer_proj, n_embd_altup, n_layer, n_tokens);
per_layer_proj = build_norm(per_layer_proj, model.per_layer_proj_norm, NULL, LLM_NORM_RMS, -1);
cb(per_layer_proj, "per_layer_proj", -1);
inp_per_layer = ggml_add(ctx0, per_layer_proj, inp_per_layer);
inp_per_layer = ggml_add (ctx0, per_layer_proj, inp_per_layer);
inp_per_layer = ggml_scale(ctx0, inp_per_layer, per_layer_input_scale);
cb(inp_per_layer, "inp_per_layer", -1);
@@ -337,7 +341,7 @@ ggml_tensor * llm_build_gemma3n_iswa::altup_compute_router_modalities(ggml_tenso
// input cur shape: [n_embd, n_tokens, n_altup]
// output shape: [n_embd, n_tokens, n_altup]
ggml_tensor * llm_build_gemma3n_iswa::altup_predict(ggml_tensor * cur, int il) {
ggml_tensor * activated = view_2d_slice(cur, i_altup_act); // [n_embd, n_tokens]
ggml_tensor * activated = ggml_view_2d_slice(ctx0, cur, i_altup_act); // [n_embd, n_tokens]
ggml_tensor * modalities = altup_compute_router_modalities(activated, il); // [n_altup, n_tokens]
cb(modalities, "modalities", il);
@@ -365,7 +369,7 @@ ggml_tensor * llm_build_gemma3n_iswa::altup_correct(ggml_tensor * predictions, g
ggml_tensor * modalities = altup_compute_router_modalities(activated, il); // [n_altup, n_tokens]
cb(modalities, "modalities", il);
ggml_tensor * active_prediction = view_2d_slice(predictions, i_altup_act);
ggml_tensor * active_prediction = ggml_view_2d_slice(ctx0, predictions, i_altup_act);
ggml_tensor * innovation = ggml_sub(ctx0, activated, active_prediction); // [n_embd, n_tokens]
cb(innovation, "innovation", il);
+37 -28
View File
@@ -1,5 +1,12 @@
#include "models.h"
// get 2D slice view from a 3D tensor, the idx corresponds to the 3rd dim
static ggml_tensor * ggml_view_2d_slice(ggml_context * ctx0, ggml_tensor * x, int idx) {
GGML_ASSERT(idx < (int) x->ne[2]);
return ggml_view_2d(ctx0, x, x->ne[0], x->ne[1], ggml_row_size(x->type, x->ne[0]),
idx * x->ne[0] * x->ne[1] * ggml_element_size(x));
}
llm_build_gemma4_iswa::llm_build_gemma4_iswa(const llama_model & model, const llm_graph_params & params) :
llm_graph_context(params),
model(model),
@@ -19,14 +26,17 @@ llm_build_gemma4_iswa::llm_build_gemma4_iswa(const llama_model & model, const ll
// TODO: is causal == true correct? might need some changes
auto * inp_attn = build_attn_inp_kv_iswa();
// inp_per_layer shape: [n_embd_per_layer, n_tokens, n_layer]
ggml_tensor * inp_per_layer = nullptr;
if (model.tok_embd_per_layer) {
inp_per_layer = project_per_layer_inputs(inpL, get_per_layer_inputs());
}
ggml_tensor * inp_out_ids = build_inp_out_ids();
ggml_tensor * inp_per_layer = nullptr;
if (model.per_layer_tok_embd) {
inp_per_layer = build_inp_per_layer();
ggml_build_forward_expand(gf, inp_per_layer);
// inp_per_layer shape: [n_embd_per_layer, n_tokens, n_layer]
inp_per_layer = project_per_layer_inputs(inpL, inp_per_layer);
}
for (int il = 0; il < n_layer; ++il) {
const int64_t n_embd_head = hparams.n_embd_head_k(il);
GGML_ASSERT(n_embd_head == hparams.n_embd_head_v(il));
@@ -196,7 +206,8 @@ llm_build_gemma4_iswa::llm_build_gemma4_iswa(const llama_model & model, const ll
cur = build_lora_mm(model.layers[il].per_layer_inp_gate, cur); // [n_embd_per_layer, n_tokens]
cur = ggml_gelu(ctx0, cur);
ggml_tensor * inp_this_layer = view_2d_slice(inp_per_layer, il); // [n_embd_per_layer, n_tokens]
ggml_tensor * inp_this_layer = ggml_view_2d_slice(ctx0, inp_per_layer, il); // [n_embd_per_layer, n_tokens]
// TODO @ngxson : improve this
if (il == n_layer - 1 && inp_out_ids) {
@@ -248,34 +259,30 @@ llm_build_gemma4_iswa::llm_build_gemma4_iswa(const llama_model & model, const ll
ggml_build_forward_expand(gf, cur);
}
// get 2D slice view from a 3D tensor, the idx corresponds to the 3rd dim
ggml_tensor * llm_build_gemma4_iswa::view_2d_slice(ggml_tensor * x, int idx) {
GGML_ASSERT(idx < (int) x->ne[2]);
return ggml_view_2d(ctx0, x, x->ne[0], x->ne[1], ggml_row_size(x->type, x->ne[0]),
idx * x->ne[0] * x->ne[1] * ggml_element_size(x));
}
// equivalent to get_per_layer_inputs() in python code
// output shape: [n_embd_per_layer, n_layer, n_tokens]
ggml_tensor * llm_build_gemma4_iswa::get_per_layer_inputs() {
ggml_tensor * llm_build_gemma4_iswa::build_inp_per_layer() {
auto inp = std::make_unique<llm_graph_input_embd>(n_embd);
ggml_tensor * inp_per_layer;
if (ubatch.token) {
inp->tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, ubatch.n_tokens);
ggml_set_input(inp->tokens);
res->t_inp_tokens = inp->tokens;
inp_per_layer = ggml_get_rows(ctx0, model.tok_embd_per_layer, inp->tokens);
inp_per_layer = ggml_get_rows (ctx0, model.per_layer_tok_embd, inp->tokens);
inp_per_layer = ggml_reshape_3d(ctx0, inp_per_layer, n_embd_per_layer, n_layer, n_tokens);
inp_per_layer = ggml_scale(ctx0, inp_per_layer, sqrtf((float) n_embd_per_layer));
inp_per_layer = ggml_scale (ctx0, inp_per_layer, sqrtf((float) n_embd_per_layer));
cb(inp_per_layer, "inp_per_layer_selected", -1);
res->add_input(std::move(inp));
} else {
// Vision embedding path: use padding token (ID=0) embedding
// TODO: verify if this is the correct behavior in transformers implementation
const int64_t embd_size = model.tok_embd_per_layer->ne[0]; // n_embd_per_layer * n_layer
const int64_t embd_size = model.per_layer_tok_embd->ne[0]; // n_embd_per_layer * n_layer
// Extract and dequantize padding token embedding (row 0)
ggml_tensor * padding = ggml_view_1d(ctx0, model.tok_embd_per_layer, embd_size, 0);
ggml_tensor * padding = ggml_view_1d(ctx0, model.per_layer_tok_embd, embd_size, 0);
inp_per_layer = ggml_cast(ctx0, padding, GGML_TYPE_F32);
// Reshape to [n_embd_per_layer, n_layer, 1]
@@ -287,21 +294,23 @@ ggml_tensor * llm_build_gemma4_iswa::get_per_layer_inputs() {
// equivalent to project_per_layer_inputs() in python code
// this calculates the per-layer inputs, so the final tensor shape will have n_layer as the last dim
// inputs_embeds shape: [n_embd, n_tokens]
// inp_per_layer shape: [n_embd_per_layer, n_layer, n_tokens] (from get_per_layer_inputs)
// inp_batch shape: [n_embd, n_tokens]
// inp_per_layer shape: [n_embd_per_layer, n_layer, n_tokens] (from build_inp_per_layer)
// output shape: [n_embd_per_layer, n_tokens, n_layer]
ggml_tensor * llm_build_gemma4_iswa::project_per_layer_inputs(ggml_tensor * inputs_embeds, ggml_tensor * inp_per_layer) {
ggml_tensor * llm_build_gemma4_iswa::project_per_layer_inputs(ggml_tensor * inp_batch, ggml_tensor * inp_per_layer) {
const float per_layer_projection_scale = 1.0f / sqrtf((float) n_embd);
const float per_layer_input_scale = 1.0f / sqrtf(2.0f);
ggml_tensor * per_layer_proj = ggml_mul_mat(ctx0, model.per_layer_model_proj, inputs_embeds);
per_layer_proj = ggml_scale(ctx0, per_layer_proj, per_layer_projection_scale);
per_layer_proj = ggml_reshape_3d(ctx0, per_layer_proj, n_embd_per_layer, n_layer, n_tokens);
per_layer_proj = build_norm(per_layer_proj, model.per_layer_proj_norm, nullptr, LLM_NORM_RMS,
-1); // [n_embd_per_layer, n_layer, n_tokens]
// note: this matrix multiplication will be performed in the input layer (i.e. on the CPU)
ggml_tensor * per_layer_proj;
per_layer_proj = ggml_mul_mat (ctx0, model.per_layer_model_proj, inp_batch);
per_layer_proj = ggml_scale (ctx0, per_layer_proj, per_layer_projection_scale);
per_layer_proj = ggml_reshape_3d(ctx0, per_layer_proj, n_embd_per_layer, n_layer, n_tokens);
per_layer_proj = build_norm(per_layer_proj, model.per_layer_proj_norm, nullptr, LLM_NORM_RMS, -1);
cb(per_layer_proj, "per_layer_proj", -1);
inp_per_layer = ggml_add(ctx0, per_layer_proj, inp_per_layer);
inp_per_layer = ggml_add (ctx0, per_layer_proj, inp_per_layer);
inp_per_layer = ggml_scale(ctx0, inp_per_layer, per_layer_input_scale);
cb(inp_per_layer, "inp_per_layer", -1);
+9 -6
View File
@@ -256,9 +256,11 @@ struct llm_build_gemma3n_iswa : public llm_graph_context {
llm_build_gemma3n_iswa(const llama_model & model, const llm_graph_params & params);
ggml_tensor * calc_magnitude(ggml_tensor * x);
ggml_tensor * view_2d_slice(ggml_tensor * x, int idx);
ggml_tensor * get_per_layer_inputs();
ggml_tensor * project_per_layer_inputs(ggml_tensor * inputs_embeds, ggml_tensor * inp_per_layer);
// TODO: refactor in common "per-layer" functionality [TAG_PER_LAYER]
ggml_tensor * build_inp_per_layer();
ggml_tensor * project_per_layer_inputs(ggml_tensor * inp_batch, ggml_tensor * inp_per_layer);
ggml_tensor * gaussian_topk(ggml_tensor * x);
ggml_tensor * altup_compute_router_modalities(ggml_tensor * x, int il);
ggml_tensor * altup_predict(ggml_tensor * cur, int il);
@@ -272,9 +274,10 @@ struct llm_build_gemma4_iswa : public llm_graph_context {
const int64_t n_embd_per_layer;
llm_build_gemma4_iswa(const llama_model & model, const llm_graph_params & params);
ggml_tensor * view_2d_slice(ggml_tensor * x, int idx);
ggml_tensor * get_per_layer_inputs();
ggml_tensor * project_per_layer_inputs(ggml_tensor * inputs_embeds, ggml_tensor * inp_per_layer);
// TODO: refactor in common "per-layer" functionality [TAG_PER_LAYER]
ggml_tensor * build_inp_per_layer();
ggml_tensor * project_per_layer_inputs(ggml_tensor * inp_batch, ggml_tensor * inp_per_layer);
};
struct llm_build_gemma_embedding : public llm_graph_context {
+169 -1
View File
@@ -470,6 +470,141 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
return bpe_offsets;
}
// Qwen2 system regex: "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
static std::vector<size_t> unicode_regex_split_custom_qwen2(const std::string & text, const std::vector<size_t> & offsets) {
std::vector<size_t> bpe_offsets; // store the offset of each word
bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
const auto cpts = unicode_cpts_from_utf8(text);
size_t start = 0;
for (auto offset : offsets) {
const size_t offset_ini = start;
const size_t offset_end = start + offset;
assert(offset_end <= cpts.size());
start = offset_end;
static const uint32_t OUT_OF_RANGE = 0xFFFFFFFF;
auto _get_cpt = [&] (const size_t pos) -> uint32_t {
return (offset_ini <= pos && pos < offset_end) ? cpts[pos] : OUT_OF_RANGE;
};
auto _get_flags = [&] (const size_t pos) -> unicode_cpt_flags {
return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags_from_cpt(cpts[pos]) : unicode_cpt_flags{};
};
size_t _prev_end = offset_ini;
auto _add_token = [&] (const size_t end) -> size_t {
assert(_prev_end <= end && end <= offset_end);
size_t len = end - _prev_end;
if (len > 0) {
bpe_offsets.push_back(len);
}
_prev_end = end;
//if (len > 0) {
// std::string s = "";
// for(size_t p = end-len; p < end; p++)
// s += unicode_cpt_to_utf8(cpts[p]);
// printf(">>> '%s'\n", s.c_str());
//}
return len;
};
for (size_t pos = offset_ini; pos < offset_end; /*pos++*/ ) {
const uint32_t cpt = _get_cpt(pos);
const auto flags = _get_flags(pos);
// regex: (?i:'s|'t|'re|'ve|'m|'ll|'d) // case insensitive
if (cpt == '\'' && pos+1 < offset_end) {
uint32_t cpt_next = unicode_tolower(_get_cpt(pos+1));
if (cpt_next == 's' || cpt_next == 't' || cpt_next == 'm' || cpt_next == 'd') {
pos += _add_token(pos+2);
continue;
}
if (pos+2 < offset_end) {
uint32_t cpt_next_next = unicode_tolower(_get_cpt(pos+2));
if ((cpt_next == 'r' && cpt_next_next == 'e') ||
(cpt_next == 'v' && cpt_next_next == 'e') ||
(cpt_next == 'l' && cpt_next_next == 'l')) {
pos += _add_token(pos+3);
continue;
}
}
}
// regex: [^\r\n\p{L}\p{N}]?\p{L}+
if (!(cpt == '\r' || cpt == '\n' || flags.is_number)) {
if (flags.is_letter || _get_flags(pos+1).is_letter) { // one or more letters
pos++;
while (_get_flags(pos).is_letter) {
pos++;
}
_add_token(pos);
continue;
}
}
// regex: \p{N}
if (flags.is_number) {
pos++;
_add_token(pos);
continue;
}
// regex: <space>?[^\s\p{L}\p{N}]+[\r\n]*
auto flags2 = (cpt == ' ' ? _get_flags(pos+1) : flags);
if (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags.as_uint()) {
pos += (cpt == ' ');
while (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags2.as_uint()) {
flags2 = _get_flags(++pos);
}
uint32_t cpt2 = _get_cpt(pos);
while (cpt2 == '\r' || cpt2 == '\n') {
cpt2 = _get_cpt(++pos);
}
_add_token(pos);
continue;
}
size_t num_whitespaces = 0;
size_t last_end_r_or_n = 0;
while (_get_flags(pos+num_whitespaces).is_whitespace) {
uint32_t cpt2 = _get_cpt(pos+num_whitespaces);
if (cpt2 == '\r' || cpt2 == '\n') {
last_end_r_or_n = pos + num_whitespaces + 1;
}
num_whitespaces++;
}
// regex: \s*[\r\n]+
if (last_end_r_or_n > 0) {
pos = last_end_r_or_n;
_add_token(pos);
continue;
}
// regex: \s+(?!\S)
if (num_whitespaces > 1 && _get_cpt(pos+num_whitespaces) != OUT_OF_RANGE) {
pos += num_whitespaces - 1;
_add_token(pos);
continue;
}
// regex: \s+
if (num_whitespaces > 0) {
pos += num_whitespaces;
_add_token(pos);
continue;
}
// no matches
_add_token(++pos);
}
}
return bpe_offsets;
}
template <typename CharT>
static std::vector<size_t> unicode_regex_split_stl(const std::basic_string<CharT> & text, const std::basic_string<CharT> & regex, const std::vector<size_t> & offsets) {
using BidirIt = typename std::basic_string<CharT>::const_iterator;
@@ -753,6 +888,35 @@ static std::vector<size_t> unicode_regex_split_custom_afmoe(const std::string &
return bpe_offsets;
}
// regex: [^\n]+|[\n]+
// splits text into runs of non-newline characters and runs of newline characters
static std::vector<size_t> unicode_regex_split_custom_newlines(const std::string & text, const std::vector<size_t> & offsets) {
std::vector<size_t> bpe_offsets;
bpe_offsets.reserve(offsets.size());
const auto cpts = unicode_cpts_from_utf8(text);
size_t start = 0;
for (auto offset : offsets) {
const size_t offset_ini = start;
const size_t offset_end = start + offset;
assert(offset_end <= cpts.size());
start = offset_end;
size_t pos = offset_ini;
while (pos < offset_end) {
const bool is_newline = (cpts[pos] == '\n');
const size_t run_start = pos;
while (pos < offset_end && (cpts[pos] == '\n') == is_newline) {
pos++;
}
bpe_offsets.push_back(pos - run_start);
}
}
return bpe_offsets;
}
static std::vector<size_t> unicode_regex_split_custom(const std::string & text, const std::string & regex_expr, const std::vector<size_t> & offsets) {
std::vector<size_t> bpe_offsets;
@@ -761,14 +925,18 @@ static std::vector<size_t> unicode_regex_split_custom(const std::string & text,
} else if (
regex_expr == "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" ||
regex_expr == "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+") {
bpe_offsets = unicode_regex_split_custom_llama3(text, offsets);
} else if (
regex_expr == "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+") {
bpe_offsets = unicode_regex_split_custom_qwen2(text, offsets);
} else if (regex_expr == "\\p{Han}+") {
// K2's first pattern - handle all K2 patterns together
bpe_offsets = unicode_regex_split_custom_kimi_k2(text, offsets);
} else if (regex_expr == "\\p{AFMoE_digits}") {
// AFMOE digit pattern - use custom implementation for proper splitting
bpe_offsets = unicode_regex_split_custom_afmoe(text, offsets);
} else if (regex_expr == "[^\\n]+|[\\n]+") {
bpe_offsets = unicode_regex_split_custom_newlines(text, offsets);
} else if (regex_expr == "\\d{1,3}(?=(?:\\d{3})*\\b)") {
// tiny_aya digit grouping pattern from tokenizer.json:
// {"type": "Split", "pattern": {"Regex": "\\d{1,3}(?=(?:\\d{3})*\\b)"}, "behavior": "Isolated"}
-10
View File
@@ -1,10 +0,0 @@
import { readFileSync } from "fs"
import { SchemaConverter } from "../tools/server/public_legacy/json-schema-to-grammar.mjs"
const [, , file] = process.argv
const url = `file://${file}`
let schema = JSON.parse(readFileSync(file, "utf8"));
const converter = new SchemaConverter({})
schema = await converter.resolveRefs(schema, url)
converter.visit(schema, '')
console.log(converter.formatGrammar())
+2 -35
View File
@@ -3129,39 +3129,6 @@ struct test_add_id : public test_case {
}
};
// GGML_OP_ADD1
struct test_add1 : public test_case {
const ggml_type type;
const std::array<int64_t, 4> ne;
std::string vars() override {
return VARS_TO_STR2(type, ne);
}
test_add1(ggml_type type = GGML_TYPE_F32,
std::array<int64_t, 4> ne = {10, 5, 4, 3})
: type(type), ne(ne) {}
ggml_tensor * build_graph(ggml_context * ctx) override {
ggml_tensor * a = ggml_new_tensor(ctx, type, 4, ne.data());
ggml_set_param(a);
ggml_set_name(a, "a");
ggml_tensor * b = ggml_new_tensor_1d(ctx, type, 1);
// ggml_set_param(b); // TODO: implement
ggml_set_name(b, "b");
ggml_tensor * out = ggml_add1(ctx, a, b);
ggml_set_name(out, "out");
return out;
}
float grad_eps() override {
return 0.1f * ne[0]*ne[1]*ne[2]*ne[3];
}
};
// GGML_OP_SCALE
struct test_scale : public test_case {
const ggml_type type;
@@ -7284,6 +7251,7 @@ static const ggml_type all_types[] = {
GGML_TYPE_Q4_0, GGML_TYPE_Q4_1,
GGML_TYPE_Q5_0, GGML_TYPE_Q5_1,
GGML_TYPE_Q8_0,
GGML_TYPE_Q1_0,
GGML_TYPE_MXFP4, GGML_TYPE_NVFP4,
GGML_TYPE_Q2_K, GGML_TYPE_Q3_K,
GGML_TYPE_Q4_K, GGML_TYPE_Q5_K,
@@ -7308,6 +7276,7 @@ static const ggml_type other_types[] = {
GGML_TYPE_Q4_1,
GGML_TYPE_Q5_0, GGML_TYPE_Q5_1,
GGML_TYPE_Q8_0,
GGML_TYPE_Q1_0,
GGML_TYPE_Q2_K, GGML_TYPE_Q3_K,
GGML_TYPE_Q5_K,
GGML_TYPE_Q6_K,
@@ -7886,8 +7855,6 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
test_cases.emplace_back(new test_bin_bcast(ggml_add, GGML_TYPE_F32, {16, 5, 4, 3}, {2, 2, 2, 2}, 8));
test_cases.emplace_back(new test_bin_bcast(ggml_add, GGML_TYPE_F32, {16, 5, 4, 3}, {1, 1, 1, 1}, 16));
test_cases.emplace_back(new test_add1());
test_cases.emplace_back(new test_add1(GGML_TYPE_F32, {1024, 1024, 1, 1}));
test_cases.emplace_back(new test_scale());
test_cases.emplace_back(new test_scale(GGML_TYPE_F32, {10, 10, 10, 10}, 2.0f, 1.0f));
test_cases.emplace_back(new test_scale(GGML_TYPE_F32, {10, 10, 10, 10}, 2.0f, 1.0f, true)); // inplace test
+88 -45
View File
@@ -998,6 +998,7 @@ static void test_peg_parser(common_chat_templates * tmpls,
auto parser = make_peg_parser(tmpls, tc.params, detailed_debug);
if (detailed_debug) {
LOG_DBG("Using parser: \n%s\n", parser.arena_.dump(parser.arena_.root()).c_str());
LOG_DBG("Generation prompt: '%s'\n", parser.params_.generation_prompt.c_str());
}
common_chat_msg msg_accum;
@@ -1976,10 +1977,24 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
{
// Google Gemma 4 (tool calling with Gemma4 dict format)
auto tst = peg_tester("models/templates/gemma4.jinja");
auto tst = peg_tester("models/templates/google-gemma-4-31B-it.jinja");
tst.test("Hello, world!").expect(simple_assist_msg("Hello, world!")).run();
// Reasoning and content
tst.test(
"<|channel>thought\nI'm\nthinking<channel|>Hello, world!\nWhat's up?")
.reasoning_format(COMMON_REASONING_FORMAT_AUTO)
.expect(message_assist_thoughts)
.run();
// Reasoning and content with reasoning_format = none
tst.test(
"<|channel>thought\nI'm\nthinking<channel|>Hello, world!\nWhat's up?")
.reasoning_format(COMMON_REASONING_FORMAT_NONE)
.expect_content("<|channel>thought\nI'm\nthinking<channel|>Hello, world!\nWhat's up?")
.run();
// Simple tool call with string argument
tst.test(
"<|tool_call>call:get_time{city:<|\"|>London<|\"|>}<tool_call|>")
@@ -2654,55 +2669,57 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
// #20424 introduced effective_input = generation_prompt + input, but the throw
// uses input.substr(result.end) where result.end is in effective_input space.
{
auto tmpls = common_chat_templates_ptr(
common_chat_templates_init(nullptr, read_file("models/templates/GLM-4.7-Flash.jinja")));
if (!g_template_filter.empty() && std::string("models/templates/GLM-4.7-Flash.jinja").find(g_template_filter) != std::string::npos) {
auto tmpls = common_chat_templates_ptr(
common_chat_templates_init(nullptr, read_file("models/templates/GLM-4.7-Flash.jinja")));
static common_chat_tool weather_tool{
"get_weather", "Get weather",
R"({"type":"object","properties":{"city":{"type":"string"}},"required":["city"]})",
};
static common_chat_tool weather_tool{
"get_weather", "Get weather",
R"({"type":"object","properties":{"city":{"type":"string"}},"required":["city"]})",
};
common_chat_templates_inputs inputs;
inputs.tools = { weather_tool };
inputs.enable_thinking = true;
inputs.reasoning_format = COMMON_REASONING_FORMAT_AUTO;
inputs.add_generation_prompt = true;
inputs.use_jinja = true;
common_chat_msg msg;
msg.role = "user";
msg.content = "get_weather";
inputs.messages = { msg };
common_chat_templates_inputs inputs;
inputs.tools = { weather_tool };
inputs.enable_thinking = true;
inputs.reasoning_format = COMMON_REASONING_FORMAT_AUTO;
inputs.add_generation_prompt = true;
inputs.use_jinja = true;
common_chat_msg msg;
msg.role = "user";
msg.content = "get_weather";
inputs.messages = { msg };
auto params = common_chat_templates_apply(tmpls.get(), inputs);
common_peg_arena arena;
arena.load(params.parser);
common_chat_parser_params pp(params);
auto params = common_chat_templates_apply(tmpls.get(), inputs);
common_peg_arena arena;
arena.load(params.parser);
common_chat_parser_params pp(params);
// generation_prompt is non-empty for thinking models, so result.end
// will be offset by generation_prompt.size() into effective_input space.
assert(!pp.generation_prompt.empty());
// generation_prompt is non-empty for thinking models, so result.end
// will be offset by generation_prompt.size() into effective_input space.
assert(!pp.generation_prompt.empty());
std::string bad_input =
"Thinking.\n"
"</think>"
"<tool_call>get_weather"
"<arg_key>city</arg_key><arg_value>Tokyo</arg_value>"
"</tool_call>\n";
std::string bad_input =
"Thinking.\n"
"</think>"
"<tool_call>get_weather"
"<arg_key>city</arg_key><arg_value>Tokyo</arg_value>"
"</tool_call>\n";
bool got_runtime_error = false;
bool got_out_of_range = false;
std::string error_msg;
try {
common_chat_peg_parse(arena, bad_input, /*is_partial=*/false, pp);
} catch (const std::out_of_range & e) {
got_out_of_range = true;
error_msg = e.what();
} catch (const std::runtime_error & e) {
got_runtime_error = true;
error_msg = e.what();
bool got_runtime_error = false;
bool got_out_of_range = false;
std::string error_msg;
try {
common_chat_peg_parse(arena, bad_input, /*is_partial=*/false, pp);
} catch (const std::out_of_range & e) {
got_out_of_range = true;
error_msg = e.what();
} catch (const std::runtime_error & e) {
got_runtime_error = true;
error_msg = e.what();
}
GGML_ASSERT(!got_out_of_range && "throw path crashed with out_of_range (input.substr in effective_input space)");
GGML_ASSERT(got_runtime_error && "throw path should produce std::runtime_error with parse position");
}
GGML_ASSERT(!got_out_of_range && "throw path crashed with out_of_range (input.substr in effective_input space)");
GGML_ASSERT(got_runtime_error && "throw path should produce std::runtime_error with parse position");
}
// Kimi-K2-Thinking tests - custom parser
@@ -3086,8 +3103,19 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
// Format: <minimax:tool_call><invoke name="func"><parameter name="key">value</parameter></invoke></minimax:tool_call>
{
auto tst = peg_tester("models/templates/MiniMax-M2.jinja", detailed_debug);
tst.test("</think>Hello, world!\nWhat's up?").enable_thinking(true).reasoning_format(COMMON_REASONING_FORMAT_AUTO).expect(message_assist).run();
tst.test("I'm\nthinking</think>Hello, world!\nWhat's up?").enable_thinking(true).reasoning_format(COMMON_REASONING_FORMAT_AUTO).expect(message_assist_thoughts).run();
tst.test("Let's call a tool:</think><minimax:tool_call>\n<invoke name=\"empty_args\">\n</invoke>\n</minimax:tool_call>").
enable_thinking(true).
reasoning_format(COMMON_REASONING_FORMAT_AUTO).
tools({ empty_args_tool }).
expect(message_with_reasoning_and_tool_call("Let's call a tool:", "empty_args", "{}")).
run();
tst.test(
"<minimax:tool_call>\n<invoke name=\"special_function\">\n<parameter "
"</think><minimax:tool_call>\n<invoke name=\"special_function\">\n<parameter "
"name=\"arg1\">1</parameter>\n</invoke>\n</minimax:tool_call>")
.tools({ special_function_tool })
.expect(message_assist_call)
@@ -3282,6 +3310,21 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
.expect(message_assist_call_id)
.expect_reconstruction()
.run();
tst.test("[TOOL_CALLS]special_function[CALL_ID]000000001[ARGS]{\"arg1\": 1}"
"[TOOL_CALLS]special_function_with_opt[CALL_ID]000000002[ARGS]{\"arg1\": 1, \"arg2\": 2}")
.parallel_tool_calls(true)
.tools({
special_function_tool, special_function_tool_with_optional_param
})
.expect_tool_calls({
{ "special_function", R"({"arg1": 1})", "000000001" },
{ "special_function_with_opt", R"({"arg1": 1, "arg2": 2})", "000000002" },
})
.expect_reconstruction()
.run();
}
// Devstral
{
@@ -3411,7 +3454,7 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
},
"replaceAll": {
"type": "boolean",
"description": "Whether to replace all occurences."
"description": "Whether to replace all occurrences."
}
},
"required": ["oldString", "newString"]
-11
View File
@@ -1579,17 +1579,6 @@ int main() {
} else {
fprintf(stderr, "\033[33mWARNING: Python not found (min version required is 3.8), skipping Python JSON schema -> grammar tests.\n\033[0m");
}
if (getenv("LLAMA_NODE_AVAILABLE") || (std::system("node --version") == 0)) {
test_all("JavaScript", [](const TestCase & tc) {
write("test-json-schema-input.tmp", tc.schema);
tc.verify_status(std::system(
"node ./tests/run-json-schema-to-grammar.mjs test-json-schema-input.tmp > test-grammar-output.tmp") == 0 ? SUCCESS : FAILURE);
tc.verify(read("test-grammar-output.tmp"));
});
} else {
fprintf(stderr, "\033[33mWARNING: Node not found, skipping JavaScript JSON schema -> grammar tests.\n\033[0m");
}
}
test_all("Check Expectations Validity", [](const TestCase & tc) {
+5
View File
@@ -16,6 +16,7 @@
constexpr float MAX_QUANTIZATION_REFERENCE_ERROR = 0.0001f;
constexpr float MAX_QUANTIZATION_TOTAL_ERROR = 0.002f;
constexpr float MAX_QUANTIZATION_TOTAL_ERROR_BINARY = 0.025f;
constexpr float MAX_QUANTIZATION_TOTAL_ERROR_TERNARY = 0.01f;
constexpr float MAX_QUANTIZATION_TOTAL_ERROR_2BITS = 0.0075f;
constexpr float MAX_QUANTIZATION_TOTAL_ERROR_3BITS = 0.0040f;
@@ -24,6 +25,7 @@ constexpr float MAX_QUANTIZATION_TOTAL_ERROR_FP4 = 0.0030f;
constexpr float MAX_DOT_PRODUCT_ERROR = 0.02f;
constexpr float MAX_DOT_PRODUCT_ERROR_LOWBIT = 0.04f;
constexpr float MAX_DOT_PRODUCT_ERROR_FP4 = 0.03f;
constexpr float MAX_DOT_PRODUCT_ERROR_BINARY = 0.40f;
constexpr float MAX_DOT_PRODUCT_ERROR_TERNARY = 0.15f;
static const char* RESULT_STR[] = {"ok", "FAILED"};
@@ -145,6 +147,7 @@ int main(int argc, char * argv[]) {
if (qfns_cpu->from_float && qfns->to_float) {
const float total_error = total_quantization_error(qfns, qfns_cpu, test_size, test_data.data());
const float max_quantization_error =
type == GGML_TYPE_Q1_0 ? MAX_QUANTIZATION_TOTAL_ERROR_BINARY :
type == GGML_TYPE_TQ1_0 ? MAX_QUANTIZATION_TOTAL_ERROR_TERNARY :
type == GGML_TYPE_TQ2_0 ? MAX_QUANTIZATION_TOTAL_ERROR_TERNARY :
type == GGML_TYPE_Q2_K ? MAX_QUANTIZATION_TOTAL_ERROR_2BITS :
@@ -170,6 +173,8 @@ int main(int argc, char * argv[]) {
const float max_allowed_error = type == GGML_TYPE_Q2_K || type == GGML_TYPE_IQ2_XS || type == GGML_TYPE_IQ2_XXS ||
type == GGML_TYPE_IQ3_XXS || type == GGML_TYPE_IQ3_S || type == GGML_TYPE_IQ2_S
? MAX_DOT_PRODUCT_ERROR_LOWBIT
: type == GGML_TYPE_Q1_0
? MAX_DOT_PRODUCT_ERROR_BINARY
: type == GGML_TYPE_TQ1_0 || type == GGML_TYPE_TQ2_0
? MAX_DOT_PRODUCT_ERROR_TERNARY
: type == GGML_TYPE_NVFP4
+2 -2
View File
@@ -176,8 +176,8 @@
| `-rea, --reasoning [on\|off\|auto]` | Use reasoning/thinking in the chat ('on', 'off', or 'auto', default: 'auto' (detect from template))<br/>(env: LLAMA_ARG_REASONING) |
| `--reasoning-budget N` | token budget for thinking: -1 for unrestricted, 0 for immediate end, N>0 for token budget (default: -1)<br/>(env: LLAMA_ARG_THINK_BUDGET) |
| `--reasoning-budget-message MESSAGE` | message injected before the end-of-thinking tag when reasoning budget is exhausted (default: none)<br/>(env: LLAMA_ARG_THINK_BUDGET_MESSAGE) |
| `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
| `--chat-template-file JINJA_TEMPLATE_FILE` | set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |
| `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
| `--chat-template-file JINJA_TEMPLATE_FILE` | set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |
| `--skip-chat-parsing, --no-skip-chat-parsing` | force a pure content parser, even if a Jinja template is specified; model will output everything in the content section, including any reasoning and/or tool calls (default: disabled)<br/>(env: LLAMA_ARG_SKIP_CHAT_PARSING) |
| `--simple-io` | use basic IO for better compatibility in subprocesses and limited consoles |
| `--draft, --draft-n, --draft-max N` | number of tokens to draft for speculative decoding (default: 16)<br/>(env: LLAMA_ARG_DRAFT_MAX) |
+2 -2
View File
@@ -255,8 +255,8 @@ llama-completion.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
| `-rea, --reasoning [on\|off\|auto]` | Use reasoning/thinking in the chat ('on', 'off', or 'auto', default: 'auto' (detect from template))<br/>(env: LLAMA_ARG_REASONING) |
| `--reasoning-budget N` | token budget for thinking: -1 for unrestricted, 0 for immediate end, N>0 for token budget (default: -1)<br/>(env: LLAMA_ARG_THINK_BUDGET) |
| `--reasoning-budget-message MESSAGE` | message injected before the end-of-thinking tag when reasoning budget is exhausted (default: none)<br/>(env: LLAMA_ARG_THINK_BUDGET_MESSAGE) |
| `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
| `--chat-template-file JINJA_TEMPLATE_FILE` | set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |
| `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
| `--chat-template-file JINJA_TEMPLATE_FILE` | set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |
| `--skip-chat-parsing, --no-skip-chat-parsing` | force a pure content parser, even if a Jinja template is specified; model will output everything in the content section, including any reasoning and/or tool calls (default: disabled)<br/>(env: LLAMA_ARG_SKIP_CHAT_PARSING) |
| `--simple-io` | use basic IO for better compatibility in subprocesses and limited consoles |

Some files were not shown because too many files have changed in this diff Show More