llama.cpp/tools/mtmd/clip-impl.h
Gabe Goodhart 64086f2b2f model, mtmd: Granite4 Vision (#23545)
* feat(convert): Get language model conversion working for 4.1 vision

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0)

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Plumb python-side vision projector names and mappings

There are several awkward things here:

1. Most of these are essentially identical to the audio qformer tensors. On
the c++ side, that's mapped using the prefix, so the rest of the GGUF
name needs to align, but on the python side there's no prefix notion, so
they all get duplicated.
2. There are a couple of net-new tensors for vision, in particular
PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as
belonging to the qformer portion, but the GGUF name is simply proj_norm
which conflicts with the ideal name for this new PROJ_NORM that is not
qualified as part of the qformer. To get around this, I used
"proj_layernorm" as the GGUF name.
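
As a rough illustration of the prefix mapping described in item 1, the
sketch below expands two of the qformer templates from clip-impl.h for
different modalities. The helper `tn_format` is hypothetical (it is not the
project's own formatter), and the prefixes "a" and "v" are assumed from the
other tensor names in that header.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Templates copied from clip-impl.h.
#define TN_QF_PROJ_NORM   "%s.proj_norm.%s"
#define TN_QF_SELF_ATTN_Q "%s.proj_blk.%d.self_attn_q.%s"

// Hypothetical helper: expand a printf-style tensor-name template.
template <typename... Args>
static std::string tn_format(const char * fmt, Args... args) {
    // First pass measures the formatted length, second pass writes it.
    const int n = std::snprintf(nullptr, 0, fmt, args...);
    if (n < 0) {
        return std::string();
    }
    std::string out(static_cast<std::size_t>(n) + 1, '\0');
    std::snprintf(&out[0], out.size(), fmt, args...);
    out.resize(static_cast<std::size_t>(n)); // drop the trailing NUL
    return out;
}
```

With this helper, `tn_format(TN_QF_SELF_ATTN_Q, "a", 0, "weight")` gives
`a.proj_blk.0.self_attn_q.weight`, and swapping the prefix for "v" gives the
vision name with the same suffix, which is why the rest of the GGUF name
has to line up across modalities.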

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add python side architecture name

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add python-side plumbing for setting FEATURE_LAYERS hparam

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add c++ side tensor naming defines

NOTE: Usage of these hasn't been updated to include prefix yet

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(mtmd): Convert vision_feature_layer to an ordered vector

We need to preserve the ordering of these feature index values so that they
can be mapped to the sub-tensors within the stacked projectors.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(mtmd): Add architecture label plumbing

Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(wip): Add partial conversion for mmproj

This handles stacking the projector tensors and setting the new hparams

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add gguf_writer and constant support for new hparams and deepstack layer arr

Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Full conversion for mmproj w/ tensor mappings

Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add lm_head skip for mmproj for 4.0

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: De-alias text_config architecture in convert_lora_to_gguf.py

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add --trust-remote-code arg to convert_lora_to_gguf.py

This defaults to False, but allows a user to enable it programmatically
instead of using the interactive prompt.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: De-alias model.language_model. -> model. for lora adapters

Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Extend language model tensor dealiasing in adapters

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary registration for GraniteSpeech in language model

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Plumb through mm prefix formatting for qformer tensors

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Refactor vision projector tensors to use predictor ID as the block

This is cleaner than stacking them. The modeling file hard-codes
single-layer qformers, so we can punt on the problem of multiple
multi-layer projectors.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add spatial offsets array hparam conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add stub plumbing for granite vision in mtmd

Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add new hparam and tensor naming in clip-impl.h

New hparams:
- KEY_PROJ_SAMPLE_QUERY_SIDE
- KEY_PROJ_SAMPLE_WINDOW_SIDE
- KEY_PROJ_SPATIAL_OFFSETS

New tensors:
- TN_MULTI_PROJ_IMG_POS
- TN_MULTI_PROJ_QUERY
- TN_MULTI_PROJ_LAYERNORM
- TN_MULTI_PROJ_LINEAR
- TN_MULTI_PROJ_NORM

Branch: Granite4Vision
AI-usage: none

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Move deepstack_layer_arr to llm hparam instead of mmproj

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove IS_DEEPSTACK_LAYERS

This appears to have been added during Qwen3 VL
(https://github.com/ggml-org/llama.cpp/pull/16780), but it was never
actually used.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: n_deepstack_layers -> deepstack_layer_arr

The old logic hard coded a correspondence between the first N layers of the
LLM and the 1->N entries in the input embeddings. Now, that relationship is
maintained at loading time if the GGUF value is single-valued. If it is
multi-valued, it loads directly allowing for deepstack layers to be spaced
out throughout the model.
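
A minimal sketch of the two loading paths, under stated assumptions: the
mapping is taken to be one entry per LLM layer holding the index of the
input-embedding slot injected at that layer, with -1 meaning "none" and
slot 0 reserved for the main embedding. Both function names are
hypothetical and are not the loader's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Single-valued case: rebuild the old "first N layers <-> slots 1..N"
// correspondence at load time. (Hypothetical helper.)
static std::vector<int32_t> expand_single_deepstack(int32_t n_deepstack, std::size_t n_layer) {
    std::vector<int32_t> mapping(n_layer, int32_t(-1));
    for (std::size_t il = 0; il < n_layer && il < static_cast<std::size_t>(n_deepstack > 0 ? n_deepstack : 0); ++il) {
        mapping[il] = static_cast<int32_t>(il) + 1; // slot 0 is the main embedding
    }
    return mapping;
}

// Multi-valued case: take the stored per-layer values as they are, so
// injection points can be spread throughout the model. (Hypothetical helper.)
static std::vector<int32_t> load_multi_deepstack(const std::vector<int32_t> & stored, std::size_t n_layer) {
    std::vector<int32_t> mapping(n_layer, int32_t(-1));
    for (std::size_t il = 0; il < n_layer && il < stored.size(); ++il) {
        mapping[il] = stored[il];
    }
    return mapping;
}
```

Under these assumptions, a single value of 3 on a 5-layer model expands to
{1, 2, 3, -1, -1}, while a stored array such as {-1, 1, -1, 2} is kept as
given and padded with -1.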

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use try/catch for single/multi valued deepstack info

The alternative would be to use get_key_or_arr, but then the single value
would be populated through the entire array and we'd need to detect that
and update it with the right correspondence.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add deepstack injection point for granite LLM

The use of ggml_add here assumes that the elements of inp_embd will be
pre-arranged to be the full embedding length with only the vision-masked
portions non-zero from the projector. This matches how Qwen3VL does it.
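
The assumption can be illustrated without ggml at all. In the sketch below,
plain float vectors stand in for tensors, `add_same_shape` stands in for the
element-wise add, and `scatter_to_full` is a hypothetical helper that
produces the pre-arranged buffer: full length, zero everywhere except at
image-token positions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Place projector rows at image-token positions of a zero-filled buffer of
// shape [n_tokens, n_embd]. (Hypothetical helper, row-major layout.)
static std::vector<float> scatter_to_full(const std::vector<float> & proj_out,
                                          const std::vector<bool> & is_image,
                                          std::size_t n_embd) {
    std::vector<float> full(is_image.size() * n_embd, 0.0f);
    std::size_t src = 0;
    for (std::size_t t = 0; t < is_image.size(); ++t) {
        if (!is_image[t]) {
            continue; // text token: stays zero
        }
        if (src + n_embd > proj_out.size()) {
            throw std::invalid_argument("not enough projector rows for the mask");
        }
        for (std::size_t i = 0; i < n_embd; ++i) {
            full[t * n_embd + i] = proj_out[src + i];
        }
        src += n_embd;
    }
    return full;
}

// Stand-in for an element-wise add of two same-shaped buffers.
static std::vector<float> add_same_shape(const std::vector<float> & a,
                                         const std::vector<float> & b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("shape mismatch");
    }
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}
```

Because the text positions of the scattered buffer are zero, the add leaves
text-token activations untouched and only shifts the image-token rows.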

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: add missing vision attn layernorm eps

Branch: Granite4Vision
AI-usage: full (OpenCode + Qwen 3.6-35B)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix missing prefix template for TN_QF_PROJ_LINEAR

It's not strictly necessary since vision uses the blockwise version, but it
makes the loading consistent.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add embedding scale and image grid pinpoints hparams in conversion

Also remove dead parsing for self._deepstack_layer_arr

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add mtmd KEY_ section for hparams shared with the LLM

In this case, we need the EMBEDDING_SCALE so we can unscale the image
embeddings to compensate for applying embedding scale to the input
embeddings

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Implement c++ hparam parsing

Branch: Granite4Vision
AI-usage: draft (Claude Code)
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Flatten pinpoints in conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing break

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: No reason to have modality prefix for img_pos

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tensor loading

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the right portion of speech for tensor loading!

Also plumb through the layernorm -> post_norm naming change

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add logging of deepstack_layers_arr if set

I also changed the print_f output type to int32_t to avoid printing
overflow values for -1. This could cause overflows on the other side, but
I can't imagine a value for any of the current array hparams that would
trigger that.
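
For reference, the effect is easy to reproduce in isolation. This is a
standalone sketch with hypothetical helper names, not the logging code
itself: the same -1 sentinel formatted as unsigned 32-bit and as signed
32-bit.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Format the value the way an unsigned 32-bit printout would show it.
static std::string fmt_as_u32(int32_t v) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%" PRIu32, static_cast<uint32_t>(v));
    return std::string(buf);
}

// Format the value as a signed 32-bit integer.
static std::string fmt_as_i32(int32_t v) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%" PRId32, v);
    return std::string(buf);
}
```

`fmt_as_u32(-1)` yields "4294967295", while `fmt_as_i32(-1)` yields "-1".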

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Make sure input embeddings are cont before f_embedding_scale

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add init and mmproj_embd cases for g4v

The n_mmproj_embd is 1+ to make space for the text embedding and all 8
projectors

Branch: Granite4Vision
AI-usage: draft (Bob)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Invert (h, w) -> (w, h) pinpoints

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Reorder projectors based on llm index and skip the first injection

The multi-projector stack has a strange asymmetry based on how it's
currently implemented for qwen3vl: on the mmproj side, it's all N
projectors, but the output of the "first" (by inp_embd index) projector is
automatically consumed as if it were a standard single-projector mmproj,
so the deepstack portion needs to only contain the 1-N entries.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix mmproj hparams in conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix ordering/logic for deepstack injection in granite

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix preprocessing config to match what the model needs

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* wip: Partial port of Eli's implementation

This is still pretty broken, but it's getting closer. It now happily
generates tokens, but the values are quite incorrect still. I suspect it's
caused by the mapping of projectors from safetensors to their respective
orders here.

Also, this implementation breaks encapsulation pretty badly in mtmd_encode.
This will need a big refactor to put the G4V-specific encoding logic
somewhere more appropriate.

Branch: Granite4Vision
AI-usage: draft (Claude Code, Bob)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix the pre-scaling on the input embeddings to correctly invert the scale

We've got tokens! They still don't line up quite right, so something's a
little off, but we're getting much closer now.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: invert embedding multiplier -> base_scale at load

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix setting image_resize_pad after new enum introduced

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add G4V to mmproj mapping in conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Re-add padding disable for non-hybrid hybrid models

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify G4V n_tokens computation

This is slightly more efficient and flexible for when we implement the
unpad cropping. IMO, it's also clearer that it is adding the number of
image_newline tokens (embeddings) to the grid, rather than recomputing the
entire count.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add new clip APIs for post-tile-encoding assembly

Granite 4 Vision uses llava-next style pack-and-unpad which requires
injecting the learned newline after each row of the tile grid. A row here
is a single row of the full grid, which measures (grid_x * cols_per_tile)
by (grid_y * rows_per_tile), so the newlines are injected between
individual tile rows. This is not something that can be handled with the
standard llava-uhd block-wise encoding.
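
A minimal sketch of the row-wise injection, assuming the tile outputs have
already been packed into one row-major feature grid and leaving out the
unpad cropping entirely. The function name and the flat float layout are
illustrative assumptions, not the clip API added here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// grid:    n_rows * n_cols embeddings, row-major, each newline.size() floats
// newline: the learned newline embedding
// returns: n_rows * (n_cols + 1) embeddings, one newline after every row
static std::vector<float> inject_row_newlines(const std::vector<float> & grid,
                                              std::size_t n_rows,
                                              std::size_t n_cols,
                                              const std::vector<float> & newline) {
    const std::size_t n_embd  = newline.size();
    const std::size_t row_len = n_cols * n_embd;
    if (grid.size() != n_rows * row_len) {
        throw std::invalid_argument("grid size does not match n_rows * n_cols * n_embd");
    }
    std::vector<float> out;
    out.reserve(n_rows * (n_cols + 1) * n_embd);
    for (std::size_t r = 0; r < n_rows; ++r) {
        const auto row_begin = grid.begin() + static_cast<std::ptrdiff_t>(r * row_len);
        out.insert(out.end(), row_begin, row_begin + static_cast<std::ptrdiff_t>(row_len));
        out.insert(out.end(), newline.begin(), newline.end());
    }
    return out;
}
```

Here n_cols would be grid_x * cols_per_tile and n_rows would be
grid_y * rows_per_tile, so each newline follows a row that spans every tile
in that grid row rather than a whole tile.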

Branch: Granite4Vision
AI-usage: draft (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add model interfaces for granite 4 vision assembler

I'm on the fence about the best organization of this. These free functions
allow the per-architecture logic in clip.cpp to access the model-specific
graph building, but they still require a fair bit of model-specific logic
in clip.cpp which is not ideal.

I think a better approach may be to replicate what is done with the
graph builders themselves (and possibly even make the assembler part of the
model's existing graph builder).

Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler

Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor(mtmd): Consolidate assembler logic into clip_assembler class family

Just like `clip_graph` is the base class for building the model-specific
encoder graphs, `clip_assembler` will be the base class for building the
model-specific assembler graphs. This allows the assembly pattern to follow
how the encoder pattern is implemented where the model-specific logic lives
in a subclass co-located with the encoder graph builder that gets
constructed by a simple factory method.

Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Comment improvement

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: granite_vision -> granite4_vision

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack

These pieces were never used on the c++ side (removed there in an earlier
commit), so this is just cleanup that I missed before.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Oops! I did not mean to commit one of my prompt files

But now it's too far back in history to effectively rebase out, even with
interactive and --rebase-merges :(

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing <algorithm> include for std::find

It seems that this was already pulled in on some platforms, but not on
others

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix Flake8 warnings in granite conversion module

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove clip_assembler in favor of clip_image_f32.append_token

Per conversation in the PR, the clip_assembler pattern was too invasive.
This is a compromise that limits model-specific blocks to add_media where
each preprocessed tile is annotated with an injection type, after which all
the token counting logic is generic and the newline injection itself is
handled in the graph based on the value for the given tile image.

Branch: Granite4Vision
AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor(convert): Split n_deepstack_layers and deepstack_layers (array)

Branch: Granite4Vision
AI-usage: full (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys

Branch: Granite4Vision
AI-usage: draft (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix GGUF key for deepstack_layers_arr

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs

This follows how gemma3 and gemma4 handle embedding scaling by skipping the
multiplier for raw input embeddings.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr)

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Fully revert changes to n_deepstack_layers and qwen3vl*

Since we're going to keep the GGUF KVs separate, it makes sense to just
keep the hparams separate too to limit the scope of this branch. The down
side is that n_deepstack_layers and deepstack_mapping_arr are potentially
conflicting.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Revert removal of "is_deepstack_layers" GGUF KV

This KV is not used at all on the c++ side, so it's fully dead, but there's
also no need to conflate this cleanup with the addition of G4V.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary ggml_cont and build_forward_expand in cbx

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Clean up comments

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Tighter and more flexible code for g4v_build_block

This could be refactored to look a lot more like granite-speech, but the
overall block constructs before/after the qformer are pretty different, so
for now I'm going to leave it as is and just tighten a bit.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary `unordered_set` include

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add architecture guard on deepstack_mapping_arr printout

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary AI-gen comment

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Always initialize deepstack_mapping_arr with -1 values

This was causing `test-llama-archs` to fail, likely due to trying to save
the uninitialized values, then re-loading them. It's safer to always
initialize so that other models don't forget and end up with undefined
behavior.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Remove TODO about block/vs non-block tensor mapping

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Move is_vision_feature_layer logic into clip_hparams

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use a bool for append_token

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Remove unnecessary comment

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unused get_model api

yikes!

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Rearrange helpers for g4v to be private members and use build_attn

Branch: Granite4Vision
AI-usage: full (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix off-by-one in vision layer index

This was inherited from the Claude Code implementation that pushed the
negative index inversion down into the model file.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix norm/post_norm mixup in conversion

face. palm. :(

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: More descriptive tensor names

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Apply PR cleanup for new conversion changes

AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix(convert): Remove duplicate V_ENC_EMBD_IMGNL

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: append_token -> add_newline

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Comment cleanup

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Cleaner error handling/checking

NOTE: format_string is not available in granite.cpp (and including
clip-impl.h to get it doesn't compile, so I think it violates the intended
encapsulation), so std::stringstream is the simplest answer.
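
A small sketch of the stream-based approach, with a hypothetical function
name and message layout. The real checks in granite.cpp may be worded
differently.

```cpp
#include <sstream>
#include <string>

// Build an error message without any printf-style helper.
// (Hypothetical helper; the fields are illustrative.)
static std::string describe_mismatch(const std::string & what, int expected, int actual) {
    std::ostringstream ss;
    ss << what << ": expected " << expected << ", got " << actual;
    return ss.str();
}
```

The resulting string can be passed straight to an exception such as
`std::runtime_error`, which keeps the file free of any dependency on the
clip-side formatting helper.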

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2026-06-05 17:44:59 +02:00

745 lines
29 KiB
C++

#pragma once
#include "ggml.h"
#include "gguf.h"
#include "clip.h"
#include <climits>
#include <cstdarg>
#include <cinttypes>
#include <string>
#include <map>
#include <sstream>
#include <vector>
#include <memory>
// Internal header for clip.cpp
#define MTMD_INTERNAL_HEADER
#define KEY_FTYPE "general.file_type"
#define KEY_NAME "general.name"
#define KEY_DESCRIPTION "general.description"
#define KEY_PROJ_TYPE "clip.projector_type"
#define KEY_HAS_AUDIO_ENC "clip.has_audio_encoder"
#define KEY_HAS_VISION_ENC "clip.has_vision_encoder"
#define KEY_USE_GELU "clip.use_gelu"
#define KEY_USE_SILU "clip.use_silu"
#define KEY_N_EMBD "clip.%s.embedding_length"
#define KEY_N_FF "clip.%s.feed_forward_length"
#define KEY_N_BLOCK "clip.%s.block_count"
#define KEY_PROJ_DIM "clip.%s.projection_dim"
#define KEY_N_HEAD "clip.%s.attention.head_count"
#define KEY_N_HEAD_KV "clip.%s.attention.head_count_kv"
#define KEY_LAYER_NORM_EPS "clip.%s.attention.layer_norm_epsilon"
// vision-specific
#define KEY_VISION_PROJ_TYPE "clip.vision.projector_type" // for models with mixed modalities
#define KEY_IMAGE_SIZE "clip.vision.image_size"
#define KEY_IMAGE_MIN_PIXELS "clip.vision.image_min_pixels"
#define KEY_IMAGE_MAX_PIXELS "clip.vision.image_max_pixels"
#define KEY_PREPROC_MIN_TILES "clip.vision.preproc_min_tiles"
#define KEY_PREPROC_MAX_TILES "clip.vision.preproc_max_tiles"
#define KEY_PREPROC_IMAGE_SIZE "clip.vision.preproc_image_size"
#define KEY_PATCH_SIZE "clip.vision.patch_size"
#define KEY_IMAGE_MEAN "clip.vision.image_mean"
#define KEY_IMAGE_STD "clip.vision.image_std"
#define KEY_FEATURE_LAYER "clip.vision.feature_layer"
#define KEY_PROJ_SCALE_FACTOR "clip.vision.projector.scale_factor"
#define KEY_PROJ_SAMPLE_QUERY_SIDE "clip.vision.projector.query_side"
#define KEY_PROJ_SAMPLE_WINDOW_SIDE "clip.vision.projector.window_side"
#define KEY_PROJ_SPATIAL_OFFSETS "clip.vision.projector.spatial_offsets"
#define KEY_SPATIAL_MERGE_SIZE "clip.vision.spatial_merge_size"
#define KEY_MM_PATCH_MERGE_TYPE "clip.vision.mm_patch_merge_type"
#define KEY_IMAGE_GRID_PINPOINTS "clip.vision.image_grid_pinpoints"
#define KEY_WIN_ATTN_PATTERN "clip.vision.n_wa_pattern"
#define KEY_WIN_ATTN_LAYER_INDEXES "clip.vision.wa_layer_indexes"
#define KEY_WA_PATTERN_MODE "clip.vision.wa_pattern_mode"
#define KEY_ATTN_WINDOW_SIZE "clip.vision.window_size"
#define KEY_MINICPMV_VERSION "clip.minicpmv_version"
#define KEY_MINICPMV_QUERY_NUM "clip.minicpmv_query_num"
#define KEY_SAM_N_HEAD "clip.vision.sam.head_count"
#define KEY_SAM_N_BLOCK "clip.vision.sam.block_count"
#define KEY_SAM_N_EMBD "clip.vision.sam.embedding_length"
// audio-specific
#define KEY_AUDIO_PROJ_TYPE "clip.audio.projector_type" // for models with mixed modalities
#define KEY_A_NUM_MEL_BINS "clip.audio.num_mel_bins"
#define KEY_A_PROJ_STACK_FACTOR "clip.audio.projector.stack_factor"
#define KEY_A_CHUNK_SIZE "clip.audio.chunk_size"
#define KEY_A_CONV_KERNEL_SIZE "clip.audio.conv_kernel_size"
#define KEY_A_MAX_POS_EMB "clip.audio.max_pos_emb"
#define KEY_A_PROJ_WINDOW_SIZE "clip.audio.projector.window_size"
#define KEY_A_PROJ_DOWNSAMPLE_RATE "clip.audio.projector.downsample_rate"
#define KEY_A_PROJ_HEAD_COUNT "clip.audio.projector.head_count"
//
// tensor name constants
//
#define TN_POS_EMBD "%s.position_embd.weight"
#define TN_CLASS_EMBD "v.class_embd"
#define TN_PATCH_EMBD "v.patch_embd.weight" // not renamed with a ".0" suffix, for backward compat
#define TN_PATCH_EMBD_1 "v.patch_embd.weight.1"
#define TN_PATCH_BIAS "v.patch_embd.bias"
#define TN_NORM_EMBD "v.norm_embd.%s"
#define TN_PATCH_NORM "v.patch_norm.%d.%s"
#define TN_ATTN_QKV "%s.blk.%d.attn_qkv.%s"
#define TN_ATTN_K "%s.blk.%d.attn_k.%s"
#define TN_ATTN_Q "%s.blk.%d.attn_q.%s"
#define TN_ATTN_V "%s.blk.%d.attn_v.%s"
#define TN_ATTN_OUTPUT "%s.blk.%d.attn_out.%s"
#define TN_ATTN_SINKS "%s.blk.%d.attn_sinks"
#define TN_ATTN_K_NORM "%s.blk.%d.attn_k_norm.%s"
#define TN_ATTN_Q_NORM "%s.blk.%d.attn_q_norm.%s"
#define TN_FFN_DOWN "%s.blk.%d.ffn_down.%s"
#define TN_FFN_GATE "%s.blk.%d.ffn_gate.%s"
#define TN_FFN_UP "%s.blk.%d.ffn_up.%s"
#define TN_LN_1 "%s.blk.%d.ln1.%s" // layer norm
#define TN_LN_2 "%s.blk.%d.ln2.%s" // layer norm
#define TN_LS_1 "%s.blk.%d.ls1.%s" // layer scale
#define TN_LS_2 "%s.blk.%d.ls2.%s" // layer scale
#define TN_LS_OUT "%s.blk.%d.out_scale.%s" // layer out scale (gemma4)
#define TN_ATTN_POST_NORM "%s.blk.%d.attn_post_norm.%s" // post-attn norm (gemma4)
#define TN_FFN_POST_NORM "%s.blk.%d.ffn_post_norm.%s" // post-FFN norm (gemma4)
#define TN_LN_PRE "%s.pre_ln.%s"
#define TN_LN_POST "%s.post_ln.%s"
#define TN_LLAVA_PROJ "mm.%d.%s"
#define TN_MM_UP "mm.up.%s"
#define TN_MM_GATE "mm.gate.%s"
#define TN_MM_DOWN "mm.down.%s"
#define TN_MM_POST_NORM "mm.post_norm.%s"
#define TN_MVLM_PROJ_MLP "mm.model.mlp.%d.%s"
#define TN_MVLM_PROJ_BLOCK "mm.model.mb_block.%d.block.%d.%s"
#define TN_MVLM_PROJ_PEG "mm.model.peg.%d.%s"
#define TN_IMAGE_NEWLINE "v.image_newline"
#define TN_IMAGE_SEPERATOR "v.view_seperator"
#define TN_MM_INP_NORM "mm.input_norm.weight"
#define TN_MM_INP_NORM_B "mm.input_norm.bias"
#define TN_MM_INP_PROJ "mm.input_projection.weight" // gemma3
#define TN_MM_SOFT_EMB_N "mm.soft_emb_norm.weight" // gemma3
#define TN_MM_PROJECTOR "mm.model.fc.%s" // idefics3, deepseekocr
#define TN_MM_PATCH_MERGER "mm.patch_merger.%s" // mistral small 3.1, glm4v
#define TN_TOK_IMG_BREAK "v.token_embd.img_break" // pixtral
#define TN_TOK_GLM_BOI "adapter.boi" // glm-edge (these embeddings are not in text model)
#define TN_TOK_GLM_EOI "adapter.eoi" // glm-edge (these embeddings are not in text model)
#define TN_DEEPSTACK_NORM "v.deepstack.%d.norm.%s" // qwen3vl deepstack
#define TN_DEEPSTACK_FC1 "v.deepstack.%d.fc1.%s" // qwen3vl deepstack
#define TN_DEEPSTACK_FC2 "v.deepstack.%d.fc2.%s" // qwen3vl deepstack
// minicpmv
#define TN_MINICPMV_POS_EMBD_K "resampler.pos_embed_k"
#define TN_MINICPMV_QUERY "resampler.query"
#define TN_MINICPMV_PROJ "resampler.proj.weight"
#define TN_MINICPMV_KV_PROJ "resampler.kv.weight"
#define TN_MINICPMV_ATTN "resampler.attn.%s.%s"
#define TN_MINICPMV_LN "resampler.ln_%s.%s"
// MiniCPM-V 4.6 ViT merger (window attention + MLP downsample),
// matching the upstream `vit_merger` module name in transformers.
#define TN_VIT_MERGER_LN1 "v.vit_merger.ln1.%s"
#define TN_VIT_MERGER_ATTN_Q "v.vit_merger.attn_q.%s"
#define TN_VIT_MERGER_ATTN_K "v.vit_merger.attn_k.%s"
#define TN_VIT_MERGER_ATTN_V "v.vit_merger.attn_v.%s"
#define TN_VIT_MERGER_ATTN_O "v.vit_merger.attn_out.%s"
#define TN_VIT_MERGER_DS_LN "v.vit_merger.ds_ln.%s"
#define TN_VIT_MERGER_DS_UP "v.vit_merger.ds_ffn_up.%s"
#define TN_VIT_MERGER_DS_DOWN "v.vit_merger.ds_ffn_down.%s"
#define TN_GLM_ADAPER_CONV "adapter.conv.%s"
#define TN_GLM_ADAPTER_LINEAR "adapter.linear.linear.%s"
#define TN_GLM_ADAPTER_NORM_1 "adapter.linear.norm1.%s"
#define TN_GLM_ADAPTER_D_H_2_4H "adapter.linear.dense_h_to_4h.%s"
#define TN_GLM_ADAPTER_GATE "adapter.linear.gate.%s"
#define TN_GLM_ADAPTER_D_4H_2_H "adapter.linear.dense_4h_to_h.%s"
// ultravox
#define TN_CONV1D "a.conv1d.%d.%s"
#define TN_CONV2D "a.conv2d.%d.%s"
#define TN_CONV_OUT "a.conv_out.%s"
#define TN_MM_AUDIO_MLP "mm.a.mlp.%d.%s"
#define TN_MM_AUDIO_FC "mm.a.fc.%s" // fully connected layer
#define TN_MM_NORM_PRE "mm.a.norm_pre.%s"
#define TN_MM_NORM_MID "mm.a.norm_mid.%s"
// cogvlm
#define TN_MM_POST_FC_NORM "mm.post_fc_norm.%s"
#define TN_MM_H_TO_4H "mm.up.%s"
#define TN_MM_GATE "mm.gate.%s"
#define TN_MM_4H_TO_H "mm.down.%s"
#define TN_TOK_BOI "v.boi"
#define TN_TOK_EOI "v.eoi"
// hunyuanvl (shared GGUF tensor names)
#define TN_MM_PRE_NORM "mm.pre_norm.%s"
#define TN_TOK_IMG_BEGIN "mm.image_begin"
#define TN_TOK_IMG_END "mm.image_end"
// deepseek-ocr
#define TN_SAM_POS_EMBD "v.sam.pos_embd.%s"
#define TN_SAM_PATCH_EMBD "v.sam.patch_embd.%s"
#define TN_SAM_PRE_NORM "v.sam.blk.%d.pre_ln.%s"
#define TN_SAM_POST_NORM "v.sam.blk.%d.post_ln.%s"
#define TN_SAM_ATTN_POS_H "v.sam.blk.%d.attn.pos_h.%s"
#define TN_SAM_ATTN_POS_W "v.sam.blk.%d.attn.pos_w.%s"
#define TN_SAM_ATTN_QKV "v.sam.blk.%d.attn.qkv.%s"
#define TN_SAM_ATTN_OUT "v.sam.blk.%d.attn.out.%s"
#define TN_SAM_FFN_UP "v.sam.blk.%d.mlp.lin1.%s"
#define TN_SAM_FFN_DOWN "v.sam.blk.%d.mlp.lin2.%s"
#define TN_SAM_NECK "v.sam.neck.%d.%s"
#define TN_SAM_NET "v.sam.net_%d.%s"
// deepseek-ocr-2
#define TN_RESMPL_QUERY "v.resample_query_%d.%s"
// (conformer) lfm2
#define TN_PRE_ENCODE_OUT "a.pre_encode.out.%s"
#define TN_FFN_NORM "%s.blk.%d.ffn_norm.%s"
#define TN_FFN_NORM_1 "%s.blk.%d.ffn_norm_1.%s"
#define TN_FFN_UP_1 "%s.blk.%d.ffn_up_1.%s"
#define TN_FFN_DOWN_1 "%s.blk.%d.ffn_down_1.%s"
#define TN_POS_BIAS_U "%s.blk.%d.pos_bias_u"
#define TN_POS_BIAS_V "%s.blk.%d.pos_bias_v"
#define TN_NORM_CONV "%s.blk.%d.norm_conv.%s"
#define TN_LINEAR_POS "%s.blk.%d.linear_pos.%s"
#define TN_CONV_DW "%s.blk.%d.conv_dw.%s"
#define TN_CONV_NORM "%s.blk.%d.conv_norm.%s"
#define TN_CONV_PW1 "%s.blk.%d.conv_pw1.%s"
#define TN_CONV_PW2 "%s.blk.%d.conv_pw2.%s"
#define TN_INP_PROJ "a.input_projection.%s"
#define TN_CTC_OUT "a.enc_ctc_out.%s"
#define TN_CTC_OUT_MID "a.enc_ctc_out_mid.%s"
#define TN_ATTN_REL_POS_EMB "%s.blk.%d.attn_rel_pos_emb"
// qformer projector
#define TN_QF_PROJ_QUERY "%s.proj_query"
#define TN_QF_PROJ_NORM "%s.proj_norm.%s"
#define TN_QF_PROJ_LINEAR "%s.proj_linear.%s"
#define TN_QF_SELF_ATTN_Q "%s.proj_blk.%d.self_attn_q.%s"
#define TN_QF_SELF_ATTN_K "%s.proj_blk.%d.self_attn_k.%s"
#define TN_QF_SELF_ATTN_V "%s.proj_blk.%d.self_attn_v.%s"
#define TN_QF_SELF_ATTN_O "%s.proj_blk.%d.self_attn_out.%s"
#define TN_QF_SELF_ATTN_N "%s.proj_blk.%d.self_attn_norm.%s"
#define TN_QF_CROSS_ATTN_Q "%s.proj_blk.%d.cross_attn_q.%s"
#define TN_QF_CROSS_ATTN_K "%s.proj_blk.%d.cross_attn_k.%s"
#define TN_QF_CROSS_ATTN_V "%s.proj_blk.%d.cross_attn_v.%s"
#define TN_QF_CROSS_ATTN_O "%s.proj_blk.%d.cross_attn_out.%s"
#define TN_QF_CROSS_ATTN_N "%s.proj_blk.%d.cross_attn_norm.%s"
#define TN_QF_FFN_UP "%s.proj_blk.%d.ffn_up.%s"
#define TN_QF_FFN_DOWN "%s.proj_blk.%d.ffn_down.%s"
#define TN_QF_FFN_NORM "%s.proj_blk.%d.ffn_norm.%s"
// multi-projector qformer (bid => projector ID)
#define TN_MULTI_PROJ_IMG_POS "v.proj_blk.%d.img_pos"
#define TN_MULTI_PROJ_QUERY "%s.proj_blk.%d.query"
#define TN_MULTI_PROJ_LINEAR "%s.proj_blk.%d.linear.%s"
#define TN_MULTI_PROJ_NORM "%s.proj_blk.%d.norm.%s"
#define TN_MULTI_PROJ_POST_NORM "%s.proj_blk.%d.post_norm.%s"
// gemma4 audio conformer
#define TN_A_MM_INP_PROJ "mm.a.input_projection.%s"
#define TN_A_MM_SOFT_EMB_N "mm.a.soft_emb_norm.%s"
#define TN_A_INP_PROJ "a.input_projection.%s"
#define TN_A_CONV1D "a.conv1d.%d.%s"
#define TN_A_CONV1D_NORM "a.conv1d.%d.norm.%s"
#define TN_A_OUT_PROJ "a.pre_encode.out.%s"
#define TN_A_ATTN_PRE_NORM "%s.blk.%d.attn_pre_norm.%s"
#define TN_A_ATTN_POST_NORM "%s.blk.%d.attn_post_norm.%s"
#define TN_A_ATTN_K_REL "%s.blk.%d.attn_k_rel.%s"
#define TN_A_PER_DIM_SCALE "%s.blk.%d.per_dim_scale.%s"
#define TN_A_PER_DIM_K_SCALE "%s.blk.%d.per_dim_k_scale.%s"
#define TN_A_FFN_POST_NORM "%s.blk.%d.ffn_post_norm.%s"
#define TN_A_FFN_POST_NORM_1 "%s.blk.%d.ffn_post_norm_1.%s"
// mobilenetv5 (gemma3n) definitions
#define TN_MNV5_STEM_CONV "v.conv_stem.conv.weight"
#define TN_MNV5_STEM_BIAS "v.conv_stem.conv.bias"
#define TN_MNV5_STEM_BN "v.conv_stem.bn.weight"
// Stage 0 Block (Edge Residual)
#define TN_MNV5_BLK_S0_EXP_W "v.blk.%d.%d.conv_exp.weight"
#define TN_MNV5_BLK_S0_BN1_W "v.blk.%d.%d.bn1.weight"
#define TN_MNV5_BLK_S0_PWL_W "v.blk.%d.%d.conv_pwl.weight"
#define TN_MNV5_BLK_S0_BN2_W "v.blk.%d.%d.bn2.weight"
// Stage 1+ Block (Universal Inverted Residual)
#define TN_MNV5_BLK_DW_START_W "v.blk.%d.%d.dw_start.conv.weight"
#define TN_MNV5_BLK_DW_START_BN "v.blk.%d.%d.dw_start.bn.weight"
#define TN_MNV5_BLK_DW_MID_W "v.blk.%d.%d.dw_mid.conv.weight"
#define TN_MNV5_BLK_DW_MID_BN "v.blk.%d.%d.dw_mid.bn.weight"
#define TN_MNV5_BLK_PW_EXP_W "v.blk.%d.%d.pw_exp.conv.weight"
#define TN_MNV5_BLK_PW_EXP_BN "v.blk.%d.%d.pw_exp.bn.weight"
#define TN_MNV5_BLK_PW_PROJ_W "v.blk.%d.%d.pw_proj.conv.weight"
#define TN_MNV5_BLK_PW_PROJ_BN "v.blk.%d.%d.pw_proj.bn.weight"
#define TN_MNV5_BLK_LAYER_SCALE "v.blk.%d.%d.layer_scale.gamma"
// Attention Components
#define TN_MNV5_ATTN_Q_W "v.blk.%d.%d.attn.query.proj.weight"
#define TN_MNV5_ATTN_K_W "v.blk.%d.%d.attn.key.proj.weight"
#define TN_MNV5_ATTN_V_W "v.blk.%d.%d.attn.value.proj.weight"
#define TN_MNV5_ATTN_O_W "v.blk.%d.%d.attn.output.proj.weight"
#define TN_MNV5_ATTN_K_DW "v.blk.%d.%d.attn.key.down_conv.weight"
#define TN_MNV5_ATTN_K_NORM "v.blk.%d.%d.attn.key.norm.weight"
#define TN_MNV5_ATTN_V_DW "v.blk.%d.%d.attn.value.down_conv.weight"
#define TN_MNV5_ATTN_V_NORM "v.blk.%d.%d.attn.value.norm.weight"
#define TN_MNV5_ATTN_NORM "v.blk.%d.%d.norm.weight" // Block norm used in attn blocks
// MSFA
#define TN_MNV5_MSFA_FFN_EXP_W "v.msfa.ffn.pw_exp.conv.weight"
#define TN_MNV5_MSFA_FFN_EXP_BN "v.msfa.ffn.pw_exp.bn.weight"
#define TN_MNV5_MSFA_FFN_PROJ_W "v.msfa.ffn.pw_proj.conv.weight"
#define TN_MNV5_MSFA_FFN_PROJ_BN "v.msfa.ffn.pw_proj.bn.weight"
#define TN_MNV5_MSFA_NORM "v.msfa.norm.weight"
// gemma4
#define TN_STD_BIAS "v.std_bias"
#define TN_STD_SCALE "v.std_scale"
// yasa2
#define TN_YASA_PATCH_LN_W "v.patch_ln.weight"
#define TN_YASA_PATCH_LN_B "v.patch_ln.bias"
#define TN_YASA_BACKBONE_LN_W "v.backbone_ln.weight"
#define TN_YASA_BACKBONE_LN_B "v.backbone_ln.bias"
#define TN_YASA_POS_EMBD "v.vision_pos_embed"
#define TN_YASA_STAGE_DOWN_LN "v.stage.%d.down.ln.%s"
#define TN_YASA_STAGE_DOWN_CONV "v.stage.%d.down.conv.%s"
#define TN_YASA_STAGE_BLK "v.stage.%d.blk.%d.%s.%s"
// round x up to a multiple of n (integer arithmetic)
#define CLIP_ALIGN(x, n) ((((x) + (n) - 1) / (n)) * (n))
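// Usage sketch for the rounding done by CLIP_ALIGN. This is illustrative only:
// `align_up` and `example_align_usage` are hypothetical names that do not exist in
// this header. The sketch assumes x >= 0 and n > 0, since the expression relies on
// truncating integer division.
/*
```cpp
#include <cassert>

// Same arithmetic as CLIP_ALIGN, written as a constexpr function so that each
// argument is evaluated exactly once.
constexpr int align_up(int x, int n) {
    return ((x + n - 1) / n) * n;
}

// A few spot checks on the rounding behaviour.
inline void example_align_usage() {
    assert(align_up(0, 14) == 0);   // zero stays zero
    assert(align_up(1, 14) == 14);  // rounds up to the first multiple
    assert(align_up(14, 14) == 14); // exact multiples are unchanged
    assert(align_up(15, 14) == 28); // one past a multiple goes to the next one
}
```
*/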
// forward declaration
// TODO: improve this later
struct clip_ctx;
enum projector_type {
PROJECTOR_TYPE_MLP,
PROJECTOR_TYPE_MLP_NORM,
PROJECTOR_TYPE_LDP,
PROJECTOR_TYPE_LDPV2,
PROJECTOR_TYPE_MINICPMV,
PROJECTOR_TYPE_GLM_EDGE,
PROJECTOR_TYPE_QWEN2VL,
PROJECTOR_TYPE_QWEN3VL,
PROJECTOR_TYPE_STEP3VL,
PROJECTOR_TYPE_GEMMA3,
PROJECTOR_TYPE_GEMMA3NV,
PROJECTOR_TYPE_GEMMA3NA,
PROJECTOR_TYPE_GEMMA4V,
PROJECTOR_TYPE_GEMMA4A,
PROJECTOR_TYPE_GEMMA4UV,
PROJECTOR_TYPE_GEMMA4UA,
PROJECTOR_TYPE_PHI4,
PROJECTOR_TYPE_IDEFICS3,
PROJECTOR_TYPE_PIXTRAL,
PROJECTOR_TYPE_QWEN25VL,
PROJECTOR_TYPE_ULTRAVOX,
PROJECTOR_TYPE_INTERNVL,
PROJECTOR_TYPE_LLAMA4,
PROJECTOR_TYPE_QWEN2A,
PROJECTOR_TYPE_QWEN3A,
PROJECTOR_TYPE_GLMA,
PROJECTOR_TYPE_QWEN25O, // will be replaced by QWEN2A or QWEN25VL depending on clip_ctx
PROJECTOR_TYPE_VOXTRAL,
PROJECTOR_TYPE_MERALION,
PROJECTOR_TYPE_MUSIC_FLAMINGO,
PROJECTOR_TYPE_LFM2,
PROJECTOR_TYPE_KIMIVL,
PROJECTOR_TYPE_PADDLEOCR,
PROJECTOR_TYPE_LIGHTONOCR,
PROJECTOR_TYPE_COGVLM,
PROJECTOR_TYPE_JANUS_PRO,
PROJECTOR_TYPE_DOTS_OCR,
PROJECTOR_TYPE_DEEPSEEKOCR,
PROJECTOR_TYPE_DEEPSEEKOCR2,
PROJECTOR_TYPE_LFM2A,
PROJECTOR_TYPE_GLM4V,
PROJECTOR_TYPE_YOUTUVL,
PROJECTOR_TYPE_YASA2,
PROJECTOR_TYPE_KIMIK25,
PROJECTOR_TYPE_NEMOTRON_V2_VL,
PROJECTOR_TYPE_HUNYUANVL,
PROJECTOR_TYPE_EXAONE4_5,
PROJECTOR_TYPE_MINICPMV4_6,
PROJECTOR_TYPE_GRANITE_SPEECH,
PROJECTOR_TYPE_MIMOVL,
PROJECTOR_TYPE_GRANITE4_VISION,
PROJECTOR_TYPE_UNKNOWN,
};
static std::map<projector_type, std::string> PROJECTOR_TYPE_NAMES = {
{ PROJECTOR_TYPE_MLP, "mlp" },
{ PROJECTOR_TYPE_LDP, "ldp" },
{ PROJECTOR_TYPE_LDPV2, "ldpv2"},
{ PROJECTOR_TYPE_MINICPMV, "resampler"},
{ PROJECTOR_TYPE_GLM_EDGE, "adapter"},
{ PROJECTOR_TYPE_QWEN2VL, "qwen2vl_merger"},
{ PROJECTOR_TYPE_QWEN25VL, "qwen2.5vl_merger"},
{ PROJECTOR_TYPE_QWEN3VL, "qwen3vl_merger"},
{ PROJECTOR_TYPE_STEP3VL, "step3vl"},
{ PROJECTOR_TYPE_GEMMA3, "gemma3"},
{ PROJECTOR_TYPE_GEMMA3NV, "gemma3nv"},
{ PROJECTOR_TYPE_GEMMA3NA, "gemma3na"},
{ PROJECTOR_TYPE_GEMMA4V, "gemma4v"},
{ PROJECTOR_TYPE_GEMMA4A, "gemma4a"},
{ PROJECTOR_TYPE_GEMMA4UV, "gemma4uv"},
{ PROJECTOR_TYPE_GEMMA4UA, "gemma4ua"},
{ PROJECTOR_TYPE_PHI4, "phi4"},
{ PROJECTOR_TYPE_IDEFICS3, "idefics3"},
{ PROJECTOR_TYPE_PIXTRAL, "pixtral"},
{ PROJECTOR_TYPE_ULTRAVOX, "ultravox"},
{ PROJECTOR_TYPE_INTERNVL, "internvl"},
{ PROJECTOR_TYPE_LLAMA4, "llama4"},
{ PROJECTOR_TYPE_QWEN2A, "qwen2a"},
{ PROJECTOR_TYPE_QWEN3A, "qwen3a"},
{ PROJECTOR_TYPE_GLMA, "glma"},
{ PROJECTOR_TYPE_QWEN25O, "qwen2.5o"},
{ PROJECTOR_TYPE_VOXTRAL, "voxtral"},
{ PROJECTOR_TYPE_MERALION, "meralion"},
{ PROJECTOR_TYPE_MUSIC_FLAMINGO, "musicflamingo"},
{ PROJECTOR_TYPE_LFM2, "lfm2"},
{ PROJECTOR_TYPE_KIMIVL, "kimivl"},
{ PROJECTOR_TYPE_PADDLEOCR, "paddleocr"},
{ PROJECTOR_TYPE_LIGHTONOCR,"lightonocr"},
{ PROJECTOR_TYPE_COGVLM, "cogvlm"},
{ PROJECTOR_TYPE_JANUS_PRO, "janus_pro"},
{ PROJECTOR_TYPE_DOTS_OCR, "dots_ocr"},
{ PROJECTOR_TYPE_DEEPSEEKOCR,"deepseekocr"},
{ PROJECTOR_TYPE_DEEPSEEKOCR2,"deepseekocr2"},
{ PROJECTOR_TYPE_LFM2A, "lfm2a"},
{ PROJECTOR_TYPE_GLM4V, "glm4v"},
{ PROJECTOR_TYPE_YOUTUVL, "youtuvl"},
{ PROJECTOR_TYPE_YASA2, "yasa2"},
{ PROJECTOR_TYPE_KIMIK25, "kimik25"},
{ PROJECTOR_TYPE_NEMOTRON_V2_VL, "nemotron_v2_vl"},
{ PROJECTOR_TYPE_EXAONE4_5, "exaone4_5"},
{ PROJECTOR_TYPE_HUNYUANVL, "hunyuanvl"},
{ PROJECTOR_TYPE_MINICPMV4_6, "minicpmv4_6"},
{ PROJECTOR_TYPE_GRANITE_SPEECH, "granite_speech"},
{ PROJECTOR_TYPE_MIMOVL, "mimovl"},
{ PROJECTOR_TYPE_GRANITE4_VISION, "granite4_vision"},
};
static projector_type clip_projector_type_from_string(const std::string & str) {
for (const auto & pair : PROJECTOR_TYPE_NAMES) {
if (pair.second == str) {
return pair.first;
}
}
return PROJECTOR_TYPE_UNKNOWN;
}
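// Sketch of the reverse lookup performed by clip_projector_type_from_string, reduced
// to a two-entry table so it can be read in isolation. All `demo_*` / `DEMO_*` names
// are hypothetical and exist only for this illustration.
/*
```cpp
#include <cassert>
#include <map>
#include <string>

enum demo_projector { DEMO_MLP, DEMO_LDP, DEMO_UNKNOWN };

static const std::map<demo_projector, std::string> DEMO_NAMES = {
    { DEMO_MLP, "mlp" },
    { DEMO_LDP, "ldp" },
};

// Linear scan over the enum -> name table; the first entry whose name matches wins,
// and anything not listed falls through to the UNKNOWN sentinel.
inline demo_projector demo_from_string(const std::string & str) {
    for (const auto & pair : DEMO_NAMES) {
        if (pair.second == str) {
            return pair.first;
        }
    }
    return DEMO_UNKNOWN;
}

inline void example_lookup_usage() {
    assert(demo_from_string("ldp") == DEMO_LDP);
    assert(demo_from_string("not-a-projector") == DEMO_UNKNOWN);
}
```
*/
// Note: an enum value without a table entry can never be returned by such a lookup.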
// RGB uint8 image
struct clip_image_u8 {
int nx;
int ny;
std::vector<uint8_t> buf;
};
// For images, buf.size() == nx*ny*3
// Memory layout: RGBRGBRGB...
// For audio, only one channel is used, buf.size() == nx*ny
// nx will be n_frames and ny will be n_mel
struct clip_image_f32 {
int nx;
int ny;
std::vector<float> buf;
// marks the global view, e.g. in DeepSeek-OCR models
bool add_viewsep = false;
// whether a learned newline token should be appended after the image (e.g. Granite4 Vision)
bool add_newline = false;
};
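// Sketch of how the interleaved RGB layout described above could be indexed.
// `demo_image_f32` and `demo_pixel_index` are hypothetical names. Row-major pixel
// order (x varies fastest) is an assumption here; the comments above only state the
// RGBRGB... interleaving and the total size nx*ny*3.
/*
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct demo_image_f32 {
    int nx;
    int ny;
    std::vector<float> buf; // expected size: nx*ny*3
};

// Offset of channel c (0 = R, 1 = G, 2 = B) of pixel (x, y),
// assuming rows are stored one after another.
inline size_t demo_pixel_index(const demo_image_f32 & img, int x, int y, int c) {
    return (static_cast<size_t>(y) * img.nx + x) * 3 + c;
}

inline void example_pixel_usage() {
    demo_image_f32 img{4, 2, std::vector<float>(4 * 2 * 3, 0.0f)};
    assert(demo_pixel_index(img, 0, 0, 0) == 0);                  // first R value
    assert(demo_pixel_index(img, 3, 1, 2) == img.buf.size() - 1); // last B value
}
```
*/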
//
// logging
//
static void clip_log_callback_default(enum ggml_log_level level, const char * text, void * user_data) {
(void) level;
(void) user_data;
fputs(text, stderr);
fflush(stderr);
}
struct clip_logger_state {
ggml_log_callback log_callback;
void * log_callback_user_data;
};
extern struct clip_logger_state g_logger_state;
static void clip_log_internal_v(enum ggml_log_level level, const char * format, va_list args) {
    if (format == NULL) {
        return;
    }
    // keep a copy: `args` is consumed by the first vsnprintf call and cannot be reused
    va_list args_copy;
    va_copy(args_copy, args);
    char buffer[128];
    int len = vsnprintf(buffer, 128, format, args);
    if (len < 0) {
        // formatting failed: the buffer content is unspecified, so print nothing
        va_end(args_copy);
        return;
    }
    if (len < 128) {
        g_logger_state.log_callback(level, buffer, g_logger_state.log_callback_user_data);
    } else {
        // message did not fit: retry with a heap buffer of the exact size
        char * buffer2 = (char *) calloc(len + 1, sizeof(char));
        if (buffer2 != NULL) {
            vsnprintf(buffer2, len + 1, format, args_copy);
            buffer2[len] = 0;
            g_logger_state.log_callback(level, buffer2, g_logger_state.log_callback_user_data);
            free(buffer2);
        }
    }
    va_end(args_copy);
}
static void clip_log_internal(enum ggml_log_level level, const char * format, ...) {
va_list args;
va_start(args, format);
clip_log_internal_v(level, format, args);
va_end(args);
}
#define LOG_DBG(...) clip_log_internal(GGML_LOG_LEVEL_DEBUG, __VA_ARGS__)
#define LOG_INF(...) clip_log_internal(GGML_LOG_LEVEL_INFO, __VA_ARGS__)
#define LOG_WRN(...) clip_log_internal(GGML_LOG_LEVEL_WARN, __VA_ARGS__)
#define LOG_ERR(...) clip_log_internal(GGML_LOG_LEVEL_ERROR, __VA_ARGS__)
#define LOG_CNT(...) clip_log_internal(GGML_LOG_LEVEL_CONT, __VA_ARGS__)
//
// cpp wrappers
//
// wrapper for clip_image_size
struct clip_image_size_deleter {
void operator()(clip_image_size * val) { clip_image_size_free(val); }
};
typedef std::unique_ptr<clip_image_size, clip_image_size_deleter> clip_image_size_ptr;
// wrapper for clip_image_u8
struct clip_image_u8_deleter {
void operator()(clip_image_u8 * val) { clip_image_u8_free(val); }
};
typedef std::unique_ptr<clip_image_u8, clip_image_u8_deleter> clip_image_u8_ptr;
// wrapper for clip_image_f32
struct clip_image_f32_deleter {
void operator()(clip_image_f32 * val) { clip_image_f32_free(val); }
};
typedef std::unique_ptr<clip_image_f32, clip_image_f32_deleter> clip_image_f32_ptr;
struct clip_image_u8_batch {
std::vector<clip_image_u8_ptr> entries;
};
struct clip_image_f32_batch {
std::vector<clip_image_f32_ptr> entries;
bool is_audio = false;
// for llava-uhd style models, we need to know the grid size
// note: entries.size() == grid_x * grid_y + 1 (one overview image)
int grid_x = 0;
int grid_y = 0;
clip_image_f32_batch clone() const {
clip_image_f32_batch new_batch{
/* entries */ {},
/* is_audio */ is_audio,
/* grid_x */ grid_x,
/* grid_y */ grid_y,
};
new_batch.entries.reserve(entries.size());
for (const auto & entry : entries) {
new_batch.entries.emplace_back(new clip_image_f32(*entry));
}
return new_batch;
}
};
//
// common utils
//
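static std::string string_format(const char * fmt, ...) {
    va_list ap;
    va_list ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);
    // first pass: measure the formatted length (excluding the terminating NUL)
    int size = vsnprintf(NULL, 0, fmt, ap);
    GGML_ASSERT(size >= 0 && size < INT_MAX); // NOLINT
    std::vector<char> buf(size + 1);
    // second pass: write into the exactly-sized buffer
    int size2 = vsnprintf(buf.data(), size + 1, fmt, ap2);
    GGML_ASSERT(size2 == size);
    va_end(ap2);
    va_end(ap);
    // use `size`, not buf.size(), so the terminating NUL does not become part of the string
    return std::string(buf.data(), size);
}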
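// Sketch of expanding a tensor-name template of the "%s.blk.%d.<name>.%s" shape used
// by the TN_* defines (prefix, block index, suffix). `demo_tensor_name` is a
// hypothetical, fixed-arity stand-in written for this illustration; it uses the same
// measure-then-format idea with snprintf and is not part of this header.
/*
```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

inline std::string demo_tensor_name(const char * tmpl, const char * prefix, int il, const char * suffix) {
    // first pass: ask snprintf how many characters are needed
    int size = std::snprintf(nullptr, 0, tmpl, prefix, il, suffix);
    if (size < 0) {
        return std::string();
    }
    // second pass: format into a buffer with room for the terminating NUL
    std::vector<char> buf(size + 1);
    std::snprintf(buf.data(), buf.size(), tmpl, prefix, il, suffix);
    // copy only the `size` formatted characters, without the NUL
    return std::string(buf.data(), size);
}

inline void example_tensor_name_usage() {
    const std::string name = demo_tensor_name("%s.blk.%d.ffn_norm.%s", "a", 3, "weight");
    assert(name == "a.blk.3.ffn_norm.weight");
}
```
*/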
static void string_replace_all(std::string & s, const std::string & search, const std::string & replace) {
if (search.empty()) {
return;
}
std::string builder;
builder.reserve(s.length());
size_t pos = 0;
size_t last_pos = 0;
while ((pos = s.find(search, last_pos)) != std::string::npos) {
builder.append(s, last_pos, pos - last_pos);
builder.append(replace);
last_pos = pos + search.length();
}
builder.append(s, last_pos, std::string::npos);
s = std::move(builder);
}
// split string by a `std::string delim` instead of `char delim`
// an empty delimiter yields the input as a single token
static std::vector<std::string> string_split_str(std::string s, const std::string & delimiter) {
    std::vector<std::string> tokens;
    if (delimiter.empty()) {
        // s.find("") always matches at position 0, which would never terminate the loop below
        tokens.push_back(s);
        return tokens;
    }
    size_t pos = 0;
    std::string token;
    while ((pos = s.find(delimiter)) != std::string::npos) {
        token = s.substr(0, pos);
        tokens.push_back(token);
        s.erase(0, pos + delimiter.length());
    }
    tokens.push_back(s);
    return tokens;
}
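// Sketch of splitting on a multi-character delimiter, written as a standalone copy
// so it can be compiled on its own. `demo_split` is a hypothetical name; the sketch
// assumes a non-empty delimiter.
/*
```cpp
#include <cassert>
#include <string>
#include <vector>

inline std::vector<std::string> demo_split(std::string s, const std::string & delimiter) {
    std::vector<std::string> tokens;
    size_t pos = 0;
    // repeatedly cut off everything up to the next delimiter
    while ((pos = s.find(delimiter)) != std::string::npos) {
        tokens.push_back(s.substr(0, pos));
        s.erase(0, pos + delimiter.length());
    }
    // whatever remains (possibly an empty string) is the last token
    tokens.push_back(s);
    return tokens;
}

inline void example_split_usage() {
    const std::vector<std::string> parts = demo_split("a::b::c", "::");
    assert(parts.size() == 3);
    assert(parts[0] == "a" && parts[1] == "b" && parts[2] == "c");
}
```
*/
// Note: a trailing delimiter produces a trailing empty token with this approach.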
// remove when moving to c++20
inline bool string_starts_with(std::string_view str, std::string_view prefix) {
return str.size() >= prefix.size() &&
str.compare(0, prefix.size(), prefix) == 0;
}
// remove when moving to c++20
inline bool string_ends_with(std::string_view str, std::string_view suffix) {
return str.size() >= suffix.size() &&
str.compare(str.size() - suffix.size(), suffix.size(), suffix) == 0;
}
//
// gguf utils
//
static std::string gguf_data_to_str(enum gguf_type type, const void * data, int i) {
switch (type) {
case GGUF_TYPE_UINT8: return std::to_string(((const uint8_t *)data)[i]);
case GGUF_TYPE_INT8: return std::to_string(((const int8_t *)data)[i]);
case GGUF_TYPE_UINT16: return std::to_string(((const uint16_t *)data)[i]);
case GGUF_TYPE_INT16: return std::to_string(((const int16_t *)data)[i]);
case GGUF_TYPE_UINT32: return std::to_string(((const uint32_t *)data)[i]);
case GGUF_TYPE_INT32: return std::to_string(((const int32_t *)data)[i]);
case GGUF_TYPE_UINT64: return std::to_string(((const uint64_t *)data)[i]);
case GGUF_TYPE_INT64: return std::to_string(((const int64_t *)data)[i]);
case GGUF_TYPE_FLOAT32: return std::to_string(((const float *)data)[i]);
case GGUF_TYPE_FLOAT64: return std::to_string(((const double *)data)[i]);
case GGUF_TYPE_BOOL: return ((const int8_t *)data)[i] != 0 ? "true" : "false";
default: return string_format("unknown type %d", type);
}
}
static std::string gguf_kv_to_str(const struct gguf_context * ctx_gguf, int i) {
const enum gguf_type type = gguf_get_kv_type(ctx_gguf, i);
switch (type) {
case GGUF_TYPE_STRING:
return gguf_get_val_str(ctx_gguf, i);
case GGUF_TYPE_ARRAY:
{
const enum gguf_type arr_type = gguf_get_arr_type(ctx_gguf, i);
int arr_n = gguf_get_arr_n(ctx_gguf, i);
const void * data = arr_type == GGUF_TYPE_STRING ? nullptr : gguf_get_arr_data(ctx_gguf, i);
std::stringstream ss;
ss << "[";
for (int j = 0; j < arr_n; j++) {
if (arr_type == GGUF_TYPE_STRING) {
std::string val = gguf_get_arr_str(ctx_gguf, i, j);
// escape backslashes first, then double quotes
string_replace_all(val, "\\", "\\\\");
string_replace_all(val, "\"", "\\\"");
ss << '"' << val << '"';
} else if (arr_type == GGUF_TYPE_ARRAY) {
ss << "???";
} else {
ss << gguf_data_to_str(arr_type, data, j);
}
if (j < arr_n - 1) {
ss << ", ";
}
}
ss << "]";
return ss.str();
}
default:
return gguf_data_to_str(type, gguf_get_val_data(ctx_gguf, i), 0);
}
}
//
// debugging
//
static void print_tensor_shape(ggml_tensor * t) {
printf("%s.shape = [", t->name);
for (int i = 0; i < ggml_n_dims(t); ++i) {
printf("%" PRId64, t->ne[i]);
if (i < ggml_n_dims(t) - 1) {
printf(", ");
}
}
printf("]\n");
}
static void print_tensor_data(ggml_tensor * t, uint8_t * data, int64_t n) {
ggml_type type = t->type;
int64_t * ne = t->ne;
size_t * nb = t->nb;
for (int64_t i3 = 0; i3 < ne[3]; i3++) {
printf("%s.data: [\n", t->name);
for (int64_t i2 = 0; i2 < ne[2]; i2++) {
if (i2 == n && ne[2] > 2*n) {
printf(" ..., \n");
i2 = ne[2] - n;
}
printf(" [\n");
for (int64_t i1 = 0; i1 < ne[1]; i1++) {
if (i1 == n && ne[1] > 2*n) {
printf(" ..., \n");
i1 = ne[1] - n;
}
printf(" [");
for (int64_t i0 = 0; i0 < ne[0]; i0++) {
if (i0 == n && ne[0] > 2*n) {
printf("..., ");
i0 = ne[0] - n;
}
size_t i = i3 * nb[3] + i2 * nb[2] + i1 * nb[1] + i0 * nb[0];
float v;
if (type == GGML_TYPE_F16) {
v = ggml_fp16_to_fp32(*(ggml_fp16_t *) &data[i]);
} else if (type == GGML_TYPE_F32) {
v = *(float *) &data[i];
} else if (type == GGML_TYPE_I32) {
v = (float) *(int32_t *) &data[i];
} else if (type == GGML_TYPE_I16) {
v = (float) *(int16_t *) &data[i];
} else if (type == GGML_TYPE_I8) {
v = (float) *(int8_t *) &data[i];
} else {
GGML_ABORT("fatal error");
}
printf("%8.4f", v);
if (i0 < ne[0] - 1) printf(", ");
}
printf("],\n");
}
printf(" ],\n");
}
printf(" ]\n");
}
}
//
// API used internally with mtmd
//
projector_type clip_get_projector_type(const struct clip_ctx * ctx);
void clip_set_debug_output_embeddings(struct clip_ctx * ctx, bool debug);