mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-06-30 17:47:40 +02:00
24bba7b98e
* add mtmd_image_preproc_out * add dev docs * remove unused clip API * rm unused clip_image_f32_batch::grid * change preprocess() call signature
36 lines
1.8 KiB
Markdown
36 lines
1.8 KiB
Markdown
# libmtmd dev guide
|
|
|
|
## History
|
|
|
|
Please refer to [multimodal.md](../../docs/multimodal.md) for a broader context.
|
|
|
|
In short:
|
|
- `libmtmd` started as a wrapper around `libllava` / `clip.cpp`
|
|
- Various components that used to be in `clip.cpp` are moved progressively to mtmd. For example, preprocessor is now part of mtmd
|
|
|
|
## Terminologies
|
|
|
|
- mtmd: **M**ul**T**i**M**o**D**al
|
|
- bitmap: representing a raw input data, for example: RGB image, PCM audio
|
|
- tiles / slices: for llava-uhd-style models, the preprocessor breaks a large input into smaller square images called tiles or slices
|
|
- chunk: a mtmd_input_chunk represents a preprocessed input that can then be passed through `mtmd_encode()`
|
|
|
|
## Pipeline
|
|
|
|
A typical pipeline of the core libmtmd is as follows:
|
|
- A bitmap (RGB image or PCM audio) is created
|
|
- Bitmap and the text prompt is provided to `mtmd_tokenize()` that breaks the input into chunks
|
|
- The tokenizer function first expands a "lazy" bitmap if it finds one. Typically, this is used by video, so that one media token corresponds to one input bitmap
|
|
- For models that support "fused" temporal frames like Qwen-VL, the tokenizer tries to merge pair of consecutive frames into one batch
|
|
- The preprocessor will then be called, which produces a list of chunks
|
|
- Depending on the model itself, special tokens will be injected to separate image chunks (i.e. llava-uhd-style models)
|
|
- Multiple bitmaps may be batched together to form a larger `mtmd_batch()`
|
|
- Single image or batch is encoded, via `mtmd_encode()` or `mtmd_batch_encode()`
|
|
- Get the output embeddings
|
|
|
|
## Helper
|
|
|
|
We provide a set of helper functions via `mtmd_helper` to make using libmtmd easier. The helper provides:
|
|
- Image, audio and video file decoding (for example, decode raw JPEG into RGB bitmap)
|
|
- Manage `llama_batch` and calls to `llama_decode`
|