mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-06-09 07:16:44 +02:00
64086f2b2f
* feat(convert): Get language model conversion working for 4.1 vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb python-side vision projector names and mappings There are several awkward things here: 1. Most of these are essentially identical to the audio qformer tensors. On the c++ side, that's mapped using the prefix, so the rest of the GGUF name needs to align, but on the python side there's no prefix notion, so they all get duplicated. 2. There are a couple of net-new tensors for vision, in particular PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as belonging to the qformer portion, but the GGUF name is simply proj_norm which conflicts with the ideal name for this new PROJ_NORM that is not qualified as part of the qformer. To get around this, I used "proj_layernorm" as the GGUF name. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python side architecture name Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python-side plumbing for setting FEATURE_LAYERS hparam Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side tensor naming defines NOTE: Usage of these hasn't been updated to include prefix yet Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Convert vision_feature_layer to an ordered vector We need to preserve the ordering of these feature index values so that they can be mapped to the sub-tensors within the stacked projectors. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Add architecture label plumbing Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(wip): Add partial conversion for mmproj This handles stacking the projector tensors and setting the new hparams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add gguf_writer and constant support for new hparams and deepstack layer arr Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Full conversion for mmproj w/ tensor mappings Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add lm_head skip for mmproj for 4.0 Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias text_config architecture in convert_lora_to_gguf.py Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add --trust-remote-code arg to convert_lora_to_gguf.py This defaults to False, but allows a user to enable it programmatically instead of using the interactive prompt. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias model.language_model. -> model.
for lora adapters Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Extend language model tensor dealiasing in adapters Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary registration for GraniteSpeech in language model Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb through mm prefix formatting for qformer tensors Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Refactor vision projector tensors to use predictor ID as the block This is cleaner than stacking them. The modeling file hard-codes single-layer qformers, so we can punt on the multiple multi-layer projectors problem. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add spatial offsets array hparam conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add stub plumbing for granite vision in mtmd Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new hparam and tensor naming in clip-impl.h New hparams: - KEY_PROJ_SAMPLE_QUERY_SIDE - KEY_PROJ_SAMPLE_WINDOW_SIDE - KEY_PROJ_SPATIAL_OFFSETS New tensors: - TN_MULTI_PROJ_IMG_POS - TN_MULTI_PROJ_QUERY - TN_MULTI_PROJ_LAYERNORM - TN_MULTI_PROJ_LINEAR - TN_MULTI_PROJ_NORM Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Move deepstack_layer_arr to llm hparam instead of mmproj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove IS_DEEPSTACK_LAYERS This appears to have been added during Qwen3 VL (https://github.com/ggml-org/llama.cpp/pull/16780), but it was never actually used.
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: n_deepstack_layers -> deepstack_layer_arr The old logic hard coded a correspondence between the first N layers of the LLM and the 1->N entries in the input embeddings. Now, that relationship is maintained at loading time if the GGUF value is single-valued. If it is multi-valued, it loads directly allowing for deepstack layers to be spaced out throughout the model. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use try/catch for single/multi valued deepstack info The alternative would be to use get_key_or_arr, but then the single value would be populated through the entire array and we'd need to detect that and update it with the right correspondence. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add deepstack injection point for granite LLM The use of ggml_add here assumes that the elements of inp_embd will be pre- arranged to be the full embedding length with only the vision-mask'ed portions non-zero from the projector. This matches how Qwen3VL does it. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: add missing vision attn layernorm eps Branch: Granite4Vision AI-usage: full (OpenCode + Qwen 3.6-35B) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix missing prefix template for TN_QF_PROJ_LINEAR It's not strictly necessary since vision uses the blockwise version, but it makes the loading consistent. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add embedding scale and image grid pinpoints hparams in conversion Also remove dead parsing for self._deepstack_layer_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add mtmd KEY_ section for hparams shared with the LLM In this case, we need the EMBEDDING_SCALE so we can unscale the image embeddings to compensate for applying embedding scale to the input embeddings Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Implement c++ hparam parsing Branch: Granite4Vision AI-usage: draft (Claude Code) Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Flatten pinpoints in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing break Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: No reason to have modality prefix for img_pos Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tensor loading Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the right portion of speech for tensor loading! Also plumb through the layernorm -> post_norm naming change Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add logging of deepstack_layers_arr if set I also changed the print_f output type to int32_t to avoid printing overflow values for -1. This could cause overflows on the other side, but I can't imagine a value for any of the current array hparams that would trigger that. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Make sure input embeddings are cont before f_embedding_scale Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add init and mmproj_embd cases for g4v The n_mmproj_embd is 1+ to make space for the text embedding and all 8 projectors Branch: Granite4Vision AI-usage: draft (Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Invert (h, w) -> (w, h) pinpoints Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Reorder projectors based on llm index and skip the first injection The multi-projector stack has a strange asymmetry based on how it's currently implemented for qwen3vl: on the mmproj side, it's all N projectors, but the output of the "first" (by inp_embd index) projector is automatically consumed as if it were a standard single-projector mmproj, so the deepstack portion needs to only contain the 1-N entries. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix mmproj hparams in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix ordering/logic for deepstack injection in granite Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix preprocessing config to match what the model needs Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * wip: Partial port of Eli's implementation This is still pretty broken, but it's getting closer. It now happily generates tokens, but the values are quite incorrect still. 
I suspect it's caused by the mapping of projectors from safetensors to their respective orders here. Also, this implementation breaks encapsulation pretty badly in mtmd_encode. This will need a big refactor to put the G4V-specific encoding logic somewhere more appropriate. Branch: Granite4Vision AI-usage: draft (Claude Code, Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix the pre-scaling on the input embeddings to correctly invert the scale We've got tokens! They still don't line up quite right, so something's a little off, but we're getting much closer now. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: invert embedding multiplier -> base_scale at load Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix setting image_resize_pad after new enum introduced Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add G4V to mmproj mapping in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Re-add padding disable for non-hybrid hybrid models Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Simplify G4V n_tokens computation This is slightly more efficient and flexible for when we implement the unpad cropping. IMO, it's also clearer that it is adding the number of image_newline tokens (embeddings) to the grid, rather than recomputing the entire count. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new clip APIs for post-tile-encoding assembly Granite 4 Vision uses llava-next style pack-and-unpad which requires injecting the learned newline after each row of the tile grid. 
A row here is a single row of the grid which is composed of (grid_x * cols_per_tile) * (grid_y * rows_per_tile), so the result is newlines injected in between individual tile rows, thus not something that can be handled with the standard llava-uhd block-wise encoding. Branch: Granite4Vision AI-usage: draft (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add model interfaces for granite 4 vision assembler I'm on the fence about the best organization of this. These free functions allow the per-architecture logic in clip.cpp to access the model-specific graph building, but they still require a fair bit of model-specific logic in clip.cpp which is not ideal. I think a better approach may be to replicate what is done with the graph builders themselves (and possibly even make the assembler part of the model's existing graph builder). Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(mtmd): Consolidate assembler logic into clip_assembler class family Just like `clip_graph` is the base class for building the model-specific encoder graphs, `clip_assembler` will be the base class for building the model-specific assembler graphs. This allows the assembly pattern to follow how the encoder pattern is implemented where the model-specific logic lives in a subclass co-located with the encoder graph builder that gets constructed by a simple factory method.
Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment improvement Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: granite_vision -> granite4_vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack These pieces were never used on the c++ side (removed there in an earlier commit), so this is just cleanup that I missed before. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Oops! I did not mean to commit one of my prompt files But now it's too far back in history to effectively rebase out, even with interactive and --rebase-merges :( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing <algorithm> include for std::find It seems that this was already pulled in on some platforms, but not on others Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix Flake8 warnings in granite conversion module Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove clip_assembler in favor of clip_image_f32.append_token Per conversation in the PR, the clip_assembler pattern was too invasive. This is a compromise that limits model-specific blocks to add_media where each preprocessed tile is annotated with an injection type, after which all the token counting logic is generic and the newline injection itself is handled in the graph based on the value for the given tile image. 
Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(convert): Split n_deepstack_layers and deepstack_layers (array) Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix GGUF key for deepstack_layers_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs This follows how gemma3 and gemma4 handle embedding scaling by skipping the multiplier for raw input embeddings. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Fully revert changes to n_deepstack_layers and qwen3vl* Since we're going to keep the GGUF KVs separate, it makes sense to just keep the hparams separate too to limit the scope of this branch. The down side is that n_deepstack_layers and deepstack_mapping_arr are potentially conflicting. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Revert removal of "is_deepstack_layers" GGUF KV This KV is not used at all on the c++ side, so it's fully dead, but there's also no need to conflate this cleanup with the addition of G4V. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary ggml_cont and build_forward_expand in cbx Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Clean up comments Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Tighter and more flexible code for g4v_build_block This could be refactored to look a lot more like granite-speech, but the overall block constructs before/after the qformer are pretty different, so for now I'm going to leave it as is and just tighten a bit. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary `unordered_set` include Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add architecture guard on deepstack_mapping_arr printout Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary AI-gen comment Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Always initialize deepstack_mapping_arr with -1 values This was causing `test-llama-archs` to fail, likely due to trying to save the uninitialized values, then re-loading them. It's safer to always initialize so that other models don't forget and end up with undefined behavior. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Remove TODO about block/vs non-block tensor mapping Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Move is_vision_feature_layer logic into clip_hparams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a bool for append_token Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Remove unnecessary comment Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unused get_model api yikes! Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rearrange helpers for g4v to be private members and use build_attn Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix off-by-one in vision layer index This was inherited from the Claude Code implementation that pushed the negative index inversion down into the model file. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix norm/post_norm mixup in conversion face. palm. 
:( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: More descriptive tensor names Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Apply PR cleanup for new conversion changes AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix(convert): Remove duplicate V_ENC_EMBD_IMGNL Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: append_token -> add_newline Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment cleanup Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Cleaner error handling/checking NOTE: format_string is not available in granite.cpp (and including clip-impl.h to get it doesn't compile, so I think it violates the intended encapsulation), so std::stringstream is the simplest answer. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
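The `--trust-remote-code` flag mentioned in the commit message above can be sketched in isolation. This is a minimal illustration rather than the script's full parser: `build_parser` is a hypothetical helper name, only the flag and the positional `lora_path` argument are reproduced, and `my-adapter` is a placeholder path. It shows that the flag stays `False` unless passed explicitly, which lets a caller opt in non-interactively.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, trimmed-down parser: only the flag under discussion
    # plus the positional adapter path are defined here.
    parser = argparse.ArgumentParser(description="sketch of the --trust-remote-code flag")
    parser.add_argument(
        "--trust-remote-code", default=False, action="store_true",
        help="trust remote code in the model",
    )
    parser.add_argument("lora_path", type=Path)
    return parser


parser = build_parser()
# Without the flag, remote code is not trusted.
default_args = parser.parse_args(["my-adapter"])
# Passing the flag opts in without any interactive prompt.
opted_in = parser.parse_args(["--trust-remote-code", "my-adapter"])
print(default_args.trust_remote_code, opted_in.trust_remote_code)
```

In the script itself the parsed value is forwarded to `AutoConfig.from_pretrained` through `load_hparams_from_hf`.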
547 lines
23 KiB
Python
Executable File
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from __future__ import annotations

from dataclasses import dataclass
import logging
import argparse
import os
import sys
import json
from math import prod
from pathlib import Path
from typing import TYPE_CHECKING, Any, Callable, Iterable, Iterator, Sequence, SupportsIndex, cast
from transformers import AutoConfig, AutoTokenizer

import torch

if TYPE_CHECKING:
    from torch import Tensor

if 'NO_LOCAL_GGUF' not in os.environ:
    sys.path.insert(1, str(Path(__file__).parent / 'gguf-py'))
import gguf
from gguf.constants import GGUFValueType

# reuse model definitions from the conversion/ package
from conversion import LazyTorchTensor, ModelBase, get_model_class

logger = logging.getLogger("lora-to-gguf")


@dataclass
class PartialLoraTensor:
    A: Tensor | None = None
    B: Tensor | None = None


# magic to support tensor shape modifications and splitting
class LoraTorchTensor:
    _lora_A: Tensor  # (n_rank, row_size)
    _lora_B: Tensor  # (col_size, n_rank)
    _rank: int

    def __init__(self, A: Tensor, B: Tensor):
        assert len(A.shape) == len(B.shape)
        assert A.shape[-2] == B.shape[-1]
        if A.dtype != B.dtype:
            A = A.to(torch.float32)
            B = B.to(torch.float32)
        self._lora_A = A
        self._lora_B = B
        self._rank = B.shape[-1]

    def get_lora_A_B(self) -> tuple[Tensor, Tensor]:
        return (self._lora_A, self._lora_B)

    def __getitem__(
        self,
        indices: (
            SupportsIndex
            | slice
            | tuple[SupportsIndex | slice | Tensor, ...]  # TODO: add ellipsis in the type signature
        ),
    ) -> LoraTorchTensor:
        shape = self.shape
        if isinstance(indices, SupportsIndex):
            if len(shape) > 2:
                return LoraTorchTensor(self._lora_A[indices], self._lora_B[indices])
            else:
                raise NotImplementedError  # can't return a vector
        elif isinstance(indices, slice):
            if len(shape) > 2:
                return LoraTorchTensor(self._lora_A[indices], self._lora_B[indices])
            else:
                return LoraTorchTensor(self._lora_A, self._lora_B[indices])
        elif isinstance(indices, tuple):
            assert len(indices) > 0
            if indices[-1] is Ellipsis:
                return self[indices[:-1]]
            # expand ellipsis
            indices = tuple(
                u
                for v in (
                    (
                        (slice(None, None) for _ in range(len(indices) - 1))
                        if i is Ellipsis
                        else (i,)
                    )
                    for i in indices
                )
                for u in v
            )

            if len(indices) < len(shape):
                indices = (*indices, *(slice(None, None) for _ in range(len(indices), len(shape))))

            # TODO: make sure this is correct
            indices_A = (
                *(
                    (
                        j.__index__() % self._lora_A.shape[i]
                        if isinstance(j, SupportsIndex)
                        else slice(None, None)
                    )
                    for i, j in enumerate(indices[:-2])
                ),
                slice(None, None),
                indices[-1],
            )
            indices_B = indices[:-1]
            return LoraTorchTensor(self._lora_A[indices_A], self._lora_B[indices_B])
        else:
            raise NotImplementedError  # unknown indice type

    @property
    def dtype(self) -> torch.dtype:
        assert self._lora_A.dtype == self._lora_B.dtype
        return self._lora_A.dtype

    @property
    def shape(self) -> tuple[int, ...]:
        assert len(self._lora_A.shape) == len(self._lora_B.shape)
        return (*self._lora_B.shape[:-1], self._lora_A.shape[-1])

    def size(self, dim=None):
        assert dim is None
        return self.shape

    def contiguous(self) -> LoraTorchTensor:
        return LoraTorchTensor(
            self._lora_A.contiguous(),
            self._lora_B.contiguous(),
        )

    def reshape(self, *shape: int | tuple[int, ...]) -> LoraTorchTensor:
        if isinstance(shape[0], tuple):
            new_shape: tuple[int, ...] = shape[0]
        else:
            new_shape = cast(tuple[int, ...], shape)
        orig_shape = self.shape
        if len(new_shape) < 2:
            raise NotImplementedError  # can't become a vector

        # expand -1 in the shape
        if any(dim == -1 for dim in new_shape):
            n_elems = prod(orig_shape)
            n_new_elems = prod(dim if dim != -1 else 1 for dim in new_shape)
            assert n_elems % n_new_elems == 0
            new_shape = (*(dim if dim != -1 else n_elems // n_new_elems for dim in new_shape),)

        if new_shape[-1] != orig_shape[-1]:
            raise NotImplementedError  # can't reshape the row size trivially

        shape_A = (*(1 for _ in new_shape[:-2]), self._rank, orig_shape[-1])
        shape_B = (*new_shape[:-1], self._rank)
        return LoraTorchTensor(
            self._lora_A.reshape(shape_A),
            self._lora_B.reshape(shape_B),
        )

    def reshape_as(self, other: Tensor) -> LoraTorchTensor:
        return self.reshape(*other.shape)

    def view(self, *size: int) -> LoraTorchTensor:
        return self.reshape(*size)

    def permute(self, *dims: int) -> LoraTorchTensor:
        shape = self.shape
        dims = tuple(dim - len(shape) if dim >= 0 else dim for dim in dims)
        if dims[-1] == -1:
            # TODO: support higher dimensional A shapes bigger than 1
            assert all(dim == 1 for dim in self._lora_A.shape[:-2])
            return LoraTorchTensor(self._lora_A, self._lora_B.permute(*dims))
        if len(shape) == 2 and dims[-1] == -2 and dims[-2] == -1:
            return LoraTorchTensor(self._lora_B.permute(*dims), self._lora_A.permute(*dims))
        else:
            # TODO: compose the above two
            raise NotImplementedError

    def transpose(self, dim0: int, dim1: int) -> LoraTorchTensor:
        shape = self.shape
        dims = [i for i in range(len(shape))]
        dims[dim0], dims[dim1] = dims[dim1], dims[dim0]
        return self.permute(*dims)

    def swapaxes(self, axis0: int, axis1: int) -> LoraTorchTensor:
        return self.transpose(axis0, axis1)

    def split(self, split_size: int | Sequence[int], dim: int = 0) -> tuple[LoraTorchTensor, ...]:
        shape = self.shape
        ndim = len(shape)
        if dim < 0:
            dim += ndim
        if dim == ndim - 1:
            A_chunks = self._lora_A.split(split_size, dim=-1)
            return tuple(LoraTorchTensor(a, self._lora_B) for a in A_chunks)
        elif dim == ndim - 2:
            B_chunks = self._lora_B.split(split_size, dim=-2)
            return tuple(LoraTorchTensor(self._lora_A, b) for b in B_chunks)
        else:
            B_chunks = self._lora_B.split(split_size, dim=dim)
            if self._lora_A.shape[dim] == 1:
                return tuple(LoraTorchTensor(self._lora_A, b) for b in B_chunks)
            A_chunks = self._lora_A.split(split_size, dim=dim)
            return tuple(LoraTorchTensor(a, b) for a, b in zip(A_chunks, B_chunks))

    def to(self, *args, **kwargs):
        return LoraTorchTensor(self._lora_A.to(*args, **kwargs), self._lora_B.to(*args, **kwargs))

    def __mul__(self, other) -> LoraTorchTensor:
        # Only output-side multiplication for now
        # W = B @ A, so M_out * W == (M_out * B) @ A
        if not isinstance(other, (int, float)) and other.shape and other.shape[-1] != 1:
            raise NotImplementedError
        return LoraTorchTensor(self._lora_A, self._lora_B * other)

    def __rmul__(self, other) -> LoraTorchTensor:
        return self * other

    @classmethod
    def __torch_function__(cls, func: Callable, types, args=(), kwargs=None):
        del types  # unused

        if kwargs is None:
            kwargs = {}

        if func is torch.permute:
            assert len(args)
            return type(args[0]).permute(*args, **kwargs)
        elif func is torch.reshape:
            assert len(args)
            return type(args[0]).reshape(*args, **kwargs)
        elif func is torch.stack:
            assert len(args)
            assert isinstance(args[0], Sequence)
            dim = kwargs.get("dim", 0)
            assert dim == 0
            return LoraTorchTensor(
                torch.stack([a._lora_A for a in args[0]], dim),
                torch.stack([b._lora_B for b in args[0]], dim),
            )
        elif func is torch.cat:
            assert len(args)
            assert isinstance(args[0], Sequence)
            dim = kwargs.get("dim", 0)
            assert dim == 0
            if len(args[0][0].shape) > 2:
                return LoraTorchTensor(
                    torch.cat([a._lora_A for a in args[0]], dim),
                    torch.cat([b._lora_B for b in args[0]], dim),
                )
            elif all(torch.equal(args[0][0]._lora_A, t._lora_A) for t in args[0][1:]):
                return LoraTorchTensor(
                    args[0][0]._lora_A,
                    torch.cat([b._lora_B for b in args[0]], dim),
                )
            else:
                raise NotImplementedError
        elif func is torch.split:
            assert len(args) and len(args) >= 2
            tensor, split_size = args[0], args[1]
            dim = args[2] if len(args) > 2 else kwargs.get("dim", 0)
            return tensor.split(split_size, dim=dim)
        else:
            raise NotImplementedError


def get_base_tensor_name(lora_tensor_name: str) -> str:
    base_name = lora_tensor_name.replace("base_model.model.", "")
    base_name = base_name.replace(".lora_A.weight", ".weight")
    base_name = base_name.replace(".lora_B.weight", ".weight")
    # models produced by mergekit-extract-lora have token embeddings in the adapter
    base_name = base_name.replace(".lora_embedding_A", ".weight")
    base_name = base_name.replace(".lora_embedding_B", ".weight")
    return base_name


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Convert a Hugging Face PEFT LoRA adapter to a GGUF file")
    parser.add_argument(
        "--outfile", type=Path,
        help="path to write to; default: based on input. {ftype} will be replaced by the outtype.",
    )
    parser.add_argument(
        "--outtype", type=str, choices=["f32", "f16", "bf16", "q8_0", "auto"], default="f32",
        help="output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type",
    )
    parser.add_argument(
        "--bigendian", action="store_true",
        help="model is executed on big endian machine",
    )
    parser.add_argument(
        "--no-lazy", action="store_true",
        help="use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)",
    )
    parser.add_argument(
        "--verbose", action="store_true",
        help="increase output verbosity",
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="only print out what will be done, without writing any new files",
    )
    parser.add_argument(
        "--base", type=Path,
        help="directory containing Hugging Face model config files (config.json, tokenizer.json) for the base model that the adapter is based on - only config is needed, actual model weights are not required. If base model is unspecified, it will be loaded from Hugging Face hub based on the adapter config",
    )
    parser.add_argument(
        "--base-model-id", type=str,
        help="the model ID of the base model, if it is not available locally or in the adapter config. If specified, it will ignore --base and load the base model config from the Hugging Face hub (Example: 'meta-llama/Llama-3.2-1B-Instruct')",
    )
    parser.add_argument(
        "--trust-remote-code", default=False, action="store_true",
        help="trust remote code in the model",
    )
    parser.add_argument(
        "lora_path", type=Path,
        help="directory containing Hugging Face PEFT LoRA config (adapter_config.json) and weights (adapter_model.safetensors or adapter_model.bin)",
    )

    return parser.parse_args()


def load_hparams_from_hf(hf_model_id: str, trust_remote_code: bool) -> tuple[dict[str, Any], Path | None]:
    from huggingface_hub import try_to_load_from_cache

    # normally, the adapter does not come with the base model config, so we load it via AutoConfig
    config = AutoConfig.from_pretrained(hf_model_id, trust_remote_code=trust_remote_code)
    cache_dir = try_to_load_from_cache(hf_model_id, "config.json")
    cache_dir = Path(cache_dir).parent if isinstance(cache_dir, str) else None

    return config.to_dict(), cache_dir


if __name__ == '__main__':
    args = parse_args()
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)

    ftype_map: dict[str, gguf.LlamaFileType] = {
        "f32": gguf.LlamaFileType.ALL_F32,
        "f16": gguf.LlamaFileType.MOSTLY_F16,
        "bf16": gguf.LlamaFileType.MOSTLY_BF16,
        "q8_0": gguf.LlamaFileType.MOSTLY_Q8_0,
        "auto": gguf.LlamaFileType.GUESSED,
    }

    ftype = ftype_map[args.outtype]

    dir_base_model: Path | None = args.base
    dir_lora: Path = args.lora_path
    base_model_id: str | None = args.base_model_id
    lora_config = dir_lora / "adapter_config.json"
    input_model = dir_lora / "adapter_model.safetensors"

    if args.outfile is not None:
        fname_out = args.outfile
    else:
        # output in the same directory as the model by default
        fname_out = dir_lora

    if os.path.exists(input_model):
        # lazy import load_file only if lora is in safetensors format.
        from safetensors.torch import load_file

        lora_model = load_file(input_model, device="cpu")
    else:
        input_model = os.path.join(dir_lora, "adapter_model.bin")
        lora_model = torch.load(input_model, map_location="cpu", weights_only=True)

    # load LoRA config
    with open(lora_config, "r") as f:
        lparams: dict[str, Any] = json.load(f)

    # load base model
    if base_model_id is not None:
        logger.info(f"Loading base model from Hugging Face: {base_model_id}")
        hparams, dir_base_model = load_hparams_from_hf(base_model_id, args.trust_remote_code)
    elif dir_base_model is None:
        if "base_model_name_or_path" in lparams:
            model_id = lparams["base_model_name_or_path"]
            logger.info(f"Loading base model from Hugging Face: {model_id}")
            try:
                hparams, dir_base_model = load_hparams_from_hf(model_id, args.trust_remote_code)
            except OSError as e:
                logger.error(f"Failed to load base model config: {e}")
                logger.error("Please try downloading the base model and add its path to --base")
                sys.exit(1)
        else:
            logger.error("'base_model_name_or_path' is not found in adapter_config.json")
            logger.error("Base model config is required. Please download the base model and add its path to --base")
            sys.exit(1)
    else:
        logger.info(f"Loading base model: {dir_base_model.name}")
        hparams = ModelBase.load_hparams(dir_base_model, False)

    with torch.inference_mode():
        try:
            model_arch = hparams.get("text_config", {}).get("architectures", hparams["architectures"])[0]
            logger.info("Using model architecture: %s", model_arch)
            model_class = get_model_class(model_arch)
        except NotImplementedError:
            logger.error(f"Model {hparams['architectures'][0]} is not supported")
            sys.exit(1)

        class LoraModel(model_class):  # ty: ignore[unsupported-base]
            model_arch = model_class.model_arch

            lora_alpha: float

            def __init__(self, *args, dir_lora_model: Path, lora_alpha: float, **kwargs):
                super().__init__(*args, **kwargs)

                self.dir_model_card = dir_lora_model
                self.lora_alpha = float(lora_alpha)

            def set_vocab(self):
                pass

            def set_type(self):
                self.gguf_writer.add_type(gguf.GGUFType.ADAPTER)
                self.gguf_writer.add_string(gguf.Keys.Adapter.TYPE, "lora")

            def set_gguf_parameters(self):
                # lora_alpha is a float, so log it with %s rather than truncating it with %d
                logger.debug("GGUF KV: %s = %s", gguf.Keys.Adapter.LORA_ALPHA, self.lora_alpha)
                self.gguf_writer.add_float32(gguf.Keys.Adapter.LORA_ALPHA, self.lora_alpha)
                alora_invocation_tokens = lparams.get("alora_invocation_tokens")
                invocation_string = lparams.get("invocation_string")
                if invocation_string and not alora_invocation_tokens:
                    logger.debug("Tokenizing invocation_string -> alora_invocation_tokens")
                    base_model_path_or_id = hparams.get("_name_or_path")
                    try:
                        tokenizer = AutoTokenizer.from_pretrained(base_model_path_or_id)
                    except ValueError:
                        logger.error("Unable to load tokenizer from %s", base_model_path_or_id)
                        raise
                    # NOTE: There's an off-by-one with the older aLoRAs where
                    #   the invocation string includes the "<|start_of_turn|>"
                    #   token, but the adapters themselves were trained to
                    #   activate _after_ that first token, so we drop it here.
                    alora_invocation_tokens = tokenizer(invocation_string)["input_ids"][1:]  # ty: ignore[call-non-callable]
                if alora_invocation_tokens:
                    logger.debug("GGUF KV: %s = %s", gguf.Keys.Adapter.ALORA_INVOCATION_TOKENS, alora_invocation_tokens)
                    self.gguf_writer.add_key_value(
                        gguf.Keys.Adapter.ALORA_INVOCATION_TOKENS,
                        alora_invocation_tokens,
                        GGUFValueType.ARRAY,
                        GGUFValueType.UINT32,
                    )

            def generate_extra_tensors(self) -> Iterable[tuple[str, Tensor]]:
                # Never add extra tensors (e.g. rope_freqs) for LoRA adapters
                return ()

            def get_tensors(self) -> Iterator[tuple[str, Tensor]]:
                tensor_map: dict[str, PartialLoraTensor] = {}

                for name, tensor in lora_model.items():
                    if self.lazy:
                        tensor = LazyTorchTensor.from_eager(tensor)
                    base_name = get_base_tensor_name(name)
                    # filter base name, ignore tensor transformations for now
                    data_gen = lambda g=tensor: g  # noqa: E731
                    if (titem := self.filter_tensors((base_name, data_gen))) is None:
                        continue
                    base_name, _ = titem
                    # note: mergekit-extract-lora also adds token embeddings to the adapter
                    is_lora_a = ".lora_A.weight" in name or ".lora_embedding_A" in name
                    is_lora_b = ".lora_B.weight" in name or ".lora_embedding_B" in name
                    if not is_lora_a and not is_lora_b:
                        if ".base_layer.weight" in name:
                            continue
                        # mergekit-extract-lora adds these layernorm tensors to the adapter; we need to keep them
                        if "_layernorm" in name or ".norm" in name:
                            yield (base_name, tensor)
                            continue
                        logger.error(f"Unexpected name '{name}': Not a lora_A or lora_B tensor")
                        if ".embed_tokens.weight" in name or ".lm_head.weight" in name:
                            logger.error("Embeddings are present in the adapter. This can be due to new tokens added during fine-tuning")
                            logger.error("Please refer to https://github.com/ggml-org/llama.cpp/pull/9948")
                        sys.exit(1)

                    if base_name in tensor_map:
                        if is_lora_a:
                            tensor_map[base_name].A = tensor
                        else:
                            tensor_map[base_name].B = tensor
                    else:
                        if is_lora_a:
                            tensor_map[base_name] = PartialLoraTensor(A=tensor)
                        else:
                            tensor_map[base_name] = PartialLoraTensor(B=tensor)

                for name, tensor in tensor_map.items():
                    assert tensor.A is not None
                    assert tensor.B is not None
                    yield (name, cast(torch.Tensor, LoraTorchTensor(tensor.A, tensor.B)))

            def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
                dest = list(super().modify_tensors(data_torch, name, bid))
                # some archs may use the same tensor for lm_head and output (tied word embeddings)
                # in this case, adapters targeting lm_head will fail when using llama-export-lora
                # therefore, we ignore them for now
                # see: https://github.com/ggml-org/llama.cpp/issues/9065
                if name == "lm_head.weight" and len(dest) == 0:
                    raise ValueError("lm_head is present in adapter, but is ignored in base model")
                for dest_name, dest_data in dest:
                    # mergekit-extract-lora adds these layernorm tensors to the adapter
                    if "_norm" in dest_name:
                        assert dest_data.dim() == 1
                        yield (dest_name, dest_data)
                        continue

                    # otherwise, we must get the lora_A and lora_B tensors
                    assert isinstance(dest_data, LoraTorchTensor)
                    lora_a, lora_b = dest_data.get_lora_A_B()

                    # note: mergekit-extract-lora flips and transposes A and B
                    # here we only need to transpose token_embd.lora_a, see llm_build_inp_embd()
                    if "token_embd.weight" in dest_name:
                        lora_a = lora_a.T

                    yield (dest_name + ".lora_a", lora_a)
                    yield (dest_name + ".lora_b", lora_b)

        alpha: float = lparams["lora_alpha"]

        model_instance = LoraModel(
            dir_base_model,
            ftype,
            fname_out,
            is_big_endian=args.bigendian,
            use_temp_file=False,
            eager=args.no_lazy,
            dry_run=args.dry_run,
            dir_lora_model=dir_lora,
            lora_alpha=alpha,
            hparams=hparams,
            remote_hf_model_id=base_model_id,
        )

        logger.info("Exporting model...")
        model_instance.write()
        logger.info(f"Model successfully exported to {model_instance.fname_out}")
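
The name normalization in `get_base_tensor_name` and the A/B pairing loop in `get_tensors` can be tried in isolation. The following is a minimal sketch under stated assumptions, not the script's implementation: `PartialPair`, `base_tensor_name`, and `pair_lora_tensors` are hypothetical names introduced for illustration, and plain strings stand in for tensors. Names that are neither A nor B matrices are simply skipped here, whereas the script treats them case by case.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class PartialPair:
    # Hypothetical stand-in for the script's PartialLoraTensor.
    A: Optional[Any] = None
    B: Optional[Any] = None


def base_tensor_name(lora_tensor_name: str) -> str:
    # Strip the PEFT prefix, then map every LoRA suffix to the base ".weight" name.
    name = lora_tensor_name.replace("base_model.model.", "")
    for suffix in (".lora_A.weight", ".lora_B.weight",
                   ".lora_embedding_A", ".lora_embedding_B"):
        name = name.replace(suffix, ".weight")
    return name


def pair_lora_tensors(tensors: Dict[str, Any]) -> Dict[str, PartialPair]:
    # Group the A and B matrices that target the same base tensor.
    pairs: Dict[str, PartialPair] = {}
    for name, value in tensors.items():
        is_a = ".lora_A.weight" in name or ".lora_embedding_A" in name
        is_b = ".lora_B.weight" in name or ".lora_embedding_B" in name
        if not is_a and not is_b:
            continue  # assumption: non-LoRA entries are ignored in this sketch
        pair = pairs.setdefault(base_tensor_name(name), PartialPair())
        if is_a:
            pair.A = value
        else:
            pair.B = value
    return pairs


if __name__ == "__main__":
    demo = {
        "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight": "A0",
        "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight": "B0",
    }
    for base, pair in pair_lora_tensors(demo).items():
        print(base, pair)
```

Both adapter entries collapse onto a single base tensor name, which is why the script can assert that every entry in its map ends up with both an A and a B matrix before writing.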