Compare commits

26 Commits

Author SHA1 Message Date
0cc4m c3f9d25706 Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (#11161)
* Vulkan: Remove float16 use in shaders

* Fix validation error about subgroup_size_control extension
2025-01-10 06:39:33 +01:00
Molly Sophia ee7136c6d1 llama: add support for QRWKV6 model architecture (#11001)

* WIP: Add support for RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Some graph simplification

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add support for RWKV6Qwen2 with cpu and cuda GLA

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix some typos

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix wkv test & add gla test

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix cuda warning

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update README.md

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update ggml/src/ggml-cuda/gla.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix fused lerp weights loading with RWKV6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* better sanity check skipping for QRWKV6 in llama-quant

thanks @compilade

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-10 09:58:08 +08:00
Akarshan Biswas c6860cc734 SYCL: Refactor ggml_sycl_compute_forward (#11121)
* SYCL: refactor ggml_sycl_compute_forward

* SYCL: add back GGML_USED(dst) to ggml_sycl_cpy

* SYCL: add function name to noop debug

* SYCL: Some device info print refactoring and add details of XMX availability
2025-01-10 08:13:03 +08:00
Tei Home 1204f97270 doc: add cuda guide for fedora (#11135)
Since NVIDIA does not release CUDA for in-maintenance versions of Fedora, the process of setting up the CUDA toolkit on Fedora has become quite involved. This guide should help mere mortals install CUDA for development in a Fedora 39 toolbox environment, without affecting the host system.
2025-01-09 11:32:06 +00:00
Daniel Bevenius 8eceb888d7 server : add tooltips to settings and themes btn (#11154)
* server : add tooltips to settings and themes btn

This commit adds tooltips to the settings and themes buttons in the
webui. The tooltip will be displayed below the actual buttons when
hovered over.

The motivation for this change is to clarify the purpose of the themes
button.

* squash! server : add tooltips to settings and themes btn

This commit adds a tooltip to the '...' button when a chat has been
started. The tooltip is "Chat options" which I think could be a good
description as the dropdown contains options to delete or download the
current chat.

* rm tooltip for 3 dots button

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-09 11:28:29 +01:00
Pierrick Hymbert f8feb4b01a model: Add support for PhiMoE arch (#11003)
* model: support phimoe

* python linter

* doc: minor

Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>

* doc: minor

Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>

* doc: add phimoe as supported model

ggml-ci

---------

Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>
2025-01-09 11:21:41 +01:00
Georgi Gerganov be0e950c91 media : remove old img [no ci] 2025-01-09 11:15:15 +02:00
Xuan Son Nguyen d9feae1c06 llama-chat : add phi 4 template (#11148) 2025-01-09 10:07:33 +01:00
hydai 8d59d91171 fix: add missing msg in static_assert (#11143)
Signed-off-by: hydai <z54981220@gmail.com>
2025-01-08 20:03:28 +00:00
Vinesh Janarthanan 8a1d9c25fa gguf-py : move scripts directory (#11116)
* Moved scripts dir and fixed pyproject.toml

* updated readme

* fixed README urls

* bump pypi gguf to v0.14.0

* retrigger ci

* empty commit - trigger ci
2025-01-08 20:54:58 +02:00
Eric Curtin 1bf839b1e8 Enhance user input handling for llama-run (#11138)
The main motivation for this change is that it was not handling
ctrl-c/ctrl-d correctly. Modify `read_user_input` to handle EOF,
"/bye" command, and empty input cases. Introduce `get_user_input`
function to manage user input loop and handle different return
cases.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-08 18:47:05 +00:00
Xuan Son Nguyen f7cd13301c ci : use actions from ggml-org (#11140) 2025-01-08 16:09:20 +01:00
Xuan Son Nguyen 4d2b3d8804 lora : improve compat with mergekit-extract-lora (#11131)
* (wip) support mergekit-extracted lora

* support mergekit-extract-lora

* use lora->get_scale

* correct comment

* correct norm name & condition

* add some hints
2025-01-08 15:59:53 +01:00
Georgi Gerganov c07d437bbd llama : avoid hardcoded QK_K (#11061)
ggml-ci
2025-01-08 16:19:36 +02:00
Georgi Gerganov 99a3755a3c sync : ggml 2025-01-08 13:40:30 +02:00
Radoslav Gerganov c792dcf488 ggml : allow loading backend with env variable (ggml/1059)
ref: #1058
2025-01-08 13:40:18 +02:00
Xuan Son Nguyen 80ccf5d725 ci : pin dependency to specific version (#11137)
* ci : pin dependency to specific version

* will this fix ec?
2025-01-08 12:07:20 +01:00
Georgi Gerganov a3c1232c3f arg : option to exclude arguments from specific examples (#11136)
* arg : option to exclude arguments from specific examples

ggml-ci

* readme : remove old args [no ci]
2025-01-08 12:55:36 +02:00
amritahs-ibm 8cef75c743 llamafile : ppc64le MMA INT8 implementation (#10912)
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le using MMA
builtins for quantised int8 datatype.

This change results in a 10% - 70% improvement
in total speed (i.e. all tokens/total time), across
various batch sizes.

The patch is tested with Meta-Llama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2025-01-08 12:54:19 +02:00
Georgi Gerganov 0d52a69e4b ci : fix cmake option (#11125) 2025-01-08 11:29:34 +02:00
Mathieu Baudier 02f0430141 Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (#11117)
* Disable GL_KHR_cooperative_matrix Vulkan extension if not available.

* Perform Vulkan extensions checks in a more sensible order

* Remove unnecessary #ifdef directive
2025-01-08 09:18:13 +01:00
ag2s20150909 bec2183f2c fix: Vulkan shader gen binary path when Cross-compiling (#11096)
* fix: Vulkan shader gen binary path when cross compiling
2025-01-08 09:17:29 +01:00
Johannes Gäßler 53ff6b9b9f GGUF: C++ refactor, backend support, misc fixes (#11030)
* GGUF: C++ refactor, backend support, misc fixes

remove ggml_tensor.backend

update CODEOWNERS [no ci]

remove gguf_get_data from API

revise GGUF API data types
2025-01-07 18:01:58 +01:00
Diego Devesa 017cc5f446 ggml-backend : only offload from host buffers (fix) (#11124) 2025-01-07 16:11:57 +01:00
Diego Devesa a3d50bc022 ggml-backend : only offload from host buffers (#11120) 2025-01-07 12:38:05 +01:00
Radoslav Gerganov a4dd490069 rpc : code cleanup (#11107)
Remove duplicated macros, use GGML_LOG_ERROR for errors
2025-01-07 08:37:02 +02:00
100 changed files with 4534 additions and 2252 deletions
+2 -2
@@ -665,7 +665,7 @@ jobs:
- build: 'llvm-arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
- build: 'msvc-arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-msvc.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DBUILD_SHARED_LIBS=O'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-msvc.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
- build: 'llvm-arm64-opencl-adreno'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON'
@@ -1237,7 +1237,7 @@ jobs:
- name: Create release
id: create_release
uses: anzz1/action-create-release@v1
uses: ggml-org/action-create-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
+1 -2
@@ -97,10 +97,9 @@ jobs:
GITHUB_BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
GITHUB_REPOSITORY_OWNER: '${{ github.repository_owner }}'
# https://github.com/jlumbroso/free-disk-space/tree/54081f138730dfa15788a46383842cd2f914a1be#example
- name: Free Disk Space (Ubuntu)
if: ${{ matrix.config.free_disk_space == true }}
uses: jlumbroso/free-disk-space@main
uses: ggml-org/free-disk-space@v1.3.1
with:
# this might remove tools that are actually needed,
# if set to "true" but frees about 6 GB
+3 -1
@@ -23,5 +23,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: editorconfig-checker/action-editorconfig-checker@main
- uses: editorconfig-checker/action-editorconfig-checker@v2
with:
version: v3.0.3
- run: editorconfig-checker
+6
@@ -3,3 +3,9 @@
/ci/ @ggerganov
/.devops/*.Dockerfile @ngxson
/examples/server/ @ngxson
/ggml/src/ggml-cuda/fattn* @JohannesGaessler
/ggml/src/ggml-cuda/mmq.* @JohannesGaessler
/ggml/src/ggml-cuda/mmv.* @JohannesGaessler
/ggml/src/ggml-cuda/mmvq.* @JohannesGaessler
/ggml/src/ggml-opt.cpp @JohannesGaessler
/ggml/src/gguf.cpp @JohannesGaessler
+2
@@ -69,6 +69,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [PhiMoE](https://github.com/ggerganov/llama.cpp/pull/11003)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
@@ -98,6 +99,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
- [x] [QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
- [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)
#### Multimodal
+13 -4
@@ -22,6 +22,11 @@ common_arg & common_arg::set_examples(std::initializer_list<enum llama_example>
return *this;
}
common_arg & common_arg::set_excludes(std::initializer_list<enum llama_example> excludes) {
this->excludes = std::move(excludes);
return *this;
}
common_arg & common_arg::set_env(const char * env) {
help = help + "\n(env: " + env + ")";
this->env = env;
@@ -37,6 +42,10 @@ bool common_arg::in_example(enum llama_example ex) {
return examples.find(ex) != examples.end();
}
bool common_arg::is_exclude(enum llama_example ex) {
return excludes.find(ex) != excludes.end();
}
bool common_arg::get_value_from_env(std::string & output) {
if (env == nullptr) return false;
char * value = std::getenv(env);
@@ -420,7 +429,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
* - if both {LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_*,} are set, we will prioritize the LLAMA_EXAMPLE_* matching current example
*/
auto add_opt = [&](common_arg arg) {
if (arg.in_example(ex) || arg.in_example(LLAMA_EXAMPLE_COMMON)) {
if ((arg.in_example(ex) || arg.in_example(LLAMA_EXAMPLE_COMMON)) && !arg.is_exclude(ex)) {
ctx_arg.options.push_back(std::move(arg));
}
};
@@ -649,7 +658,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, const std::string & value) {
params.prompt = value;
}
));
).set_excludes({LLAMA_EXAMPLE_SERVER}));
add_opt(common_arg(
{"--no-perf"},
string_format("disable internal libllama performance timings (default: %s)", params.no_perf ? "true" : "false"),
@@ -673,7 +682,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.prompt.pop_back();
}
}
));
).set_excludes({LLAMA_EXAMPLE_SERVER}));
add_opt(common_arg(
{"--in-file"}, "FNAME",
"an input file (repeat to specify multiple files)",
@@ -700,7 +709,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.prompt = ss.str();
fprintf(stderr, "Read %zu bytes from binary file %s\n", params.prompt.size(), value.c_str());
}
));
).set_excludes({LLAMA_EXAMPLE_SERVER}));
add_opt(common_arg(
{"-e", "--escape"},
string_format("process escapes sequences (\\n, \\r, \\t, \\', \\\", \\\\) (default: %s)", params.escape ? "true" : "false"),
+3
@@ -12,6 +12,7 @@
struct common_arg {
std::set<enum llama_example> examples = {LLAMA_EXAMPLE_COMMON};
std::set<enum llama_example> excludes = {};
std::vector<const char *> args;
const char * value_hint = nullptr; // help text or example for arg value
const char * value_hint_2 = nullptr; // for second arg value
@@ -53,9 +54,11 @@ struct common_arg {
) : args(args), value_hint(value_hint), value_hint_2(value_hint_2), help(help), handler_str_str(handler) {}
common_arg & set_examples(std::initializer_list<enum llama_example> examples);
common_arg & set_excludes(std::initializer_list<enum llama_example> excludes);
common_arg & set_env(const char * env);
common_arg & set_sparam();
bool in_example(enum llama_example ex);
bool is_exclude(enum llama_example ex);
bool get_value_from_env(std::string & output);
bool has_value_from_env();
std::string to_string();
+3
@@ -2,6 +2,9 @@
#define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING
#endif
#include "ggml.h"
#include "gguf.h"
#include "common.h"
#include "log.h"
// Change JSON_ASSERT from assert() to GGML_ASSERT:
+134 -4
@@ -326,6 +326,7 @@ class Model:
gguf.MODEL_TENSOR.TIME_MIX_W2,
gguf.MODEL_TENSOR.TIME_MIX_DECAY_W1,
gguf.MODEL_TENSOR.TIME_MIX_DECAY_W2,
gguf.MODEL_TENSOR.TIME_MIX_LERP_FUSED,
gguf.MODEL_TENSOR.POSNET_NORM1,
gguf.MODEL_TENSOR.POSNET_NORM2,
)
@@ -2562,6 +2563,63 @@ class Phi3MiniModel(Model):
yield (self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FACTORS_SHORT), torch.tensor(short_factors, dtype=torch.float32))
@Model.register("PhiMoEForCausalLM")
class PhiMoeModel(Phi3MiniModel):
model_arch = gguf.MODEL_ARCH.PHIMOE
_experts: list[dict[str, Tensor]] | None = None
def set_gguf_parameters(self):
super().set_gguf_parameters()
self.gguf_writer.add_expert_used_count(self.hparams["num_experts_per_tok"])
self.gguf_writer.add_expert_count(self.hparams["num_local_experts"])
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# process the experts separately
if name.find("block_sparse_moe.experts") != -1:
n_experts = self.hparams["num_local_experts"]
assert bid is not None
if self._experts is None:
self._experts = [{} for _ in range(self.block_count)]
self._experts[bid][name] = data_torch
if len(self._experts[bid]) >= n_experts * 3:
tensors: list[tuple[str, Tensor]] = []
# merge the experts into a single 3d tensor
for w_name in ["w1", "w2", "w3"]:
datas: list[Tensor] = []
for xid in range(n_experts):
ename = f"model.layers.{bid}.block_sparse_moe.experts.{xid}.{w_name}.weight"
datas.append(self._experts[bid][ename])
del self._experts[bid][ename]
data_torch = torch.stack(datas, dim=0)
merged_name = f"model.layers.{bid}.block_sparse_moe.experts.{w_name}.weight"
new_name = self.map_tensor_name(merged_name)
tensors.append((new_name, data_torch))
return tensors
else:
return []
return [(self.map_tensor_name(name), data_torch)]
def prepare_tensors(self):
super().prepare_tensors()
if self._experts is not None:
# flatten `list[dict[str, Tensor]]` into `list[str]`
experts = [k for d in self._experts for k in d.keys()]
if len(experts) > 0:
raise ValueError(f"Unprocessed experts: {experts}")
@Model.register("PlamoForCausalLM")
class PlamoModel(Model):
model_arch = gguf.MODEL_ARCH.PLAMO
@@ -3259,6 +3317,8 @@ class Rwkv6Model(Model):
# required by llama.cpp, unused
self.gguf_writer.add_head_count(0)
lerp_weights: dict[int, dict[str, Tensor]] = {}
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
new_name = self.map_tensor_name(name)
@@ -3274,14 +3334,84 @@ class Rwkv6Model(Model):
if new_name.endswith("time_mix_decay.weight") or "lerp" in new_name:
data_torch = data_torch.squeeze()
rescale_every_n_layers = self.hparams["rescale_every"]
if rescale_every_n_layers > 0:
if new_name.endswith("time_mix_output.weight") or new_name.endswith("channel_mix_value.weight"):
data_torch = data_torch.div_(2 ** int(bid // rescale_every_n_layers))
try:
rescale_every_n_layers = self.hparams["rescale_every"]
if rescale_every_n_layers > 0:
if new_name.endswith("time_mix_output.weight") or new_name.endswith("channel_mix_value.weight"):
data_torch = data_torch.div_(2 ** int(bid // rescale_every_n_layers))
except KeyError:
pass
# concat time_mix_lerp weights to reduce some cpu overhead
# also reduces the number of tensors in the model
if bid is not None and "time_mix_lerp" in new_name and "time_mix_lerp_x" not in new_name:
try:
self.lerp_weights[bid][new_name] = data_torch
except KeyError:
self.lerp_weights[bid] = {new_name: data_torch}
if all(f"blk.{bid}.time_mix_lerp_{i}.weight" in self.lerp_weights[bid].keys() for i in ["w", "k", "v", "r", "g"]):
new_name = f"blk.{bid}.time_mix_lerp_fused.weight"
data = torch.stack([self.lerp_weights[bid][f"blk.{bid}.time_mix_lerp_{i}.weight"].unsqueeze(0) for i in ["w", "k", "v", "r", "g"]], dim=0).unsqueeze(1)
yield (new_name, data)
return
yield (new_name, data_torch)
@Model.register("RWKV6Qwen2ForCausalLM")
class RWKV6Qwen2Model(Rwkv6Model):
model_arch = gguf.MODEL_ARCH.RWKV6QWEN2
def set_vocab(self):
try:
self._set_vocab_sentencepiece()
except FileNotFoundError:
self._set_vocab_gpt2()
def set_gguf_parameters(self):
block_count = self.hparams["num_hidden_layers"]
num_attention_heads = self.hparams["num_attention_heads"]
num_key_value_heads = self.hparams["num_key_value_heads"]
hidden_size = self.hparams["hidden_size"]
head_size = hidden_size // num_attention_heads
rms_norm_eps = self.hparams["rms_norm_eps"]
intermediate_size = self.hparams["intermediate_size"]
time_mix_extra_dim = 64 if hidden_size >= 4096 else 32
time_decay_extra_dim = 128 if hidden_size >= 4096 else 64
# RWKV isn't context limited
self.gguf_writer.add_context_length(1048576)
self.gguf_writer.add_embedding_length(hidden_size)
self.gguf_writer.add_block_count(block_count)
self.gguf_writer.add_wkv_head_size(head_size)
self.gguf_writer.add_time_mix_extra_dim(time_mix_extra_dim)
self.gguf_writer.add_time_decay_extra_dim(time_decay_extra_dim)
self.gguf_writer.add_feed_forward_length(intermediate_size)
self.gguf_writer.add_file_type(self.ftype)
# special parameters for time_mixing in RWKV6QWEN2
self.gguf_writer.add_layer_norm_rms_eps(rms_norm_eps)
self.gguf_writer.add_token_shift_count(1)
# RWKV6QWEN2 use grouped key/value like GQA
self.gguf_writer.add_head_count_kv(num_key_value_heads)
# required by llama.cpp, unused
self.gguf_writer.add_head_count(0)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
for new_name, data in super().modify_tensors(data_torch, name, bid):
if "time_mix_w1" in new_name or "time_mix_w2" in new_name:
data = data.view(5, -1, data.shape[-1])
# rwkv6qwen2 has a different order of rkvwg instead of the original wkvrg
# permute them here to avoid code changes
data = torch.stack([data[3], data[1], data[2], data[0], data[4]], dim=0).view(-1, data.shape[-1])
if "w2" in new_name:
data = data.view(5, -1, data.shape[-1])
yield (new_name, data)
continue
yield (new_name, data)
@Model.register("MambaForCausalLM", "MambaLMHeadModel", "FalconMambaForCausalLM")
class MambaModel(Model):
model_arch = gguf.MODEL_ARCH.MAMBA
+31 -3
@@ -226,6 +226,9 @@ def get_base_tensor_name(lora_tensor_name: str) -> str:
base_name = lora_tensor_name.replace("base_model.model.", "")
base_name = base_name.replace(".lora_A.weight", ".weight")
base_name = base_name.replace(".lora_B.weight", ".weight")
# models produced by mergekit-extract-lora have token embeddings in the adapter
base_name = base_name.replace(".lora_embedding_A", ".weight")
base_name = base_name.replace(".lora_embedding_B", ".weight")
return base_name
@@ -260,6 +263,10 @@ def parse_args() -> argparse.Namespace:
"--base", type=Path,
help="directory containing Hugging Face model config files (config.json, tokenizer.json) for the base model that the adapter is based on - only config is needed, actual model weights are not required. If base model is unspecified, it will be loaded from Hugging Face hub based on the adapter config",
)
parser.add_argument(
"--base-model-id", type=str,
help="the model ID of the base model, if it is not available locally or in the adapter config. If specified, it will ignore --base and load the base model config from the Hugging Face hub (Example: 'meta-llama/Llama-3.2-1B-Instruct')",
)
parser.add_argument(
"lora_path", type=Path,
help="directory containing Hugging Face PEFT LoRA config (adapter_model.json) and weights (adapter_model.safetensors or adapter_model.bin)",
@@ -290,6 +297,7 @@ if __name__ == '__main__':
dir_base_model: Path | None = args.base
dir_lora: Path = args.lora_path
base_model_id: str | None = args.base_model_id
lora_config = dir_lora / "adapter_config.json"
input_model = dir_lora / "adapter_model.safetensors"
@@ -313,7 +321,10 @@ if __name__ == '__main__':
lparams: dict[str, Any] = json.load(f)
# load base model
if dir_base_model is None:
if base_model_id is not None:
logger.info(f"Loading base model from Hugging Face: {base_model_id}")
hparams = load_hparams_from_hf(base_model_id)
elif dir_base_model is None:
if "base_model_name_or_path" in lparams:
model_id = lparams["base_model_name_or_path"]
logger.info(f"Loading base model from Hugging Face: {model_id}")
@@ -371,11 +382,16 @@ if __name__ == '__main__':
if self.lazy:
tensor = LazyTorchTensor.from_eager(tensor)
base_name = get_base_tensor_name(name)
is_lora_a = ".lora_A.weight" in name
is_lora_b = ".lora_B.weight" in name
# note: mergekit-extract-lora also adds token embeddings to the adapter
is_lora_a = ".lora_A.weight" in name or ".lora_embedding_A" in name
is_lora_b = ".lora_B.weight" in name or ".lora_embedding_B" in name
if not is_lora_a and not is_lora_b:
if ".base_layer.weight" in name:
continue
# mergekit-extract-lora add these layernorm to the adapter, we need to keep them
if "_layernorm" in name or ".norm" in name:
yield (base_name, tensor)
continue
logger.error(f"Unexpected name '{name}': Not a lora_A or lora_B tensor")
if ".embed_tokens.weight" in name or ".lm_head.weight" in name:
logger.error("Embeddings is present in the adapter. This can be due to new tokens added during fine tuning")
@@ -407,9 +423,21 @@ if __name__ == '__main__':
if name == "lm_head.weight" and len(dest) == 0:
raise ValueError("lm_head is present in adapter, but is ignored in base model")
for dest_name, dest_data in dest:
# mergekit-extract-lora add these layernorm to the adapter
if "_norm" in dest_name:
assert dest_data.dim() == 1
yield (dest_name, dest_data)
continue
# otherwise, we must get the lora_A and lora_B tensors
assert isinstance(dest_data, LoraTorchTensor)
lora_a, lora_b = dest_data.get_lora_A_B()
# note: mergekit-extract-lora flip and transpose A and B
# here we only need to transpose token_embd.lora_a, see llm_build_inp_embd()
if "token_embd.weight" in dest_name:
lora_a = lora_a.T
yield (dest_name + ".lora_a", lora_a)
yield (dest_name + ".lora_b", lora_b)
+2
@@ -127,6 +127,8 @@ For detailed info, please refer to [llama.cpp for SYCL](./backend/SYCL.md).
This provides GPU acceleration using an NVIDIA GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from the [NVIDIA developer site](https://developer.nvidia.com/cuda-downloads).
If you are using Fedora (using Fedora Workstation, or an 'Atomic' variant such as Silverblue), or would like to set up CUDA in a toolbox, please consider our [Fedora CUDA guide](./cuda-fedora.md). Unfortunately, the process is not as simple as one might expect.
- Using `CMake`:
```bash
+317
@@ -0,0 +1,317 @@
# Setting Up CUDA on Fedora
In this guide we set up [NVIDIA CUDA](https://docs.nvidia.com/cuda/) in a toolbox container. This guide applies to:
- [Fedora Workstation](https://fedoraproject.org/workstation/)
- [Atomic Desktops for Fedora](https://fedoraproject.org/atomic-desktops/)
- [Fedora Spins](https://fedoraproject.org/spins)
- [Other Distributions](https://containertoolbx.org/distros/), including `Red Hat Enterprise Linux >= 8.`, `Arch Linux`, and `Ubuntu`.
## Table of Contents
- [Prerequisites](#prerequisites)
- [Monitoring NVIDIA CUDA Repositories](#monitoring-nvidia-cuda-repositories)
- [Using the Fedora 39 CUDA Repository](#using-the-fedora-39-cuda-repository)
- [Creating a Fedora Toolbox Environment](#creating-a-fedora-toolbox-environment)
- [Installing Essential Development Tools](#installing-essential-development-tools)
- [Adding the CUDA Repository](#adding-the-cuda-repository)
- [Installing `nvidia-driver-libs`](#installing-nvidia-driver-libs)
- [Manually Resolving Package Conflicts](#manually-resolving-package-conflicts)
- [Finalizing the Installation of `nvidia-driver-libs`](#finalizing-the-installation-of-nvidia-driver-libs)
- [Installing the CUDA Meta-Package](#installing-the-cuda-meta-package)
- [Configuring the Environment](#configuring-the-environment)
- [Verifying the Installation](#verifying-the-installation)
- [Conclusion](#conclusion)
- [Troubleshooting](#troubleshooting)
- [Additional Notes](#additional-notes)
- [References](#references)
## Prerequisites
- **Toolbox Installed on the Host System** `Fedora Silverblue` and `Fedora Workstation` both include toolbox by default; other distributions may need to install the [toolbox package](https://containertoolbx.org/install/).
- **NVIDIA Drivers and Graphics Card installed on Host System (optional)** To run CUDA programs such as `llama.cpp`, the host should be set up to access your NVIDIA hardware. Fedora hosts can use the [RPM Fusion Repository](https://rpmfusion.org/Howto/NVIDIA).
- **Internet connectivity** to download packages.
### Monitoring NVIDIA CUDA Repositories
Before proceeding, it is advisable to check if NVIDIA has updated their CUDA repositories for your Fedora version. NVIDIA's repositories can be found at:
- [Fedora 40 CUDA Repository](https://developer.download.nvidia.com/compute/cuda/repos/fedora40/x86_64/)
- [Fedora 41 CUDA Repository](https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/)
As of the latest update, these repositories do not contain the `cuda` meta-package or are missing essential components.
### Using the Fedora 39 CUDA Repository
Since the newer repositories are incomplete, we'll use the Fedora 39 repository:
- [Fedora 39 CUDA Repository](https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/)
**Note:** Fedora 39 is no longer maintained, so we recommend using a toolbox environment to prevent system conflicts.
## Creating a Fedora Toolbox Environment
This guide focuses on Fedora hosts, but with small adjustments, it can work for other hosts. Using a Fedora 39 toolbox allows us to install the necessary packages without affecting the host system.
**Note:** Toolbox is available for other systems, and even without Toolbox, it is possible to use Podman or Docker.
We do not recommend installing on the host system: Fedora 39 is out of maintenance, so you should instead upgrade the host to a maintained version of Fedora.
1. **Create a Fedora 39 Toolbox:**
```bash
toolbox create --image registry.fedoraproject.org/fedora-toolbox:39 --container fedora-toolbox-39-cuda
```
2. **Enter the Toolbox:**
```bash
toolbox enter --container fedora-toolbox-39-cuda
```
Inside the toolbox, you have root privileges and can install packages without affecting the host system.
## Installing Essential Development Tools
1. **Synchronize the DNF Package Manager:**
```bash
sudo dnf distro-sync
```
2. **Install the Default Text Editor (Optional):**
```bash
sudo dnf install vim-default-editor --allowerasing
```
The `--allowerasing` flag lets DNF remove conflicting installed packages in order to resolve dependencies.
3. **Install Development Tools and Libraries:**
```bash
sudo dnf install @c-development @development-tools cmake
```
This installs essential packages for compiling software, including `gcc`, `make`, and other development headers.
## Adding the CUDA Repository
Add the NVIDIA CUDA repository to your DNF configuration:
```bash
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/cuda-fedora39.repo
```
After adding the repository, synchronize the package manager again:
```bash
sudo dnf distro-sync
```
## Installing `nvidia-driver-libs`
Attempt to install `nvidia-driver-libs`:
```bash
sudo dnf install nvidia-driver-libs
```
**Explanation:**
- `nvidia-driver-libs` contains necessary NVIDIA driver libraries required by CUDA.
- This step might fail due to conflicts with existing NVIDIA drivers on the host system.
## Manually Resolving Package Conflicts
If the installation fails due to conflicts, we'll manually download and install the required packages, excluding conflicting files.
### 1. Download the `nvidia-driver-libs` RPM
```bash
sudo dnf download --arch x86_64 nvidia-driver-libs
```
You should see a file similar to:
```
nvidia-driver-libs-560.35.05-1.fc39.x86_64.rpm
```
### 2. Attempt to Install the RPM
```bash
sudo dnf install nvidia-driver-libs-560.35.05-1.fc39.x86_64.rpm
```
**Expected Error:**
Installation may fail with errors pointing to conflicts with `egl-gbm` and `egl-wayland`.
**Note: It is important to carefully read the error messages to identify the exact paths that need to be excluded.**
### 3. Download Dependencies
```bash
sudo dnf download --arch x86_64 egl-gbm egl-wayland
```
### 4. Install `egl-gbm` with Excluded Paths
Exclude conflicting files during installation:
```bash
sudo rpm --install --verbose --hash \
--excludepath=/usr/lib64/libnvidia-egl-gbm.so.1.1.2 \
--excludepath=/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json \
egl-gbm-1.1.2^20240919gitb24587d-3.fc39.x86_64.rpm
```
**Explanation:**
- The `--excludepath` option skips installing files that conflict with existing files.
- Adjust the paths based on the error messages you receive.
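
Collecting the conflicting paths from the error output can be scripted. The following is a minimal Python sketch, not part of the original guide. It assumes the conflict lines have the form `file <path> from install of <package> conflicts with file from package <other>`; the wording may differ between `rpm` versions, and the helper name `exclude_flags` and the sample message are illustrative only.

```python
import re

# Assumed shape of an rpm file-conflict line; adjust the pattern if your
# rpm version words the message differently.
CONFLICT_RE = re.compile(
    r"file (\S+) from install of \S+ conflicts with file from package"
)


def exclude_flags(error_output: str) -> list[str]:
    """Return one --excludepath flag per conflicting path, in first-seen order."""
    seen: list[str] = []
    for match in CONFLICT_RE.finditer(error_output):
        path = match.group(1)
        if path not in seen:  # the same file may be reported more than once
            seen.append(path)
    return [f"--excludepath={path}" for path in seen]


if __name__ == "__main__":
    # Hypothetical sample message, for illustration only.
    sample = (
        "file /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json "
        "from install of egl-gbm-example conflicts with file from package other-example\n"
    )
    print(" \\\n".join(exclude_flags(sample)))
```

The printed flags can be pasted into the `rpm --install` command in place of the example paths. Review them first, since excluding a file means it is not installed at all.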
### 5. Install `egl-wayland` with Excluded Paths
```bash
sudo rpm --install --verbose --hash \
--excludepath=/usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json \
egl-wayland-1.1.17^20241118giteeb29e1-5.fc39.x86_64.rpm
```
### 6. Install `nvidia-driver-libs` with Excluded Paths
```bash
sudo rpm --install --verbose --hash \
--excludepath=/usr/share/glvnd/egl_vendor.d/10_nvidia.json \
--excludepath=/usr/share/nvidia/nvoptix.bin \
nvidia-driver-libs-560.35.05-1.fc39.x86_64.rpm
```
**Note:**
- Replace the paths with the ones causing conflicts in your installation if they differ.
- The `--verbose` option prints more detailed output, and `--hash` prints hash marks as a progress indicator while the package is installed.
## Finalizing the Installation of `nvidia-driver-libs`
After manually installing the dependencies, run:
```bash
sudo dnf install nvidia-driver-libs
```
You should receive a message indicating the package is already installed:
```
Package nvidia-driver-libs-3:560.35.05-1.fc39.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
```
## Installing the CUDA Meta-Package
Now that the driver libraries are installed, proceed to install CUDA:
```bash
sudo dnf install cuda
```
This installs the CUDA toolkit and associated packages.
## Configuring the Environment
To use CUDA, add its binary directory to your system's `PATH`.
1. **Create a Profile Script:**
```bash
sudo sh -c 'echo "export PATH=\$PATH:/usr/local/cuda/bin" >> /etc/profile.d/cuda.sh'
```
**Explanation:**
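- We add the script to `/etc/profile.d/` because the `/etc/` directory is unique to this particular container and is not shared with other containers or the host system.
- The backslash `\` before `$PATH` stops the shell from expanding the variable when the command runs, so the literal text `$PATH` is written into the script and expanded only when the script is sourced.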
2. **Make the Script Executable:**
```bash
sudo chmod +x /etc/profile.d/cuda.sh
```
3. **Source the Script to Update Your Environment:**
```bash
source /etc/profile.d/cuda.sh
```
**Note:** This command updates your current shell session with the new `PATH`. The `/etc/profile.d/cuda.sh` script ensures that the CUDA binaries are available in your `PATH` for all future sessions.
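Because the profile script is created with `>>`, repeating step 1 appends a second identical `export` line. A minimal sketch of one way to guard against that is shown here; `append_once` is a hypothetical helper, not something the guide or Fedora provides, and the demonstration writes to a temporary file rather than to `/etc/profile.d/cuda.sh`.

```bash
# Hypothetical helper (not from the guide): append a line to a file only if
# an identical line is not already present.
append_once() {
    line=$1
    file=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a temporary file; calling it twice still leaves one line.
demo_file=$(mktemp)
append_once 'export PATH=$PATH:/usr/local/cuda/bin' "$demo_file"
append_once 'export PATH=$PATH:/usr/local/cuda/bin' "$demo_file"
cat "$demo_file"
rm -f "$demo_file"
```

The same check could be applied to `/etc/profile.d/cuda.sh` from a root shell inside the toolbox.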
## Verifying the Installation
To confirm that CUDA is correctly installed and configured, check the version of the NVIDIA CUDA Compiler (`nvcc`):
```bash
nvcc --version
```
You should see output similar to:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
```
This output confirms that the CUDA compiler is accessible and indicates the installed version.
## Conclusion
You have successfully set up CUDA on Fedora within a toolbox environment using the Fedora 39 CUDA repository. By manually resolving package conflicts and configuring the environment, you can develop CUDA applications without affecting your host system.
## Troubleshooting
- **Installation Failures:**
- If you encounter errors during installation, carefully read the error messages. They often indicate conflicting files or missing dependencies.
- Use the `--excludepath` option with `rpm` to exclude conflicting files during manual installations.
- **Driver Conflicts:**
- Since the host system may already have NVIDIA drivers installed, conflicts can arise. Using the toolbox environment helps isolate these issues.
- **Environment Variables Not Set:**
- If `nvcc` is not found after installation, ensure that `/usr/local/cuda/bin` is in your `PATH`.
- Run `echo $PATH` to check if the path is included.
- Re-source the profile script or open a new terminal session.
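Reading the output of `echo $PATH` by eye is easy to get wrong when the list is long. The check can be sketched as a small shell function; `path_contains` is a hypothetical name introduced only for this illustration.

```bash
# Hypothetical helper: succeeds if the given directory is an exact entry in PATH.
# Wrapping both sides in colons avoids matching a longer directory name by accident.
path_contains() {
    case ":${PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if path_contains /usr/local/cuda/bin; then
    echo "CUDA bin directory is in PATH"
else
    echo "CUDA bin directory is missing from PATH; try: source /etc/profile.d/cuda.sh"
fi
```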
## Additional Notes
- **Updating CUDA in the Future:**
- Keep an eye on the official NVIDIA repositories for updates to your Fedora version.
- When an updated repository becomes available, adjust your `dnf` configuration accordingly.
- **Building `llama.cpp`:**
- With CUDA installed, you can follow these [build instructions for `llama.cpp`](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md) to compile it with CUDA support.
- Ensure that any CUDA-specific build flags or paths are correctly set in your build configuration.
- **Using the Toolbox Environment:**
- The toolbox environment is isolated from your host system, which helps prevent conflicts.
- Remember that system files and configurations inside the toolbox are separate from the host. By default, however, the user's home directory is shared between the host and the toolbox.
---
**Disclaimer:** Manually installing and modifying system packages can lead to instability of the container. The above steps are provided as a guideline and may need adjustments based on your specific system configuration. Always back up important data before making significant system changes, especially as your home folder is writable and shared with the toolbox.
**Acknowledgments:** Special thanks to the Fedora community and NVIDIA documentation for providing resources that assisted in creating this guide.
## References
- [Fedora Toolbox Documentation](https://docs.fedoraproject.org/en-US/fedora-silverblue/toolbox/)
- [NVIDIA CUDA Installation Guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html)
- [Podman Documentation](https://podman.io/get-started)
---
+5 -5
@@ -28,7 +28,7 @@ The required steps to implement for an HF model are:
```python
@Model.register("MyModelForCausalLM")
class MyModel(Model):
model_arch = gguf.MODEL_ARCH.GROK
model_arch = gguf.MODEL_ARCH.MYMODEL
```
2. Define the layout of the GGUF tensors in [constants.py](/gguf-py/gguf/constants.py)
@@ -79,14 +79,14 @@ Depending on the model configuration, tokenizer, code and tensors layout, you wi
- `Model#set_vocab`
- `Model#write_tensors`
NOTE: Tensor names must end with `.weight` suffix, that is the convention and several tools like `quantize` expect this to proceed the weights.
NOTE: Tensor names must end with `.weight` or `.bias` suffixes, that is the convention and several tools like `quantize` expect this to proceed the weights.
### 2. Define the model architecture in `llama.cpp`
The model params and tensors layout must be defined in `llama.cpp`:
1. Define a new `llm_arch`
2. Define the tensors layout in `LLM_TENSOR_NAMES`
3. Add any non standard metadata in `llm_load_hparams`
3. Add any non-standard metadata in `llm_load_hparams`
4. Create the tensors for inference in `llm_load_tensors`
5. If the model has a RoPE operation, add the rope type in `llama_rope_type`
@@ -96,9 +96,9 @@ NOTE: The dimensions in `ggml` are typically in the reverse order of the `pytorc
This is the funniest part, you have to provide the inference graph implementation of the new model architecture in `llama_build_graph`.
Have a look at existing implementation like `build_llama`, `build_dbrx` or `build_bert`.
Have a look at existing implementations like `build_llama`, `build_dbrx` or `build_bert`.
When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support for missing backend operations can be added in another PR.
Some `ggml` backends do not support all operations. Backend implementations can be added in a separate PR.
Note: to debug the inference graph: you can use [llama-eval-callback](/examples/eval-callback/).
@@ -1,4 +1,6 @@
#include "ggml.h"
#include "gguf.h"
#include "llama.h"
#include "common.h"
#include "log.h"
@@ -1,7 +1,9 @@
#include "ggml.h"
#include "gguf.h"
#include "arg.h"
#include "common.h"
#include "llama.h"
#include "ggml.h"
#include "pca.hpp"
#include "mean.hpp"
+4 -2
@@ -1,7 +1,9 @@
#include "arg.h"
#include "common.h"
#include "ggml.h"
#include "ggml-alloc.h"
#include "gguf.h"
#include "arg.h"
#include "common.h"
#include <map>
#include <vector>
+1
@@ -1,4 +1,5 @@
#include "ggml.h"
#include "gguf.h"
#include <cstdlib> /* abort() */
#include <cstddef>
+8 -6
@@ -1,16 +1,18 @@
#include "ggml.h"
#include "gguf.h"
#include "llama.h"
#include "common.h"
#include <algorithm>
#include <cinttypes>
#include <climits>
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>
#include <climits>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#if defined(_WIN32)
#include <windows.h>
@@ -296,7 +298,7 @@ struct split_strategy {
total_size += ggml_nbytes(t);
}
total_size = total_size / 1000 / 1000; // convert to megabytes
printf("split %05d: n_tensors = %d, total_size = %zuM\n", i_split + 1, gguf_get_n_tensors(ctx_out), total_size);
printf("split %05d: n_tensors = %" PRIi64 ", total_size = %zuM\n", i_split + 1, gguf_get_n_tensors(ctx_out), total_size);
i_split++;
}
}
+10 -6
@@ -1,10 +1,9 @@
#include "ggml.h"
#include "gguf.h"
#include <cstdio>
#include <cinttypes>
#include <string>
#include <sstream>
#include <fstream>
#include <vector>
#undef MIN
@@ -135,9 +134,10 @@ static bool gguf_ex_read_0(const std::string & fname) {
for (int i = 0; i < n_tensors; ++i) {
const char * name = gguf_get_tensor_name (ctx, i);
const size_t size = gguf_get_tensor_size (ctx, i);
const size_t offset = gguf_get_tensor_offset(ctx, i);
printf("%s: tensor[%d]: name = %s, offset = %zu\n", __func__, i, name, offset);
printf("%s: tensor[%d]: name = %s, size = %zu, offset = %zu\n", __func__, i, name, size, offset);
}
}
@@ -182,9 +182,10 @@ static bool gguf_ex_read_1(const std::string & fname, bool check_data) {
for (int i = 0; i < n_tensors; ++i) {
const char * name = gguf_get_tensor_name (ctx, i);
const size_t size = gguf_get_tensor_size (ctx, i);
const size_t offset = gguf_get_tensor_offset(ctx, i);
printf("%s: tensor[%d]: name = %s, offset = %zu\n", __func__, i, name, offset);
printf("%s: tensor[%d]: name = %s, size = %zu, offset = %zu\n", __func__, i, name, size, offset);
}
}
@@ -199,7 +200,8 @@ static bool gguf_ex_read_1(const std::string & fname, bool check_data) {
struct ggml_tensor * cur = ggml_get_tensor(ctx_data, name);
printf("%s: tensor[%d]: n_dims = %d, name = %s, data = %p\n", __func__, i, ggml_n_dims(cur), cur->name, cur->data);
printf("%s: tensor[%d]: n_dims = %d, ne = (%d, %d, %d, %d), name = %s, data = %p\n",
__func__, i, ggml_n_dims(cur), int(cur->ne[0]), int(cur->ne[1]), int(cur->ne[2]), int(cur->ne[3]), cur->name, cur->data);
// print first 10 elements
const float * data = (const float *) cur->data;
@@ -215,7 +217,7 @@ static bool gguf_ex_read_1(const std::string & fname, bool check_data) {
const float * data = (const float *) cur->data;
for (int j = 0; j < ggml_nelements(cur); ++j) {
if (data[j] != 100 + i) {
fprintf(stderr, "%s: tensor[%d]: data[%d] = %f\n", __func__, i, j, data[j]);
fprintf(stderr, "%s: tensor[%d], data[%d]: found %f, expected %f\n", __func__, i, j, data[j], float(100 + i));
gguf_free(ctx);
return false;
}
@@ -245,6 +247,8 @@ int main(int argc, char ** argv) {
check_data = false;
}
srand(123456);
const std::string fname(argv[1]);
const std::string mode (argv[2]);
+4 -2
@@ -7,6 +7,7 @@
#include "ggml-cpu.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "gguf.h"
//#ifdef GGML_USE_CUDA
//#include "ggml-cuda.h"
@@ -262,7 +263,7 @@ static std::string gguf_kv_to_str(const struct gguf_context * ctx_gguf, int i) {
{
const enum gguf_type arr_type = gguf_get_arr_type(ctx_gguf, i);
int arr_n = gguf_get_arr_n(ctx_gguf, i);
const void * data = gguf_get_arr_data(ctx_gguf, i);
const void * data = arr_type == GGUF_TYPE_STRING ? nullptr : gguf_get_arr_data(ctx_gguf, i);
std::stringstream ss;
ss << "[";
for (int j = 0; j < arr_n; j++) {
@@ -2734,7 +2735,8 @@ bool clip_model_quantize(const char * fname_inp, const char * fname_out, const i
total_size_org += orig_size;
total_size_new += new_size;
gguf_set_tensor_type(ctx_out, name.c_str(), new_type);
gguf_set_tensor_data(ctx_out, name.c_str(), new_data, new_size);
GGML_ASSERT(gguf_get_tensor_size(ctx_out, gguf_find_tensor(ctx_out, name.c_str())) == new_size);
gguf_set_tensor_data(ctx_out, name.c_str(), new_data);
fout.write((const char *)new_data, new_size);
size_t pad = GGML_PAD(new_size, gguf_get_alignment(ctx_out)) - new_size;
for (size_t j = 0; j < pad; ++j) {
+60 -3
@@ -11,6 +11,8 @@
# include <curl/curl.h>
#endif
#include <signal.h>
#include <climits>
#include <cstdarg>
#include <cstdio>
@@ -25,6 +27,13 @@
#include "json.hpp"
#include "llama-cpp.h"
#if defined(__unix__) || (defined(__APPLE__) && defined(__MACH__)) || defined(_WIN32)
[[noreturn]] static void sigint_handler(int) {
printf("\n");
exit(0); // not ideal, but it's the only way to guarantee exit in all cases
}
#endif
GGML_ATTRIBUTE_FORMAT(1, 2)
static std::string fmt(const char * fmt, ...) {
va_list ap;
@@ -801,7 +810,20 @@ static int generate(LlamaData & llama_data, const std::string & prompt, std::str
static int read_user_input(std::string & user) {
std::getline(std::cin, user);
return user.empty(); // Should have data in happy path
if (std::cin.eof()) {
printf("\n");
return 1;
}
if (user == "/bye") {
return 1;
}
if (user.empty()) {
return 2;
}
return 0; // Should have data in happy path
}
// Function to generate a response based on the prompt
@@ -868,7 +890,25 @@ static bool is_stdout_a_terminal() {
#endif
}
// Function to tokenize the prompt
// Function to handle user input
static int get_user_input(std::string & user_input, const std::string & user) {
while (true) {
const int ret = handle_user_input(user_input, user);
if (ret == 1) {
return 1;
}
if (ret == 2) {
continue;
}
break;
}
return 0;
}
// Main chat loop function
static int chat_loop(LlamaData & llama_data, const std::string & user) {
int prev_len = 0;
llama_data.fmtted.resize(llama_n_ctx(llama_data.context.get()));
@@ -876,7 +916,8 @@ static int chat_loop(LlamaData & llama_data, const std::string & user) {
while (true) {
// Get user input
std::string user_input;
while (handle_user_input(user_input, user)) {
if (get_user_input(user_input, user) == 1) {
return 0;
}
add_message("user", user.empty() ? user_input : user, llama_data);
@@ -917,7 +958,23 @@ static std::string read_pipe_data() {
return result.str();
}
static void ctrl_c_handling() {
#if defined(__unix__) || (defined(__APPLE__) && defined(__MACH__))
struct sigaction sigint_action;
sigint_action.sa_handler = sigint_handler;
sigemptyset(&sigint_action.sa_mask);
sigint_action.sa_flags = 0;
sigaction(SIGINT, &sigint_action, NULL);
#elif defined(_WIN32)
auto console_ctrl_handler = +[](DWORD ctrl_type) -> BOOL {
return (ctrl_type == CTRL_C_EVENT) ? (sigint_handler(SIGINT), true) : false;
};
SetConsoleCtrlHandler(reinterpret_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true);
#endif
}
int main(int argc, const char ** argv) {
ctrl_c_handling();
Opt opt;
const int ret = opt.init(argc, argv);
if (ret == 2) {
-3
@@ -45,10 +45,7 @@ The project is under active development, and we are [looking for feedback and co
| `-ub, --ubatch-size N` | physical maximum batch size (default: 512)<br/>(env: LLAMA_ARG_UBATCH) |
| `--keep N` | number of tokens to keep from the initial prompt (default: 0, -1 = all) |
| `-fa, --flash-attn` | enable Flash Attention (default: disabled)<br/>(env: LLAMA_ARG_FLASH_ATTN) |
| `-p, --prompt PROMPT` | prompt to start generation with |
| `--no-perf` | disable internal libllama performance timings (default: false)<br/>(env: LLAMA_ARG_NO_PERF) |
| `-f, --file FNAME` | a file containing the prompt (default: none) |
| `-bf, --binary-file FNAME` | binary file containing the prompt (default: none) |
| `-e, --escape` | process escapes sequences (\n, \r, \t, \', \", \\) (default: true) |
| `--no-escape` | do not process escape sequences |
| `--rope-scaling {none,linear,yarn}` | RoPE frequency scaling method, defaults to linear unless specified by the model<br/>(env: LLAMA_ARG_ROPE_SCALING_TYPE) |
Binary file not shown.
+38 -34
@@ -62,53 +62,57 @@
<!-- action buttons (top right) -->
<div class="flex items-center">
<div v-if="messages.length > 0" class="dropdown dropdown-end">
<!-- "more" button -->
<!-- "..." button -->
<button tabindex="0" role="button" class="btn m-1" :disabled="isGenerating">
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-three-dots-vertical" viewBox="0 0 16 16">
<path d="M9.5 13a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0m0-5a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0m0-5a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0"/>
</svg>
</button>
<!-- "more" dropdown menu -->
<!-- "delete" dropdown menu -->
<ul tabindex="0" class="dropdown-content menu bg-base-100 rounded-box z-[1] w-52 p-2 shadow">
<li @click="downloadConv(viewingConvId)"><a>Download</a></li>
<li class="text-error" @click="deleteConv(viewingConvId)"><a>Delete</a></li>
</ul>
</div>
<button class="btn" @click="showConfigDialog = true" :disabled="isGenerating">
<!-- settings button -->
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-gear" viewBox="0 0 16 16">
<path d="M8 4.754a3.246 3.246 0 1 0 0 6.492 3.246 3.246 0 0 0 0-6.492M5.754 8a2.246 2.246 0 1 1 4.492 0 2.246 2.246 0 0 1-4.492 0"/>
<path d="M9.796 1.343c-.527-1.79-3.065-1.79-3.592 0l-.094.319a.873.873 0 0 1-1.255.52l-.292-.16c-1.64-.892-3.433.902-2.54 2.541l.159.292a.873.873 0 0 1-.52 1.255l-.319.094c-1.79.527-1.79 3.065 0 3.592l.319.094a.873.873 0 0 1 .52 1.255l-.16.292c-.892 1.64.901 3.434 2.541 2.54l.292-.159a.873.873 0 0 1 1.255.52l.094.319c.527 1.79 3.065 1.79 3.592 0l.094-.319a.873.873 0 0 1 1.255-.52l.292.16c1.64.893 3.434-.902 2.54-2.541l-.159-.292a.873.873 0 0 1 .52-1.255l.319-.094c1.79-.527 1.79-3.065 0-3.592l-.319-.094a.873.873 0 0 1-.52-1.255l.16-.292c.893-1.64-.902-3.433-2.541-2.54l-.292.159a.873.873 0 0 1-1.255-.52zm-2.633.283c.246-.835 1.428-.835 1.674 0l.094.319a1.873 1.873 0 0 0 2.693 1.115l.291-.16c.764-.415 1.6.42 1.184 1.185l-.159.292a1.873 1.873 0 0 0 1.116 2.692l.318.094c.835.246.835 1.428 0 1.674l-.319.094a1.873 1.873 0 0 0-1.115 2.693l.16.291c.415.764-.42 1.6-1.185 1.184l-.291-.159a1.873 1.873 0 0 0-2.693 1.116l-.094.318c-.246.835-1.428.835-1.674 0l-.094-.319a1.873 1.873 0 0 0-2.692-1.115l-.292.16c-.764.415-1.6-.42-1.184-1.185l.159-.291A1.873 1.873 0 0 0 1.945 8.93l-.319-.094c-.835-.246-.835-1.428 0-1.674l.319-.094A1.873 1.873 0 0 0 3.06 4.377l-.16-.292c-.415-.764.42-1.6 1.185-1.184l.292.159a1.873 1.873 0 0 0 2.692-1.115z"/>
</svg>
</button>
<div class="tooltip tooltip-bottom" data-tip="Settings">
<button class="btn" @click="showConfigDialog = true" :disabled="isGenerating">
<!-- settings button -->
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-gear" viewBox="0 0 16 16">
<path d="M8 4.754a3.246 3.246 0 1 0 0 6.492 3.246 3.246 0 0 0 0-6.492M5.754 8a2.246 2.246 0 1 1 4.492 0 2.246 2.246 0 0 1-4.492 0"/>
<path d="M9.796 1.343c-.527-1.79-3.065-1.79-3.592 0l-.094.319a.873.873 0 0 1-1.255.52l-.292-.16c-1.64-.892-3.433.902-2.54 2.541l.159.292a.873.873 0 0 1-.52 1.255l-.319.094c-1.79.527-1.79 3.065 0 3.592l.319.094a.873.873 0 0 1 .52 1.255l-.16.292c-.892 1.64.901 3.434 2.541 2.54l.292-.159a.873.873 0 0 1 1.255.52l.094.319c.527 1.79 3.065 1.79 3.592 0l.094-.319a.873.873 0 0 1 1.255-.52l.292.16c1.64.893 3.434-.902 2.54-2.541l-.159-.292a.873.873 0 0 1 .52-1.255l.319-.094c1.79-.527 1.79-3.065 0-3.592l-.319-.094a.873.873 0 0 1-.52-1.255l.16-.292c.893-1.64-.902-3.433-2.541-2.54l-.292.159a.873.873 0 0 1-1.255-.52zm-2.633.283c.246-.835 1.428-.835 1.674 0l.094.319a1.873 1.873 0 0 0 2.693 1.115l.291-.16c.764-.415 1.6.42 1.184 1.185l-.159.292a1.873 1.873 0 0 0 1.116 2.692l.318.094c.835.246.835 1.428 0 1.674l-.319.094a1.873 1.873 0 0 0-1.115 2.693l.16.291c.415.764-.42 1.6-1.185 1.184l-.291-.159a1.873 1.873 0 0 0-2.693 1.116l-.094.318c-.246.835-1.428.835-1.674 0l-.094-.319a1.873 1.873 0 0 0-2.692-1.115l-.292.16c-.764.415-1.6-.42-1.184-1.185l.159-.291A1.873 1.873 0 0 0 1.945 8.93l-.319-.094c-.835-.246-.835-1.428 0-1.674l.319-.094A1.873 1.873 0 0 0 3.06 4.377l-.16-.292c-.415-.764.42-1.6 1.185-1.184l.292.159a1.873 1.873 0 0 0 2.692-1.115z"/>
</svg>
</button>
</div>
<!-- theme controller is copied from https://daisyui.com/components/theme-controller/ -->
<div class="dropdown dropdown-end dropdown-bottom">
<div tabindex="0" role="button" class="btn m-1">
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-palette2" viewBox="0 0 16 16">
<path d="M0 .5A.5.5 0 0 1 .5 0h5a.5.5 0 0 1 .5.5v5.277l4.147-4.131a.5.5 0 0 1 .707 0l3.535 3.536a.5.5 0 0 1 0 .708L10.261 10H15.5a.5.5 0 0 1 .5.5v5a.5.5 0 0 1-.5.5H3a3 3 0 0 1-2.121-.879A3 3 0 0 1 0 13.044m6-.21 7.328-7.3-2.829-2.828L6 7.188zM4.5 13a1.5 1.5 0 1 0-3 0 1.5 1.5 0 0 0 3 0M15 15v-4H9.258l-4.015 4zM0 .5v12.495zm0 12.495V13z"/>
</svg>
<div class="tooltip tooltip-bottom" data-tip="Themes">
<div class="dropdown dropdown-end dropdown-bottom">
<div tabindex="0" role="button" class="btn m-1">
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-palette2" viewBox="0 0 16 16">
<path d="M0 .5A.5.5 0 0 1 .5 0h5a.5.5 0 0 1 .5.5v5.277l4.147-4.131a.5.5 0 0 1 .707 0l3.535 3.536a.5.5 0 0 1 0 .708L10.261 10H15.5a.5.5 0 0 1 .5.5v5a.5.5 0 0 1-.5.5H3a3 3 0 0 1-2.121-.879A3 3 0 0 1 0 13.044m6-.21 7.328-7.3-2.829-2.828L6 7.188zM4.5 13a1.5 1.5 0 1 0-3 0 1.5 1.5 0 0 0 3 0M15 15v-4H9.258l-4.015 4zM0 .5v12.495zm0 12.495V13z"/>
</svg>
</div>
<ul tabindex="0" class="dropdown-content bg-base-300 rounded-box z-[1] w-52 p-2 shadow-2xl h-80 overflow-y-auto">
<li>
<button
class="btn btn-sm btn-block btn-ghost justify-start"
:class="{ 'btn-active': selectedTheme === 'auto' }"
@click="setSelectedTheme('auto')">
auto
</button>
</li>
<li v-for="theme in themes">
<input
type="radio"
name="theme-dropdown"
class="theme-controller btn btn-sm btn-block btn-ghost justify-start"
:aria-label="theme"
:value="theme"
:checked="selectedTheme === theme"
@click="setSelectedTheme(theme)" />
</li>
</ul>
</div>
<ul tabindex="0" class="dropdown-content bg-base-300 rounded-box z-[1] w-52 p-2 shadow-2xl h-80 overflow-y-auto">
<li>
<button
class="btn btn-sm btn-block btn-ghost justify-start"
:class="{ 'btn-active': selectedTheme === 'auto' }"
@click="setSelectedTheme('auto')">
auto
</button>
</li>
<li v-for="theme in themes">
<input
type="radio"
name="theme-dropdown"
class="theme-controller btn btn-sm btn-block btn-ghost justify-start"
:aria-label="theme"
:value="theme"
:checked="selectedTheme === theme"
@click="setSelectedTheme(theme)" />
</li>
</ul>
</div>
</div>
</div>
+2 -1
@@ -243,7 +243,8 @@ set(GGML_PUBLIC_HEADERS
include/ggml-metal.h
include/ggml-rpc.h
include/ggml-sycl.h
include/ggml-vulkan.h)
include/ggml-vulkan.h
include/gguf.h)
set_target_properties(ggml PROPERTIES PUBLIC_HEADER "${GGML_PUBLIC_HEADERS}")
#if (GGML_METAL)
+1
@@ -7,6 +7,7 @@
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "gguf.h"
#include <memory>
// Smart pointers for ggml types
+10 -140
@@ -241,12 +241,6 @@
#define GGML_ROPE_TYPE_MROPE 8
#define GGML_ROPE_TYPE_VISION 24
#define GGUF_MAGIC "GGUF"
#define GGUF_VERSION 3
#define GGUF_DEFAULT_ALIGNMENT 32
#define GGML_UNUSED(x) (void)(x)
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))
@@ -403,12 +397,6 @@ extern "C" {
GGML_PREC_F32,
};
enum ggml_backend_type {
GGML_BACKEND_TYPE_CPU = 0,
GGML_BACKEND_TYPE_GPU = 10,
GGML_BACKEND_TYPE_GPU_SPLIT = 20,
};
// model file types
enum ggml_ftype {
GGML_FTYPE_UNKNOWN = -1,
@@ -513,6 +501,7 @@ extern "C" {
GGML_OP_GET_REL_POS,
GGML_OP_ADD_REL_POS,
GGML_OP_RWKV_WKV6,
GGML_OP_GATED_LINEAR_ATTN,
GGML_OP_UNARY,
@@ -587,8 +576,6 @@ extern "C" {
struct ggml_tensor {
enum ggml_type type;
GGML_DEPRECATED(enum ggml_backend_type backend, "use the buffer type to find the storage location of the tensor");
struct ggml_backend_buffer * buffer;
int64_t ne[GGML_MAX_DIMS]; // number of elements
@@ -1873,6 +1860,15 @@ extern "C" {
struct ggml_tensor * td,
struct ggml_tensor * state);
GGML_API struct ggml_tensor * ggml_gated_linear_attn(
struct ggml_context * ctx,
struct ggml_tensor * k,
struct ggml_tensor * v,
struct ggml_tensor * q,
struct ggml_tensor * g,
struct ggml_tensor * state,
float scale);
// custom operators
typedef void (*ggml_unary_op_f32_t) (const int, float *, const float *);
@@ -2111,132 +2107,6 @@ extern "C" {
int64_t n_per_row,
const float * imatrix);
//
// gguf
//
enum gguf_type {
GGUF_TYPE_UINT8 = 0,
GGUF_TYPE_INT8 = 1,
GGUF_TYPE_UINT16 = 2,
GGUF_TYPE_INT16 = 3,
GGUF_TYPE_UINT32 = 4,
GGUF_TYPE_INT32 = 5,
GGUF_TYPE_FLOAT32 = 6,
GGUF_TYPE_BOOL = 7,
GGUF_TYPE_STRING = 8,
GGUF_TYPE_ARRAY = 9,
GGUF_TYPE_UINT64 = 10,
GGUF_TYPE_INT64 = 11,
GGUF_TYPE_FLOAT64 = 12,
GGUF_TYPE_COUNT, // marks the end of the enum
};
struct gguf_context;
struct gguf_init_params {
bool no_alloc;
// if not NULL, create a ggml_context and allocate the tensor data in it
struct ggml_context ** ctx;
};
GGML_API struct gguf_context * gguf_init_empty(void);
GGML_API struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_params params);
//GGML_API struct gguf_context * gguf_init_from_buffer(..);
GGML_API void gguf_free(struct gguf_context * ctx);
GGML_API const char * gguf_type_name(enum gguf_type type);
GGML_API int gguf_get_version (const struct gguf_context * ctx);
GGML_API size_t gguf_get_alignment (const struct gguf_context * ctx);
GGML_API size_t gguf_get_data_offset(const struct gguf_context * ctx);
GGML_API void * gguf_get_data (const struct gguf_context * ctx);
GGML_API int gguf_get_n_kv(const struct gguf_context * ctx);
GGML_API int gguf_find_key(const struct gguf_context * ctx, const char * key);
GGML_API const char * gguf_get_key (const struct gguf_context * ctx, int key_id);
GGML_API enum gguf_type gguf_get_kv_type (const struct gguf_context * ctx, int key_id);
GGML_API enum gguf_type gguf_get_arr_type(const struct gguf_context * ctx, int key_id);
// will abort if the wrong type is used for the key
GGML_API uint8_t gguf_get_val_u8 (const struct gguf_context * ctx, int key_id);
GGML_API int8_t gguf_get_val_i8 (const struct gguf_context * ctx, int key_id);
GGML_API uint16_t gguf_get_val_u16 (const struct gguf_context * ctx, int key_id);
GGML_API int16_t gguf_get_val_i16 (const struct gguf_context * ctx, int key_id);
GGML_API uint32_t gguf_get_val_u32 (const struct gguf_context * ctx, int key_id);
GGML_API int32_t gguf_get_val_i32 (const struct gguf_context * ctx, int key_id);
GGML_API float gguf_get_val_f32 (const struct gguf_context * ctx, int key_id);
GGML_API uint64_t gguf_get_val_u64 (const struct gguf_context * ctx, int key_id);
GGML_API int64_t gguf_get_val_i64 (const struct gguf_context * ctx, int key_id);
GGML_API double gguf_get_val_f64 (const struct gguf_context * ctx, int key_id);
GGML_API bool gguf_get_val_bool(const struct gguf_context * ctx, int key_id);
GGML_API const char * gguf_get_val_str (const struct gguf_context * ctx, int key_id);
GGML_API const void * gguf_get_val_data(const struct gguf_context * ctx, int key_id);
GGML_API int gguf_get_arr_n (const struct gguf_context * ctx, int key_id);
GGML_API const void * gguf_get_arr_data(const struct gguf_context * ctx, int key_id);
GGML_API const char * gguf_get_arr_str (const struct gguf_context * ctx, int key_id, int i);
GGML_API int gguf_get_n_tensors (const struct gguf_context * ctx);
GGML_API int gguf_find_tensor (const struct gguf_context * ctx, const char * name);
GGML_API size_t gguf_get_tensor_offset(const struct gguf_context * ctx, int i);
GGML_API char * gguf_get_tensor_name (const struct gguf_context * ctx, int i);
GGML_API enum ggml_type gguf_get_tensor_type (const struct gguf_context * ctx, int i);
// removes key if it exists
GGML_API void gguf_remove_key(struct gguf_context * ctx, const char * key);
// overrides existing values or adds a new one
GGML_API void gguf_set_val_u8 (struct gguf_context * ctx, const char * key, uint8_t val);
GGML_API void gguf_set_val_i8 (struct gguf_context * ctx, const char * key, int8_t val);
GGML_API void gguf_set_val_u16 (struct gguf_context * ctx, const char * key, uint16_t val);
GGML_API void gguf_set_val_i16 (struct gguf_context * ctx, const char * key, int16_t val);
GGML_API void gguf_set_val_u32 (struct gguf_context * ctx, const char * key, uint32_t val);
GGML_API void gguf_set_val_i32 (struct gguf_context * ctx, const char * key, int32_t val);
GGML_API void gguf_set_val_f32 (struct gguf_context * ctx, const char * key, float val);
GGML_API void gguf_set_val_u64 (struct gguf_context * ctx, const char * key, uint64_t val);
GGML_API void gguf_set_val_i64 (struct gguf_context * ctx, const char * key, int64_t val);
GGML_API void gguf_set_val_f64 (struct gguf_context * ctx, const char * key, double val);
GGML_API void gguf_set_val_bool(struct gguf_context * ctx, const char * key, bool val);
GGML_API void gguf_set_val_str (struct gguf_context * ctx, const char * key, const char * val);
GGML_API void gguf_set_arr_data(struct gguf_context * ctx, const char * key, enum gguf_type type, const void * data, int n);
GGML_API void gguf_set_arr_str (struct gguf_context * ctx, const char * key, const char ** data, int n);
// set or add KV pairs from another context
GGML_API void gguf_set_kv(struct gguf_context * ctx, struct gguf_context * src);
// manage tensor info
GGML_API void gguf_add_tensor(struct gguf_context * ctx, const struct ggml_tensor * tensor);
GGML_API void gguf_set_tensor_type(struct gguf_context * ctx, const char * name, enum ggml_type type);
GGML_API void gguf_set_tensor_data(struct gguf_context * ctx, const char * name, const void * data, size_t size);
// writing gguf files can be done in 2 ways:
//
// - write the entire gguf_context to a binary file in a single pass:
//
// gguf_write_to_file(ctx, fname);
//
// - first prepare a file with a placeholder for the meta data, write the tensor data, then write the meta data:
//
// FILE * f = fopen(fname, "wb");
// fseek(f, gguf_get_meta_size(ctx), SEEK_SET);
// fwrite(f, ...);
// void * data = gguf_meta_get_meta_data(ctx);
// fseek(f, 0, SEEK_SET);
// fwrite(f, data, gguf_get_meta_size(ctx));
// free(data);
// fclose(f);
//
// write the entire context to a binary file
GGML_API void gguf_write_to_file(const struct gguf_context * ctx, const char * fname, bool only_meta);
// get the size in bytes of the meta data (header, kv pairs, tensor info) including padding
GGML_API size_t gguf_get_meta_size(const struct gguf_context * ctx);
GGML_API void gguf_get_meta_data(const struct gguf_context * ctx, void * data);
#ifdef __cplusplus
// restrict not standard in C++
# if defined(__GNUC__)
+202
@@ -0,0 +1,202 @@
// This file contains functionality related to "GGUF" files, the binary file format used by ggml.
// GGUF files have the following structure:
//
// 1. File magic "GGUF" (4 bytes).
// 2. File version (uint32_t).
// 3. Number of ggml tensors in file (int64_t).
// 4. Number of key-value-pairs in file (int64_t).
// 5. For each KV pair:
// 1. The key (string).
// 2. The value type (gguf_type).
// 3a. If the value type is GGUF_TYPE_ARRAY:
// 1. The type of the array (gguf_type).
// 2. The number of elements in the array (uint64_t).
// 3. The binary representation of each element in the array.
// 3b. Otherwise:
// 1. The binary representation of the value.
// 6. For each ggml tensor:
// 1. The tensor name (string).
// 2. The number of dimensions of the tensor (uint32_t).
// 3. For each dimension:
// 1. The size of the tensor in the dimension (int64_t).
// 4. The tensor data type (ggml_type).
// 5. The tensor data offset in the tensor data binary blob (uint64_t).
// 7. The tensor data binary blob (optional, aligned).
//
// Strings are serialized as the string length (uint64_t) followed by the C string without the null terminator.
// All enums are stored as int32_t.
// All bool values are stored as int8_t.
// If the special key "general.alignment" (uint32_t) is defined it is used for alignment,
// otherwise GGUF_DEFAULT_ALIGNMENT is used.
//
// Module maintainer: Johannes Gäßler (@JohannesGaessler, johannesg@5d6.de)
#pragma once
#include "ggml.h"
#include <stdbool.h>
#include <stdint.h>
#define GGUF_MAGIC "GGUF"
#define GGUF_VERSION 3
#define GGUF_KEY_GENERAL_ALIGNMENT "general.alignment"
#define GGUF_DEFAULT_ALIGNMENT 32
#ifdef __cplusplus
extern "C" {
#endif
// types that can be stored as GGUF KV data
enum gguf_type {
GGUF_TYPE_UINT8 = 0,
GGUF_TYPE_INT8 = 1,
GGUF_TYPE_UINT16 = 2,
GGUF_TYPE_INT16 = 3,
GGUF_TYPE_UINT32 = 4,
GGUF_TYPE_INT32 = 5,
GGUF_TYPE_FLOAT32 = 6,
GGUF_TYPE_BOOL = 7,
GGUF_TYPE_STRING = 8,
GGUF_TYPE_ARRAY = 9,
GGUF_TYPE_UINT64 = 10,
GGUF_TYPE_INT64 = 11,
GGUF_TYPE_FLOAT64 = 12,
GGUF_TYPE_COUNT, // marks the end of the enum
};
struct gguf_context;
struct gguf_init_params {
bool no_alloc;
// if not NULL, create a ggml_context and allocate the tensor data in it
struct ggml_context ** ctx;
};
GGML_API struct gguf_context * gguf_init_empty(void);
GGML_API struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_params params);
//GGML_API struct gguf_context * gguf_init_from_buffer(..);
GGML_API void gguf_free(struct gguf_context * ctx);
GGML_API const char * gguf_type_name(enum gguf_type type);
GGML_API uint32_t gguf_get_version (const struct gguf_context * ctx);
GGML_API size_t gguf_get_alignment (const struct gguf_context * ctx);
GGML_API size_t gguf_get_data_offset(const struct gguf_context * ctx);
GGML_API int64_t gguf_get_n_kv(const struct gguf_context * ctx);
GGML_API int64_t gguf_find_key(const struct gguf_context * ctx, const char * key); // returns -1 if key is not found
GGML_API const char * gguf_get_key (const struct gguf_context * ctx, int64_t key_id);
GGML_API enum gguf_type gguf_get_kv_type (const struct gguf_context * ctx, int64_t key_id);
GGML_API enum gguf_type gguf_get_arr_type(const struct gguf_context * ctx, int64_t key_id);
// will abort if the wrong type is used for the key
GGML_API uint8_t gguf_get_val_u8 (const struct gguf_context * ctx, int64_t key_id);
GGML_API int8_t gguf_get_val_i8 (const struct gguf_context * ctx, int64_t key_id);
GGML_API uint16_t gguf_get_val_u16 (const struct gguf_context * ctx, int64_t key_id);
GGML_API int16_t gguf_get_val_i16 (const struct gguf_context * ctx, int64_t key_id);
GGML_API uint32_t gguf_get_val_u32 (const struct gguf_context * ctx, int64_t key_id);
GGML_API int32_t gguf_get_val_i32 (const struct gguf_context * ctx, int64_t key_id);
GGML_API float gguf_get_val_f32 (const struct gguf_context * ctx, int64_t key_id);
GGML_API uint64_t gguf_get_val_u64 (const struct gguf_context * ctx, int64_t key_id);
GGML_API int64_t gguf_get_val_i64 (const struct gguf_context * ctx, int64_t key_id);
GGML_API double gguf_get_val_f64 (const struct gguf_context * ctx, int64_t key_id);
GGML_API bool gguf_get_val_bool(const struct gguf_context * ctx, int64_t key_id);
GGML_API const char * gguf_get_val_str (const struct gguf_context * ctx, int64_t key_id);
GGML_API const void * gguf_get_val_data(const struct gguf_context * ctx, int64_t key_id);
GGML_API size_t gguf_get_arr_n (const struct gguf_context * ctx, int64_t key_id);
// get raw pointer to the first element of the array with the given key_id
// for bool arrays, note that they are always stored as int8 on all platforms (usually this makes no difference)
GGML_API const void * gguf_get_arr_data(const struct gguf_context * ctx, int64_t key_id);
// get ith C string from array with given key_id
GGML_API const char * gguf_get_arr_str (const struct gguf_context * ctx, int64_t key_id, size_t i);
GGML_API int64_t gguf_get_n_tensors (const struct gguf_context * ctx);
GGML_API int64_t gguf_find_tensor (const struct gguf_context * ctx, const char * name); // returns -1 if the tensor is not found
GGML_API size_t gguf_get_tensor_offset(const struct gguf_context * ctx, int64_t tensor_id);
GGML_API const char * gguf_get_tensor_name (const struct gguf_context * ctx, int64_t tensor_id);
GGML_API enum ggml_type gguf_get_tensor_type (const struct gguf_context * ctx, int64_t tensor_id);
GGML_API size_t gguf_get_tensor_size (const struct gguf_context * ctx, int64_t tensor_id);
// removes key if it exists, returns id that the key had prior to removal (-1 if it didn't exist)
GGML_API int64_t gguf_remove_key(struct gguf_context * ctx, const char * key);
// overrides an existing KV pair or adds a new one, the new KV pair is always at the back
GGML_API void gguf_set_val_u8 (struct gguf_context * ctx, const char * key, uint8_t val);
GGML_API void gguf_set_val_i8 (struct gguf_context * ctx, const char * key, int8_t val);
GGML_API void gguf_set_val_u16 (struct gguf_context * ctx, const char * key, uint16_t val);
GGML_API void gguf_set_val_i16 (struct gguf_context * ctx, const char * key, int16_t val);
GGML_API void gguf_set_val_u32 (struct gguf_context * ctx, const char * key, uint32_t val);
GGML_API void gguf_set_val_i32 (struct gguf_context * ctx, const char * key, int32_t val);
GGML_API void gguf_set_val_f32 (struct gguf_context * ctx, const char * key, float val);
GGML_API void gguf_set_val_u64 (struct gguf_context * ctx, const char * key, uint64_t val);
GGML_API void gguf_set_val_i64 (struct gguf_context * ctx, const char * key, int64_t val);
GGML_API void gguf_set_val_f64 (struct gguf_context * ctx, const char * key, double val);
GGML_API void gguf_set_val_bool(struct gguf_context * ctx, const char * key, bool val);
GGML_API void gguf_set_val_str (struct gguf_context * ctx, const char * key, const char * val);
// creates a new array with n elements of the given type and copies the corresponding number of bytes from data
GGML_API void gguf_set_arr_data(struct gguf_context * ctx, const char * key, enum gguf_type type, const void * data, size_t n);
// creates a new array with n strings and copies the corresponding strings from data
GGML_API void gguf_set_arr_str (struct gguf_context * ctx, const char * key, const char ** data, size_t n);
// set or add KV pairs from another context
GGML_API void gguf_set_kv(struct gguf_context * ctx, const struct gguf_context * src);
// add tensor to GGUF context, tensor name must be unique
GGML_API void gguf_add_tensor(struct gguf_context * ctx, const struct ggml_tensor * tensor);
// after changing a tensor's type, the offsets of all tensors with higher indices are immediately recalculated
// in such a way that the tensor data remains as one contiguous block (except for padding)
GGML_API void gguf_set_tensor_type(struct gguf_context * ctx, const char * name, enum ggml_type type);
// assumes that at least gguf_get_tensor_size bytes can be read from data
GGML_API void gguf_set_tensor_data(struct gguf_context * ctx, const char * name, const void * data);
// writing gguf files can be done in 3 ways:
//
// - write the entire gguf_context to a binary file in a single pass:
//
// gguf_write_to_file(ctx, fname, /*only_meta =*/ false);
//
// - write only the meta data to a file, then re-open the file and append the tensor data:
//
// gguf_write_to_file(ctx, fname, /*only_meta =*/ true);
// FILE * f = fopen(fname, "ab");
// fwrite(..., f); // write tensor data
// fclose(f);
//
// - first prepare a file with a placeholder for the meta data, write the tensor data, then write the meta data:
//
// FILE * f = fopen(fname, "wb");
// const size_t size_meta = gguf_get_meta_size(ctx);
// fseek(f, size_meta, SEEK_SET);
// fwrite(..., f); // write tensor data
// void * data = malloc(size_meta);
// gguf_get_meta_data(ctx, data);
// rewind(f);
// fwrite(data, 1, size_meta, f);
// free(data);
// fclose(f);
//
// write the entire context to a binary file
GGML_API bool gguf_write_to_file(const struct gguf_context * ctx, const char * fname, bool only_meta);
// get the size in bytes of the meta data (header, kv pairs, tensor info) including padding
GGML_API size_t gguf_get_meta_size(const struct gguf_context * ctx);
// writes the meta data to pointer "data"
GGML_API void gguf_get_meta_data(const struct gguf_context * ctx, void * data);
#ifdef __cplusplus
}
#endif
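Review note: the third writing strategy described in gguf.h (reserve space for the meta data, write the tensor data, then go back and fill in the meta data) can be sketched with plain stdio. This is only an illustration under stated assumptions: `write_data_then_meta` is a hypothetical helper, `meta` stands in for the bytes that `gguf_get_meta_data` would produce, and `tensor_data` stands in for the tensor blob.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper, not part of the gguf API.
// Returns false if any I/O step fails.
static bool write_data_then_meta(const char * fname,
                                 const std::vector<uint8_t> & meta,
                                 const std::vector<uint8_t> & tensor_data) {
    assert(fname != nullptr);
    FILE * f = std::fopen(fname, "wb");
    if (!f) {
        return false;
    }
    bool ok = true;
    // 1. skip the region reserved for the meta data
    ok = ok && std::fseek(f, static_cast<long>(meta.size()), SEEK_SET) == 0;
    // 2. write the tensor data behind the placeholder
    ok = ok && std::fwrite(tensor_data.data(), 1, tensor_data.size(), f) == tensor_data.size();
    // 3. return to the start and fill in the meta data
    ok = ok && std::fseek(f, 0, SEEK_SET) == 0;
    ok = ok && std::fwrite(meta.data(), 1, meta.size(), f) == meta.size();
    // always close the file, even after an earlier failure
    ok = (std::fclose(f) == 0) && ok;
    return ok;
}
```

The number of bytes passed to the final `fwrite` is the meta data size, which is what `gguf_get_meta_size` reports in the real API.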
+3 -1
@@ -208,6 +208,7 @@ add_library(ggml-base
../include/ggml-backend.h
../include/ggml-cpp.h
../include/ggml-opt.h
../include/gguf.h
ggml.c
ggml-alloc.c
ggml-backend.cpp
@@ -215,7 +216,8 @@ add_library(ggml-base
ggml-threading.cpp
ggml-threading.h
ggml-quants.c
-ggml-quants.h)
+ggml-quants.h
+gguf.cpp)
target_include_directories(ggml-base PRIVATE .)
+5
@@ -574,4 +574,9 @@ void ggml_backend_load_all_from_path(const char * dir_path) {
ggml_backend_load_best("opencl", silent, dir_path);
ggml_backend_load_best("musa", silent, dir_path);
ggml_backend_load_best("cpu", silent, dir_path);
// check the environment variable GGML_BACKEND_PATH to load an out-of-tree backend
const char * backend_path = std::getenv("GGML_BACKEND_PATH");
if (backend_path) {
ggml_backend_load(backend_path);
}
}
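Review note: the hunk above reads `GGML_BACKEND_PATH` with `std::getenv` and only acts when the result is non-null. A small sketch of that lookup pattern is shown below. The helper `env_path` is hypothetical, and treating an empty value as "not set" is a choice of this sketch; the hunk itself checks only for a null pointer.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: copy the value of environment variable `name` into `out`.
// Returns false (and leaves `out` untouched) if the variable is unset or empty.
static bool env_path(const char * name, std::string & out) {
    assert(name != nullptr);
    const char * value = std::getenv(name);
    if (value == nullptr || value[0] == '\0') {
        return false;
    }
    out = value;
    return true;
}
```

A caller would pass the resulting string on to a loader only when the function returns true, mirroring the `if (backend_path)` guard in the diff.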
+1 -1
@@ -764,7 +764,7 @@ static int ggml_backend_sched_backend_id_from_cur(ggml_backend_sched_t sched, st
if (tensor->op != GGML_OP_ROPE && src->buffer != NULL && src->buffer->usage == GGML_BACKEND_BUFFER_USAGE_WEIGHTS) {
int src_backend_id = ggml_backend_sched_backend_from_buffer(sched, src, tensor);
// check if a backend with higher prio wants to offload the op
-if (src_backend_id == sched->n_backends - 1) {
+if (src_backend_id == sched->n_backends - 1 && ggml_backend_buffer_is_host(src->buffer)) {
for (int b = 0; b < src_backend_id; b++) {
if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
SET_CAUSE(tensor, "1.off");
+2
@@ -4169,6 +4169,8 @@ static ggml_backend_buffer_t ggml_backend_cpu_aarch64_buffer_type_alloc_buffer(g
buffer->buft = buft;
buffer->iface.init_tensor = ggml_backend_cpu_aarch64_buffer_init_tensor;
buffer->iface.set_tensor = ggml_backend_cpu_aarch64_buffer_set_tensor;
buffer->iface.get_tensor = nullptr;
buffer->iface.cpy_tensor = nullptr;
return buffer;
}
+198 -2
@@ -11803,9 +11803,9 @@ static void ggml_compute_forward_add_rel_pos(
static void ggml_compute_forward_rwkv_wkv6_f32(
const struct ggml_compute_params * params,
struct ggml_tensor * dst) {
-const int64_t T = dst->src[1]->ne[3];
+const int64_t T = dst->src[1]->ne[2];
const int64_t C = dst->ne[0];
-const int64_t HEADS = dst->src[1]->ne[2];
+const int64_t HEADS = dst->src[1]->ne[1];
const int64_t n_seqs = dst->src[5]->ne[1];
const int64_t head_size = C / HEADS;
@@ -12000,6 +12000,197 @@ static void ggml_compute_forward_rwkv_wkv6(
}
}
// ggml_compute_forward_gla
static void ggml_compute_forward_gla_f32(
const struct ggml_compute_params * params,
struct ggml_tensor * dst) {
const int64_t T = dst->src[1]->ne[2];
const int64_t C = dst->ne[0];
const int64_t HEADS = dst->src[1]->ne[1];
const int64_t n_seqs = dst->src[4]->ne[1];
const int64_t head_size = C / HEADS;
const float scale = ggml_get_op_params_f32(dst, 0);
float * dst_data = (float *) dst->data;
float * state = ((float *) dst->data) + C * T;
const int ith = params->ith;
const int nth = params->nth;
if (ith >= HEADS) {
return;
}
const int h_start = (HEADS * ith) / nth;
const int h_end = ((HEADS * (ith + 1)) / nth < HEADS) ?
(HEADS * (ith + 1)) / nth : HEADS;
float * k = (float *) dst->src[0]->data;
float * v = (float *) dst->src[1]->data;
float * q = (float *) dst->src[2]->data;
float * g = (float *) dst->src[3]->data;
size_t t_stride = HEADS * head_size; // same as C
size_t h_stride = C / HEADS;
GGML_ASSERT(C % HEADS == 0); // C must be divisible by HEADS
size_t h_stride_2d = head_size * head_size;
if (ith == 0) {
memset(dst_data, 0, T * C * sizeof(float));
}
ggml_barrier(params->threadpool);
#if defined(__AVX__) && !defined(__AVX512F__)
#define GGML_F32X GGML_F32x8
#define GGML_F32X_SET1 GGML_F32x8_SET1
#define GGML_F32X_LOAD GGML_F32x8_LOAD
#define GGML_F32X_STORE GGML_F32x8_STORE
#define GGML_F32X_MUL GGML_F32x8_MUL
#define GGML_F32X_FMA GGML_F32x8_FMA
#define GLA_VECTOR_SIZE 8
#elif defined(__AVX512F__)
#define GGML_F32X GGML_F32x16
#define GGML_F32X_SET1 GGML_F32x16_SET1
#define GGML_F32X_LOAD GGML_F32x16_LOAD
#define GGML_F32X_STORE GGML_F32x16_STORE
#define GGML_F32X_MUL GGML_F32x16_MUL
#define GGML_F32X_FMA GGML_F32x16_FMA
#define GLA_VECTOR_SIZE 16
#elif defined(__ARM_NEON) && defined(__aarch64__)
#define GGML_F32X GGML_F32x4
#define GGML_F32X_SET1 GGML_F32x4_SET1
#define GGML_F32X_LOAD GGML_F32x4_LOAD
#define GGML_F32X_STORE GGML_F32x4_STORE
#define GGML_F32X_MUL GGML_F32x4_MUL
#define GGML_F32X_FMA GGML_F32x4_FMA
#define GLA_VECTOR_SIZE 4
#endif
#ifdef GLA_VECTOR_SIZE
const int64_t vec_count = head_size / GLA_VECTOR_SIZE;
for (int64_t t = 0; t < T; t++) {
size_t t_offset = t * t_stride;
size_t state_offset = head_size * C * (t / (T / n_seqs));
float * state_cur = state + state_offset;
float * state_prev = t % (T / n_seqs) ? state_cur : (float*)dst->src[4]->data + state_offset;
for (int64_t h = h_start; h < h_end; h++) {
size_t h_offset = h * h_stride;
size_t t_h_offset = t_offset + h_offset;
size_t h_2d_offset = h * h_stride_2d;
for (int64_t i = 0; i < head_size; i++) {
size_t t_h_i_offset = t_h_offset + i;
size_t h_2d_i_offset = h_2d_offset + i * h_stride;
float k_val = k[t_h_i_offset];
float q_val = q[t_h_i_offset] * scale;
float g_val = g[t_h_i_offset];
// Broadcast scalar values to vectors
GGML_F32X k_vec = GGML_F32X_SET1(k_val);
GGML_F32X q_vec = GGML_F32X_SET1(q_val);
GGML_F32X g_vec = GGML_F32X_SET1(g_val);
for (int64_t j = 0; j < vec_count; j++) {
size_t base_j = j * GLA_VECTOR_SIZE;
size_t t_h_j_offset = t_h_offset + base_j;
size_t h_2d_i_j_offset = h_2d_i_offset + base_j;
// Load GLA_VECTOR_SIZE elements at once
GGML_F32X v_vec = GGML_F32X_LOAD(&v[t_h_j_offset]);
GGML_F32X prev_state_vec = GGML_F32X_LOAD(&state_prev[h_2d_i_j_offset]);
GGML_F32X dst_vec = GGML_F32X_LOAD(&dst_data[t_h_j_offset]);
// Compute kv = v * k
GGML_F32X kv_vec = GGML_F32X_MUL(v_vec, k_vec);
// Compute temp = prev_state * g + kv
GGML_F32X temp_vec = GGML_F32X_FMA(kv_vec, prev_state_vec, g_vec);
// Update dst: dst += temp * q
dst_vec = GGML_F32X_FMA(dst_vec, temp_vec, q_vec);
GGML_F32X_STORE(&dst_data[t_h_j_offset], dst_vec);
// Update state
GGML_F32X_STORE(&state_cur[h_2d_i_j_offset], temp_vec);
}
// Handle remaining elements; not expected to run, since head_size is normally a multiple of GLA_VECTOR_SIZE.
for (int64_t j = vec_count * GLA_VECTOR_SIZE; j < head_size; j++) {
size_t t_h_j_offset = t_h_offset + j;
size_t h_2d_i_j_offset = h_2d_i_offset + j;
float v_val = v[t_h_j_offset];
float kv_val = v_val * k_val;
float prev_state_val = state_prev[h_2d_i_j_offset];
float temp_val = kv_val + prev_state_val * g_val;
dst_data[t_h_j_offset] += temp_val * q_val;
state_cur[h_2d_i_j_offset] = temp_val;
}
}
}
}
#else
for (int64_t t = 0; t < T; t++) {
size_t t_offset = t * t_stride;
size_t state_offset = head_size * C * (t / (T / n_seqs));
float * state_cur = state + state_offset;
float * state_prev = t % (T / n_seqs) ? state_cur : (float*)dst->src[4]->data + state_offset;
for (int64_t h = h_start; h < h_end; h++) {
size_t h_offset = h * h_stride;
size_t t_h_offset = t_offset + h_offset;
size_t h_2d_offset = h * h_stride_2d;
for (int64_t i = 0; i < head_size; i++) {
size_t t_h_i_offset = t_h_offset + i;
size_t h_2d_i_offset = h_2d_offset + i * h_stride;
float k_val = k[t_h_i_offset];
float q_val = q[t_h_i_offset] * scale;
float g_val = g[t_h_i_offset];
for (int64_t j = 0; j < head_size; j++) {
size_t t_h_j_offset = t_h_offset + j;
size_t h_2d_i_j_offset = h_2d_i_offset + j;
float v_val = v[t_h_j_offset];
float kv_val = v_val * k_val;
float prev_state_val = state_prev[h_2d_i_j_offset];
float temp_val = prev_state_val * g_val + kv_val;
dst_data[t_h_j_offset] += temp_val * q_val;
state_cur[h_2d_i_j_offset] = temp_val;
}
}
}
}
#endif
}
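Review note: stripped of strides, threading and SIMD, the scalar fallback path above applies one recurrence per head and time step: `state[i][j] = state[i][j] * g[i] + k[i] * v[j]`, followed by `out[j] += state[i][j] * q[i] * scale`. The sketch below restates that for a single head. The function name `gla_step` and the use of `std::vector` are illustrative assumptions, not part of ggml.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One time step of the recurrence for a single head of size n.
// state is an n x n row-major matrix that is updated in place.
// out has n entries and is accumulated into, so the caller zeroes it first.
static void gla_step(std::vector<float> & state, std::vector<float> & out,
                     const std::vector<float> & k, const std::vector<float> & v,
                     const std::vector<float> & q, const std::vector<float> & g,
                     float scale) {
    const size_t n = k.size();
    assert(state.size() == n * n && out.size() == n);
    assert(v.size() == n && q.size() == n && g.size() == n);
    for (size_t i = 0; i < n; i++) {
        const float q_val = q[i] * scale;
        for (size_t j = 0; j < n; j++) {
            // decay the previous state by the gate, then add the outer product k[i] * v[j]
            const float s = state[i * n + j] * g[i] + k[i] * v[j];
            out[j] += s * q_val;
            state[i * n + j] = s;
        }
    }
}
```

In the kernel itself the same update runs for every head in `[h_start, h_end)`, with `state_prev` taken from `src[4]` on the first token of each sequence.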
static void ggml_compute_forward_gla(
const struct ggml_compute_params * params,
struct ggml_tensor * dst) {
const struct ggml_tensor * src0 = dst->src[0];
switch (src0->type) {
case GGML_TYPE_F32:
{
ggml_compute_forward_gla_f32(params, dst);
} break;
default:
{
GGML_ABORT("fatal error");
}
}
}
// ggml_compute_forward_map_unary
static void ggml_compute_forward_map_unary_f32(
@@ -12749,6 +12940,10 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
{
ggml_compute_forward_rwkv_wkv6(params, tensor);
} break;
case GGML_OP_GATED_LINEAR_ATTN:
{
ggml_compute_forward_gla(params, tensor);
} break;
case GGML_OP_MAP_UNARY:
{
ggml_unary_op_f32_t fun;
@@ -13047,6 +13242,7 @@ static int ggml_get_n_tasks(struct ggml_tensor * node, int n_threads) {
case GGML_OP_WIN_UNPART:
case GGML_OP_GET_REL_POS:
case GGML_OP_RWKV_WKV6:
case GGML_OP_GATED_LINEAR_ATTN:
case GGML_OP_MAP_UNARY:
case GGML_OP_MAP_BINARY:
case GGML_OP_MAP_CUSTOM1_F32:
+770 -66
@@ -54,6 +54,7 @@
#include "ggml-quants.h"
#include <atomic>
#include <array>
#ifdef _MSC_VER
#define NOINLINE __declspec(noinline)
@@ -1051,6 +1052,704 @@ class tinyBLAS_Q0_AVX {
} \
} \
template <typename TA, typename TB, typename TC>
class tinyBLAS_Q0_PPC {
public:
tinyBLAS_Q0_PPC(int64_t k,
const TA *A, int64_t lda,
const TB *B, int64_t ldb,
TC *C, int64_t ldc,
int ith, int nth)
: A(A), B(B), C(C), k(k), lda(lda), ldb(ldb), ldc(ldc), ith(ith), nth(nth) {
}
void matmul(int64_t m, int64_t n) {
mnpack(0, m, 0, n);
}
private:
template<int RM, int RN>
inline void save_res(int ii, int jj, int idx, vector float* fin_res) {
for (int I = 0; I < RM; I++) {
for (int J = 0; J < RN; J++) {
*((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&fin_res[idx+I]+J);
}
}
}
template<int size>
inline void compute(acc_t* ACC, int c_idx, int s_idx, std::array<int, size>& comparray, vector float* vs, vector float* fin_res) {
vector signed int vec_C[4];
vector float CA[4] = {0};
vector float res[4] = {0};
__builtin_mma_disassemble_acc(vec_C, ACC);
for (int i = 0; i < 4; i++) {
CA[i] = vec_splats((float)(((double)comparray[c_idx+i]) * -128.0));
res[i] = vec_add(vec_ctf(vec_C[i], 0), CA[i]);
fin_res[s_idx+i] = vec_madd(res[i], vs[s_idx+i], fin_res[s_idx+i]);
}
}
template<typename VA, typename VB>
void packNormal(const TA* a, int64_t lda, int rows, int cols, VA* vec, bool flip) {
int64_t i, j;
TA *aoffset = NULL;
VA *vecOffset = NULL;
TA *aoffset1 = NULL, *aoffset2 = NULL, *aoffset3 = NULL, *aoffset4 = NULL;
TA *aoffset5 = NULL, *aoffset6 = NULL, *aoffset7 = NULL, *aoffset8 = NULL;
__vector_pair C1, C2, C3, C4, C5, C6, C7, C8;
VB c1[2] = {0}, c2[2] = {0}, c3[2] = {0}, c4[2]={0};
VB c5[2] = {0}, c6[2] = {0}, c7[2] = {0}, c8[2]={0};
VB t1, t2, t3, t4, t5, t6, t7, t8;
vector unsigned char xor_vector;
uint8_t flip_vec = 0x80;
xor_vector = vec_splats(flip_vec);
vector unsigned char swiz1 = {0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23};
vector unsigned char swiz2 = {8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31};
vector unsigned char swiz3 = {0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27};
vector unsigned char swiz4 = {4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31};
aoffset = const_cast<TA*>(a);
vecOffset = vec;
j = (rows >> 3);
if (j > 0) {
do {
aoffset1 = aoffset;
aoffset2 = aoffset1 + lda;
aoffset3 = aoffset2 + lda;
aoffset4 = aoffset3 + lda;
aoffset5 = aoffset4 + lda;
aoffset6 = aoffset5 + lda;
aoffset7 = aoffset6 + lda;
aoffset8 = aoffset7 + lda;
aoffset += 8 * lda;
i = (cols >> 3);
if (i > 0) {
do {
C1 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset1->qs);
C2 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset2->qs);
C3 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset3->qs);
C4 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset4->qs);
C5 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset5->qs);
C6 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset6->qs);
C7 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset7->qs);
C8 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset8->qs);
__builtin_vsx_disassemble_pair(c1, &C1);
__builtin_vsx_disassemble_pair(c2, &C2);
__builtin_vsx_disassemble_pair(c3, &C3);
__builtin_vsx_disassemble_pair(c4, &C4);
__builtin_vsx_disassemble_pair(c5, &C5);
__builtin_vsx_disassemble_pair(c6, &C6);
__builtin_vsx_disassemble_pair(c7, &C7);
__builtin_vsx_disassemble_pair(c8, &C8);
t1 = vec_perm(c1[0], c2[0], swiz1);
t2 = vec_perm(c1[0], c2[0], swiz2);
t3 = vec_perm(c3[0], c4[0], swiz1);
t4 = vec_perm(c3[0], c4[0], swiz2);
t5 = vec_perm(t1, t3, swiz3);
t6 = vec_perm(t1, t3, swiz4);
t7 = vec_perm(t2, t4, swiz3);
t8 = vec_perm(t2, t4, swiz4);
if (flip == true) {
t5 = vec_xor(t5, xor_vector);
t6 = vec_xor(t6, xor_vector);
t7 = vec_xor(t7, xor_vector);
t8 = vec_xor(t8, xor_vector);
}
vec_xst(t5, 0, vecOffset);
vec_xst(t6, 0, vecOffset+16);
vec_xst(t7, 0, vecOffset+32);
vec_xst(t8, 0, vecOffset+48);
t1 = vec_perm(c1[1], c2[1], swiz1);
t2 = vec_perm(c1[1], c2[1], swiz2);
t3 = vec_perm(c3[1], c4[1], swiz1);
t4 = vec_perm(c3[1], c4[1], swiz2);
t5 = vec_perm(t1, t3, swiz3);
t6 = vec_perm(t1, t3, swiz4);
t7 = vec_perm(t2, t4, swiz3);
t8 = vec_perm(t2, t4, swiz4);
if (flip == true) {
t5 = vec_xor(t5, xor_vector);
t6 = vec_xor(t6, xor_vector);
t7 = vec_xor(t7, xor_vector);
t8 = vec_xor(t8, xor_vector);
}
vec_xst(t5, 0, vecOffset+64);
vec_xst(t6, 0, vecOffset+80);
vec_xst(t7, 0, vecOffset+96);
vec_xst(t8, 0, vecOffset+112);
t1 = vec_perm(c5[0], c6[0], swiz1);
t2 = vec_perm(c5[0], c6[0], swiz2);
t3 = vec_perm(c7[0], c8[0], swiz1);
t4 = vec_perm(c7[0], c8[0], swiz2);
t5 = vec_perm(t1, t3, swiz3);
t6 = vec_perm(t1, t3, swiz4);
t7 = vec_perm(t2, t4, swiz3);
t8 = vec_perm(t2, t4, swiz4);
if (flip == true) {
t5 = vec_xor(t5, xor_vector);
t6 = vec_xor(t6, xor_vector);
t7 = vec_xor(t7, xor_vector);
t8 = vec_xor(t8, xor_vector);
}
vec_xst(t5, 0, vecOffset+128);
vec_xst(t6, 0, vecOffset+144);
vec_xst(t7, 0, vecOffset+160);
vec_xst(t8, 0, vecOffset+176);
t1 = vec_perm(c5[1], c6[1], swiz1);
t2 = vec_perm(c5[1], c6[1], swiz2);
t3 = vec_perm(c7[1], c8[1], swiz1);
t4 = vec_perm(c7[1], c8[1], swiz2);
t5 = vec_perm(t1, t3, swiz3);
t6 = vec_perm(t1, t3, swiz4);
t7 = vec_perm(t2, t4, swiz3);
t8 = vec_perm(t2, t4, swiz4);
if (flip == true) {
t5 = vec_xor(t5, xor_vector);
t6 = vec_xor(t6, xor_vector);
t7 = vec_xor(t7, xor_vector);
t8 = vec_xor(t8, xor_vector);
}
vec_xst(t5, 0, vecOffset+192);
vec_xst(t6, 0, vecOffset+208);
vec_xst(t7, 0, vecOffset+224);
vec_xst(t8, 0, vecOffset+240);
aoffset1 += lda;
aoffset2 += lda;
aoffset3 += lda;
aoffset4 += lda;
aoffset5 += lda;
aoffset6 += lda;
aoffset7 += lda;
aoffset8 += lda;
vecOffset += 256;
i--;
} while(i > 0);
}
j--;
} while(j > 0);
}
if (rows & 4) {
aoffset1 = aoffset;
aoffset2 = aoffset1 + lda;
aoffset3 = aoffset2 + lda;
aoffset4 = aoffset3 + lda;
aoffset += 4 * lda;
i = (cols >> 3);
if (i > 0) {
do {
C1 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset1->qs);
C2 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset2->qs);
C3 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset3->qs);
C4 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset4->qs);
__builtin_vsx_disassemble_pair(c1, &C1);
__builtin_vsx_disassemble_pair(c2, &C2);
__builtin_vsx_disassemble_pair(c3, &C3);
__builtin_vsx_disassemble_pair(c4, &C4);
t1 = vec_perm(c1[0], c2[0], swiz1);
t2 = vec_perm(c1[0], c2[0], swiz2);
t3 = vec_perm(c3[0], c4[0], swiz1);
t4 = vec_perm(c3[0], c4[0], swiz2);
t5 = vec_perm(t1, t3, swiz3);
t6 = vec_perm(t1, t3, swiz4);
t7 = vec_perm(t2, t4, swiz3);
t8 = vec_perm(t2, t4, swiz4);
if (flip == true) {
t5 = vec_xor(t5, xor_vector);
t6 = vec_xor(t6, xor_vector);
t7 = vec_xor(t7, xor_vector);
t8 = vec_xor(t8, xor_vector);
}
vec_xst(t5, 0, vecOffset);
vec_xst(t6, 0, vecOffset+16);
vec_xst(t7, 0, vecOffset+32);
vec_xst(t8, 0, vecOffset+48);
t1 = vec_perm(c1[1], c2[1], swiz1);
t2 = vec_perm(c1[1], c2[1], swiz2);
t3 = vec_perm(c3[1], c4[1], swiz1);
t4 = vec_perm(c3[1], c4[1], swiz2);
t5 = vec_perm(t1, t3, swiz3);
t6 = vec_perm(t1, t3, swiz4);
t7 = vec_perm(t2, t4, swiz3);
t8 = vec_perm(t2, t4, swiz4);
if (flip == true) {
t5 = vec_xor(t5, xor_vector);
t6 = vec_xor(t6, xor_vector);
t7 = vec_xor(t7, xor_vector);
t8 = vec_xor(t8, xor_vector);
}
vec_xst(t5, 0, vecOffset+64);
vec_xst(t6, 0, vecOffset+80);
vec_xst(t7, 0, vecOffset+96);
vec_xst(t8, 0, vecOffset+112);
aoffset1 += lda;
aoffset2 += lda;
aoffset3 += lda;
aoffset4 += lda;
vecOffset += 128;
i--;
} while(i > 0);
}
}
if (rows & 3) {
aoffset1 = aoffset;
aoffset2 = aoffset1 + lda;
aoffset3 = aoffset2 + lda;
i = (cols >> 3);
if (i > 0) {
do {
switch(rows) {
case 3: C3 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset3->qs);
__builtin_vsx_disassemble_pair(c3, &C3);
case 2: C2 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset2->qs);
__builtin_vsx_disassemble_pair(c2, &C2);
case 1: C1 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset1->qs);
__builtin_vsx_disassemble_pair(c1, &C1);
break;
}
t1 = vec_perm(c1[0], c2[0], swiz1);
t2 = vec_perm(c1[0], c2[0], swiz2);
t3 = vec_perm(c3[0], c4[0], swiz1);
t4 = vec_perm(c3[0], c4[0], swiz2);
t5 = vec_perm(t1, t3, swiz3);
t6 = vec_perm(t1, t3, swiz4);
t7 = vec_perm(t2, t4, swiz3);
t8 = vec_perm(t2, t4, swiz4);
if (flip == true) {
t5 = vec_xor(t5, xor_vector);
t6 = vec_xor(t6, xor_vector);
t7 = vec_xor(t7, xor_vector);
t8 = vec_xor(t8, xor_vector);
}
vec_xst(t5, 0, vecOffset);
vec_xst(t6, 0, vecOffset+16);
vec_xst(t7, 0, vecOffset+32);
vec_xst(t8, 0, vecOffset+48);
t1 = vec_perm(c1[1], c2[1], swiz1);
t2 = vec_perm(c1[1], c2[1], swiz2);
t3 = vec_perm(c3[1], c4[1], swiz1);
t4 = vec_perm(c3[1], c4[1], swiz2);
t5 = vec_perm(t1, t3, swiz3);
t6 = vec_perm(t1, t3, swiz4);
t7 = vec_perm(t2, t4, swiz3);
t8 = vec_perm(t2, t4, swiz4);
if (flip == true) {
t5 = vec_xor(t5, xor_vector);
t6 = vec_xor(t6, xor_vector);
t7 = vec_xor(t7, xor_vector);
t8 = vec_xor(t8, xor_vector);
}
vec_xst(t5, 0, vecOffset+64);
vec_xst(t6, 0, vecOffset+80);
vec_xst(t7, 0, vecOffset+96);
vec_xst(t8, 0, vecOffset+112);
aoffset1 += lda;
aoffset2 += lda;
aoffset3 += lda;
vecOffset += 128;
i--;
} while(i > 0);
}
}
}
void mnpack(int64_t m0, int64_t m, int64_t n0, int64_t n) {
int64_t mc, nc, mp, np;
int m_rem = MIN(m - m0, 8);
int n_rem = MIN(n - n0, 8);
// TODO: KERNEL_16x8 and KERNEL_8x16 currently have performance
// issues. Enable the code below once they are resolved.
/*if (m_rem >= 16 && n_rem >= 8) {
mc = 16;
nc = 8;
gemm<16,8>(m0, m, n0, n);
} else if(m_rem >= 8 && n_rem >= 16) {
mc = 8;
nc = 16;
gemm<8,16>(m0, m, n0, n);
}*/
if (m_rem >= 8 && n_rem >= 8) {
mc = 8;
nc = 8;
gemm<8,8>(m0, m, n0, n);
} else if (m_rem >= 4 && n_rem >= 8) {
mc = 4;
nc = 8;
gemm<4,8>(m0, m, n0, n);
} else if (m_rem >= 8 && n_rem >= 4) {
mc = 8;
nc = 4;
gemm<8,4>(m0, m, n0, n);
} else if (m_rem >= 4 && n_rem >= 4) {
mc = 4;
nc = 4;
gemm_small<4, 4>(m0, m, n0, n);
} else if ((m_rem < 4) && (n_rem > 4)) {
nc = 4;
switch(m_rem) {
case 1:
mc = 1;
gemm_small<1, 4>(m0, m, n0, n);
break;
case 2:
mc = 2;
gemm_small<2, 4>(m0, m, n0, n);
break;
case 3:
mc = 3;
gemm_small<3, 4>(m0, m, n0, n);
break;
default:
return;
}
} else if ((m_rem > 4) && (n_rem < 4)) {
mc = 4;
switch(n_rem) {
case 1:
nc = 1;
gemm_small<4, 1>(m0, m, n0, n);
break;
case 2:
nc = 2;
gemm_small<4, 2>(m0, m, n0, n);
break;
case 3:
nc = 3;
gemm_small<4, 3>(m0, m, n0, n);
break;
default:
return;
}
} else {
switch((m_rem << 4) | n_rem) {
case 0x43:
mc = 4;
nc = 3;
gemm_small<4, 3>(m0, m, n0, n);
break;
case 0x42:
mc = 4;
nc = 2;
gemm_small<4, 2>(m0, m, n0, n);
break;
case 0x41:
mc = 4;
nc = 1;
gemm_small<4, 1>(m0, m, n0, n);
break;
case 0x34:
mc = 3;
nc = 4;
gemm_small<3, 4>(m0, m, n0, n);
break;
case 0x33:
mc = 3;
nc = 3;
gemm_small<3, 3>(m0, m, n0, n);
break;
case 0x32:
mc = 3;
nc = 2;
gemm_small<3, 2>(m0, m, n0, n);
break;
case 0x31:
mc = 3;
nc = 1;
gemm_small<3, 1>(m0, m, n0, n);
break;
case 0x24:
mc = 2;
nc = 4;
gemm_small<2, 4>(m0, m, n0, n);
break;
case 0x23:
mc = 2;
nc = 3;
gemm_small<2, 3>(m0, m, n0, n);
break;
case 0x22:
mc = 2;
nc = 2;
gemm_small<2, 2>(m0, m, n0, n);
break;
case 0x21:
mc = 2;
nc = 1;
gemm_small<2, 1>(m0, m, n0, n);
break;
case 0x14:
mc = 1;
nc = 4;
gemm_small<1, 4>(m0, m, n0, n);
break;
case 0x13:
mc = 1;
nc = 3;
gemm_small<1, 3>(m0, m, n0, n);
break;
case 0x12:
mc = 1;
nc = 2;
gemm_small<1, 2>(m0, m, n0, n);
break;
case 0x11:
mc = 1;
nc = 1;
gemm_small<1, 1>(m0, m, n0, n);
break;
default:
return;
}
}
mp = m0 + (m - m0) / mc * mc;
np = n0 + (n - n0) / nc * nc;
mnpack(mp, m, n0, np);
mnpack(m0, m, np, n);
}
void KERNEL_4x8(int64_t ii, int64_t jj) {
vec_t vec_A[8], vec_B[16] = {0};
acc_t acc_0, acc_1;
std::array<int, 4> comparray;
vector float fin_res[8] = {0};
vector float vs[8] = {0};
for (int l = 0; l < k; l++) {
__builtin_mma_xxsetaccz(&acc_0);
__builtin_mma_xxsetaccz(&acc_1);
packNormal<int8_t, vector signed char>((A+(ii*lda)+l), lda, 4, 8, (int8_t*)vec_A, false);
packNormal<uint8_t, vector unsigned char>((B+(jj*ldb)+l), ldb, 8, 8, (uint8_t*)vec_B, true);
for(int x = 0; x < 8; x++) {
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(&acc_1, vec_A[x], vec_B[x+8]);
}
for (int I = 0; I<4; I++) {
for (int J = 0; J<4; J++) {
*((float*)&vs[I]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J)*ldb)+l)->d));
*((float*)&vs[I+4]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J+4)*ldb)+l)->d));
}
}
auto aoffset = A+(ii*lda)+l;
for (int i = 0; i < 4; i++) {
comparray[i] = 0;
int ca = 0;
const int8_t *at = aoffset->qs;
for (int j = 0; j < 32; j++)
ca += (int)*at++;
comparray[i] = ca;
aoffset += lda;
}
compute<4>(&acc_0, 0, 0, comparray, vs, fin_res);
compute<4>(&acc_1, 0, 4, comparray, vs, fin_res);
}
save_res<4, 4>(ii, jj, 0, fin_res);
save_res<4, 4>(ii, jj+4, 4, fin_res);
}
void KERNEL_8x4(int64_t ii, int64_t jj) {
vec_t vec_A[16], vec_B[8] = {0};
acc_t acc_0, acc_1;
std::array<int, 8> comparray;
vector float fin_res[8] = {0};
vector float vs[8] = {0};
for (int l = 0; l < k; l++) {
__builtin_mma_xxsetaccz(&acc_0);
__builtin_mma_xxsetaccz(&acc_1);
packNormal<int8_t, vector signed char>((A+(ii*lda)+l), lda, 8, 8, (int8_t*)vec_A, false);
packNormal<uint8_t, vector unsigned char>((B+(jj*ldb)+l), ldb, 4, 8, (uint8_t*)vec_B, true);
for(int x = 0; x < 8; x++) {
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(&acc_1, vec_A[x+8], vec_B[x]);
}
for (int I = 0; I<8; I++) {
for (int J = 0; J<4; J++) {
*((float*)&vs[I]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J)*ldb)+l)->d));
}
}
auto aoffset = A+(ii*lda)+l;
for (int i = 0; i < 8; i++) {
comparray[i] = 0;
int ca = 0;
const int8_t *at = aoffset->qs;
for (int j = 0; j < 32; j++)
ca += (int)*at++;
comparray[i] = ca;
aoffset += lda;
}
compute<8>(&acc_0, 0, 0, comparray, vs, fin_res);
compute<8>(&acc_1, 4, 4, comparray, vs, fin_res);
}
save_res<4, 4>(ii, jj, 0, fin_res);
save_res<4, 4>(ii+4, jj, 4, fin_res);
}
void KERNEL_8x8(int64_t ii, int64_t jj) {
vec_t vec_A[16], vec_B[16] = {0};
acc_t acc_0, acc_1, acc_2, acc_3;
std::array<int, 8> comparray;
vector float fin_res[16] = {0};
vector float vs[16] = {0};
for (int l = 0; l < k; l++) {
__builtin_mma_xxsetaccz(&acc_0);
__builtin_mma_xxsetaccz(&acc_1);
__builtin_mma_xxsetaccz(&acc_2);
__builtin_mma_xxsetaccz(&acc_3);
packNormal<int8_t, vector signed char>((A+(ii*lda)+l), lda, 8, 8, (int8_t*)vec_A, false);
packNormal<uint8_t, vector unsigned char>((B+(jj*ldb)+l), ldb, 8, 8, (uint8_t*)vec_B, true);
for(int x = 0; x < 8; x++) {
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(&acc_1, vec_A[x+8], vec_B[x]);
__builtin_mma_xvi8ger4pp(&acc_2, vec_A[x], vec_B[x+8]);
__builtin_mma_xvi8ger4pp(&acc_3, vec_A[x+8], vec_B[x+8]);
}
for (int I = 0; I<8; I++) {
for (int J = 0; J<4; J++) {
*((float*)&vs[I]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J)*ldb)+l)->d));
*((float*)&vs[I+8]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J+4)*ldb)+l)->d));
}
}
auto aoffset = A+(ii*lda)+l;
for (int i = 0; i < 8; i++) {
comparray[i] = 0;
int ca = 0;
const int8_t *at = aoffset->qs;
for (int j = 0; j < 32; j++)
ca += (int)*at++;
comparray[i] = ca;
aoffset += lda;
}
compute<8>(&acc_0, 0, 0, comparray, vs, fin_res);
compute<8>(&acc_1, 4, 4, comparray, vs, fin_res);
compute<8>(&acc_2, 0, 8, comparray, vs, fin_res);
compute<8>(&acc_3, 4, 12, comparray, vs, fin_res);
}
save_res<4, 4>(ii, jj, 0, fin_res);
save_res<4, 4>(ii+4, jj, 4, fin_res);
save_res<4, 4>(ii, jj+4, 8, fin_res);
save_res<4, 4>(ii+4, jj+4, 12, fin_res);
}
template<int RM, int RN>
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n) {
int64_t ytiles = (m - m0) / RM;
int64_t xtiles = (n - n0) / RN;
int64_t tiles = xtiles * ytiles;
int64_t duty = (tiles + nth - 1) / nth;
int64_t start = duty * ith;
int64_t end = start + duty;
vec_t vec_A[8], vec_B[8] = {0};
vector signed int vec_C[4];
acc_t acc_0;
if (end > tiles)
end = tiles;
for (int64_t job = start; job < end; ++job) {
int64_t ii = m0 + job / xtiles * RM;
int64_t jj = n0 + job % xtiles * RN;
std::array<int, RM> comparray;
vector float res[4] = {0};
vector float fin_res[4] = {0};
vector float vs[4] = {0};
vector float CA[4] = {0};
__builtin_prefetch((A+(ii*lda)+0)->qs, 0, 1); // prefetch first value
__builtin_prefetch((B+(jj*ldb)+0)->qs, 0, 1); // prefetch first value
for (int l = 0; l < k; l++) {
__builtin_prefetch((A+(ii*lda)+(l+1))->qs, 0, 1); // prefetch one loop ahead
__builtin_prefetch((B+(jj*ldb)+(l+1))->qs, 0, 1); // prefetch one loop ahead
__builtin_mma_xxsetaccz(&acc_0);
packNormal<int8_t, vector signed char>((A+(ii*lda)+l), lda, RM, 8, (int8_t*)vec_A, false);
packNormal<uint8_t, vector unsigned char>((B+(jj*ldb)+l), ldb, RN, 8, (uint8_t*)vec_B, true);
for(int x = 0; x < 8; x+=4) {
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x+1], vec_B[x+1]);
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x+2], vec_B[x+2]);
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x+3], vec_B[x+3]);
}
for (int I = 0; I<RM; I++) {
for (int J = 0; J<RN; J++) {
*((float*)&vs[I]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J)*ldb)+l)->d));
}
}
__builtin_mma_disassemble_acc(vec_C, &acc_0);
auto aoffset = A+(ii*lda)+l;
for (int i = 0; i < RM; i++) {
comparray[i] = 0;
int ca = 0;
const int8_t *at = aoffset->qs;
for (int j = 0; j < 32; j++)
ca += (int)*at++;
comparray[i] = ca;
aoffset += lda;
}
for (int i = 0; i < RM; i++) {
CA[i] = vec_splats((float)(((double)comparray[i]) * -128.0));
res[i] = vec_add(vec_ctf(vec_C[i], 0), CA[i]);
fin_res[i] = vec_madd(res[i], vs[i], fin_res[i]);
}
}
save_res<RM, RN>(ii, jj, 0, fin_res);
}
}
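The `comparray` row sums and the `-128.0` factor in `gemm_small` read like an offset correction. Assuming the `true` flag passed to `packNormal` biases each B value by +128 so that it fits an unsigned byte (an assumption, since `packNormal` is not shown in this hunk), the integer dot product overshoots by 128 times the sum of the A row, and adding `-128 * row_sum` removes that excess. A minimal scalar sketch of the identity, with hypothetical function names:

```cpp
#include <array>
#include <cstdint>

// Scalar sketch (hypothetical names, not the MMA code path).
constexpr int kBlock = 32;  // elements per quantized block, as in the j < 32 loop

// Reference: plain signed-by-signed dot product.
int dot_signed(const std::array<int8_t, kBlock>& a,
               const std::array<int8_t, kBlock>& b) {
    int acc = 0;
    for (int i = 0; i < kBlock; ++i) {
        acc += int(a[i]) * int(b[i]);
    }
    return acc;
}

// Signed-by-unsigned product with B biased by +128 (assumed packing behavior),
// followed by the row-sum correction.
int dot_biased_corrected(const std::array<int8_t, kBlock>& a,
                         const std::array<int8_t, kBlock>& b) {
    int acc = 0;
    int row_sum = 0;
    for (int i = 0; i < kBlock; ++i) {
        const uint8_t ub = uint8_t(int(b[i]) + 128);  // maps [-128, 127] to [0, 255]
        acc += int(a[i]) * int(ub);
        row_sum += int(a[i]);
    }
    // sum(a * (b + 128)) = sum(a * b) + 128 * sum(a), so subtract the excess.
    return acc + row_sum * -128;
}
```

Both functions return the same value for any int8 inputs; the vector code applies the same correction per row before scaling by the block deltas.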
template<int RM, int RN>
inline void kernel(int64_t ii, int64_t jj) {
if constexpr(RM == 4 && RN == 8) {
KERNEL_4x8(ii,jj);
} else if constexpr(RM == 8 && RN == 4) {
KERNEL_8x4(ii,jj);
} else if constexpr(RM == 8 && RN == 8) {
KERNEL_8x8(ii,jj);
} else {
static_assert(false, "RN/RM values not supported");
}
}
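`kernel<RM, RN>` selects a micro-kernel at compile time with `if constexpr`. One portability note: a bare `static_assert(false, ...)` in the final `else` branch is rejected by compilers that predate the C++23 relaxation of that rule, even when the branch is never instantiated. A common workaround makes the condition depend on the template parameters. The sketch below uses hypothetical names (`Tile`, `unsupported_tile`, `select_kernel`) and only illustrates the idiom; it is not the code in this change.

```cpp
// Hypothetical identifiers standing in for KERNEL_4x8 / KERNEL_8x4 / KERNEL_8x8.
enum class Tile { k4x8, k8x4, k8x8 };

// Always false, but only evaluated when a specific <RM, RN> is instantiated.
template <int RM, int RN>
inline constexpr bool unsupported_tile = false;

template <int RM, int RN>
constexpr Tile select_kernel() {
    if constexpr (RM == 4 && RN == 8) {
        return Tile::k4x8;
    } else if constexpr (RM == 8 && RN == 4) {
        return Tile::k8x4;
    } else if constexpr (RM == 8 && RN == 8) {
        return Tile::k8x8;
    } else {
        // Fires only for unsupported <RM, RN> combinations.
        static_assert(unsupported_tile<RM, RN>, "RN/RM values not supported");
    }
}
```

Instantiating `select_kernel<4, 4>()` fails to compile with the message above, while the three supported shapes compile on C++17 and later.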
template <int RM, int RN>
NOINLINE void gemm(int64_t m0, int64_t m, int64_t n0, int64_t n) {
int64_t ytiles = (m - m0) / RM;
int64_t xtiles = (n - n0) / RN;
int64_t tiles = xtiles * ytiles;
int64_t duty = (tiles + nth - 1) / nth;
int64_t start = duty * ith;
int64_t end = start + duty;
if (end > tiles)
end = tiles;
for (int64_t job = start; job < end; ++job) {
int64_t ii = m0 + job / xtiles * RM;
int64_t jj = n0 + job % xtiles * RN;
kernel<RM, RN>(ii, jj);
}
}
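Both `gemm` and `gemm_small` split the tile grid across threads the same way: each thread takes a contiguous run of `ceil(tiles / nth)` jobs, and each job index maps back to a tile origin. The following sketch restates that arithmetic in isolation; the helper names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Half-open job range [start, end) for thread `ith` out of `nth`.
// When start >= end the thread has no work.
std::pair<int64_t, int64_t> thread_job_range(int64_t tiles, int ith, int nth) {
    const int64_t duty  = (tiles + nth - 1) / nth;  // ceil(tiles / nth)
    const int64_t start = duty * ith;
    const int64_t end   = std::min(start + duty, tiles);
    return {start, end};
}

// Maps a linear job index to the (row, column) origin of its RM x RN tile.
std::pair<int64_t, int64_t> job_to_tile(int64_t job, int64_t xtiles,
                                        int64_t m0, int64_t n0, int rm, int rn) {
    const int64_t ii = m0 + job / xtiles * rm;
    const int64_t jj = n0 + job % xtiles * rn;
    return {ii, jj};
}
```

With 10 tiles and 4 threads, threads 0 to 2 each get three jobs and thread 3 gets the single remaining one, so trailing threads may receive less work or none.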
const TA *const A;
const TB *const B;
TC *C;
TA *At;
TB *Bt;
const int64_t k;
const int64_t lda;
const int64_t ldb;
const int64_t ldc;
const int ith;
const int nth;
};
template <typename TA, typename TB, typename TC>
class tinyBLAS_PPC {
public:
@@ -1070,13 +1769,17 @@ class tinyBLAS_PPC {
void (tinyBLAS_PPC::*kernel)(int64_t, int64_t);
void READ_BLOCK(const float* a, int64_t lda, int rows, int cols, float* vec) {
template<typename VA>
void packTranspose(const TA* a, int64_t lda, int rows, int cols, TA* vec) {
int64_t i, j;
float *aoffset = NULL, *boffset = NULL;
float *aoffset1 = NULL, *aoffset2 = NULL, *aoffset3 = NULL, *aoffset4 = NULL;
float *aoffset5 = NULL, *aoffset6 = NULL, *aoffset7 = NULL, *aoffset8 = NULL;
aoffset = const_cast<float*>(a);
TA *aoffset = NULL, *boffset = NULL;
TA *aoffset1 = NULL, *aoffset2 = NULL, *aoffset3 = NULL, *aoffset4 = NULL;
TA *aoffset5 = NULL, *aoffset6 = NULL, *aoffset7 = NULL, *aoffset8 = NULL;
__vector_pair C1, C2, C3, C4, C5, C6, C7, C8;
VA c1[2] = {0}, c2[2] = {0}, c3[2] = {0}, c4[2] = {0};
VA c5[2] = {0}, c6[2] = {0}, c7[2] = {0}, c8[2] = {0};
VA t1, t2, t3, t4, t5, t6, t7, t8;
aoffset = const_cast<TA*>(a);
boffset = vec;
j = (rows >> 3);
if (j > 0) {
@@ -1092,9 +1795,6 @@ class tinyBLAS_PPC {
aoffset += 8 * lda;
i = (cols >> 3);
if (i > 0) {
__vector_pair C1, C2, C3, C4, C5, C6, C7, C8;
vector float c1[2], c2[2], c3[2], c4[2], c5[2], c6[2], c7[2], c8[2];
vector float t1, t2, t3, t4, t5, t6, t7, t8;
do {
C1 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset1);
C2 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset2);
@@ -1174,21 +1874,19 @@ class tinyBLAS_PPC {
} while(i > 0);
}
if (cols & 4) {
vector float c1, c2, c3, c4, c5, c6, c7, c8;
vector float t1, t2, t3, t4, t5, t6, t7, t8;
c1 = vec_xl(0, aoffset1);
c2 = vec_xl(0, aoffset2);
c3 = vec_xl(0, aoffset3);
c4 = vec_xl(0, aoffset4);
c5 = vec_xl(0, aoffset5);
c6 = vec_xl(0, aoffset6);
c7 = vec_xl(0, aoffset7);
c8 = vec_xl(0, aoffset8);
c1[0] = vec_xl(0, aoffset1);
c2[0] = vec_xl(0, aoffset2);
c3[0] = vec_xl(0, aoffset3);
c4[0] = vec_xl(0, aoffset4);
c5[0] = vec_xl(0, aoffset5);
c6[0] = vec_xl(0, aoffset6);
c7[0] = vec_xl(0, aoffset7);
c8[0] = vec_xl(0, aoffset8);
t1 = vec_mergeh(c1, c2);
t2 = vec_mergeh(c3, c4);
t3 = vec_mergeh(c5, c6);
t4 = vec_mergeh(c7, c8);
t1 = vec_mergeh(c1[0], c2[0]);
t2 = vec_mergeh(c3[0], c4[0]);
t3 = vec_mergeh(c5[0], c6[0]);
t4 = vec_mergeh(c7[0], c8[0]);
t5 = vec_xxpermdi(t1, t2, 0);
t6 = vec_xxpermdi(t3, t4, 0);
t7 = vec_xxpermdi(t1, t2, 3);
@@ -1198,10 +1896,10 @@ class tinyBLAS_PPC {
vec_xst(t7, 0, boffset+8);
vec_xst(t8, 0, boffset+12);
t1 = vec_mergel(c1, c2);
t2 = vec_mergel(c3, c4);
t3 = vec_mergel(c5, c6);
t4 = vec_mergel(c7, c8);
t1 = vec_mergel(c1[0], c2[0]);
t2 = vec_mergel(c3[0], c4[0]);
t3 = vec_mergel(c5[0], c6[0]);
t4 = vec_mergel(c7[0], c8[0]);
t5 = vec_xxpermdi(t1, t2, 0);
t6 = vec_xxpermdi(t3, t4, 0);
t7 = vec_xxpermdi(t1, t2, 3);
@@ -1223,9 +1921,6 @@ class tinyBLAS_PPC {
aoffset += 4 * lda;
i = (cols >> 3);
if (i > 0) {
__vector_pair C1, C2, C3, C4;
vector float c1[2], c2[2], c3[2], c4[2];
vector float t1, t2, t3, t4, t5, t6, t7, t8;
do {
C1 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset1);
C2 = __builtin_vsx_lxvp(0, (__vector_pair*)aoffset2);
@@ -1272,22 +1967,20 @@ class tinyBLAS_PPC {
}
if (cols & 4) {
vector float c1, c2, c3, c4;
vector float t1, t2, t3, t4;
c1 = vec_xl(0, aoffset1);
c2 = vec_xl(0, aoffset2);
c3 = vec_xl(0, aoffset3);
c4 = vec_xl(0, aoffset4);
c1[0] = vec_xl(0, aoffset1);
c2[0] = vec_xl(0, aoffset2);
c3[0] = vec_xl(0, aoffset3);
c4[0] = vec_xl(0, aoffset4);
t1 = vec_mergeh(c1, c2);
t2 = vec_mergeh(c3, c4);
t1 = vec_mergeh(c1[0], c2[0]);
t2 = vec_mergeh(c3[0], c4[0]);
t3 = vec_xxpermdi(t1, t2, 0);
t4 = vec_xxpermdi(t1, t2, 3);
vec_xst(t3, 0, boffset);
vec_xst(t4, 0, boffset+4);
t1 = vec_mergel(c1, c2);
t2 = vec_mergel(c3, c4);
t1 = vec_mergel(c1[0], c2[0]);
t2 = vec_mergel(c3[0], c4[0]);
t3 = vec_xxpermdi(t1, t2, 0);
t4 = vec_xxpermdi(t1, t2, 3);
vec_xst(t3, 0, boffset+8);
@@ -1299,21 +1992,19 @@ class tinyBLAS_PPC {
aoffset2 = aoffset1 + lda;
aoffset3 = aoffset2 + lda;
if (cols & 4) {
vector float c1, c2, c3, c4 = {0};
vector float t1, t2, t3, t4;
c1 = vec_xl(0, aoffset1);
c2 = vec_xl(0, aoffset2);
c3 = vec_xl(0, aoffset3);
c1[0] = vec_xl(0, aoffset1);
c2[0] = vec_xl(0, aoffset2);
c3[0] = vec_xl(0, aoffset3);
t1 = vec_mergeh(c1, c2);
t2 = vec_mergeh(c3, c4);
t1 = vec_mergeh(c1[0], c2[0]);
t2 = vec_mergeh(c3[0], c4[0]);
t3 = vec_xxpermdi(t1, t2, 0);
t4 = vec_xxpermdi(t1, t2, 3);
vec_xst(t3, 0, boffset);
vec_xst(t4, 0, boffset+4);
t1 = vec_mergel(c1, c2);
t2 = vec_mergel(c3, c4);
t1 = vec_mergel(c1[0], c2[0]);
t2 = vec_mergel(c3[0], c4[0]);
t3 = vec_xxpermdi(t1, t2, 0);
t4 = vec_xxpermdi(t1, t2, 3);
vec_xst(t3, 0, boffset+8);
@@ -1321,14 +2012,13 @@ class tinyBLAS_PPC {
}
}
}
void KERNEL_4x4(int64_t ii, int64_t jj) {
vec_t vec_A[4], vec_B[4], vec_C[4];
acc_t acc_0;
__builtin_mma_xxsetaccz(&acc_0);
for (int l = 0; l < k; l+=4) {
READ_BLOCK(A+(ii*lda)+l, lda, 4, 4, (float*)vec_A);
READ_BLOCK(B+(jj*ldb)+l, ldb, 4, 4, (float*)vec_B);
packTranspose<vector float>(A+(ii*lda)+l, lda, 4, 4, (TA*)vec_A);
packTranspose<vector float>(B+(jj*ldb)+l, ldb, 4, 4, (TA*)vec_B);
__builtin_mma_xvf32gerpp(&acc_0, vec_A[0], vec_B[0]);
__builtin_mma_xvf32gerpp(&acc_0, vec_A[1], vec_B[1]);
__builtin_mma_xvf32gerpp(&acc_0, vec_A[2], vec_B[2]);
@@ -1343,8 +2033,8 @@ class tinyBLAS_PPC {
__builtin_mma_xxsetaccz(&acc_0);
__builtin_mma_xxsetaccz(&acc_1);
for (int64_t l = 0; l < k; l+=4) {
READ_BLOCK(A+(ii*lda)+l, lda, 4, 4, (float*)vec_A);
READ_BLOCK(B+(jj*ldb)+l, ldb, 8, 4, (float*)vec_B);
packTranspose<vector float>(A+(ii*lda)+l, lda, 4, 4, (TA*)vec_A);
packTranspose<vector float>(B+(jj*ldb)+l, ldb, 8, 4, (TA*)vec_B);
__builtin_mma_xvf32gerpp(&acc_0, vec_A[0], (vec_t)vec_B[0]);
__builtin_mma_xvf32gerpp(&acc_1, vec_A[0], (vec_t)vec_B[1]);
__builtin_mma_xvf32gerpp(&acc_0, vec_A[1], (vec_t)vec_B[2]);
@@ -1364,8 +2054,8 @@ class tinyBLAS_PPC {
__builtin_mma_xxsetaccz(&acc_0);
__builtin_mma_xxsetaccz(&acc_1);
for (int64_t l = 0; l < k; l+=4) {
READ_BLOCK(A+(ii*lda)+l, lda, 8, 4, (float*)vec_A);
READ_BLOCK(B+(jj*ldb)+l, ldb, 4, 4, (float*)vec_B);
packTranspose<vector float>(A+(ii*lda)+l, lda, 8, 4, (TA*)vec_A);
packTranspose<vector float>(B+(jj*ldb)+l, ldb, 4, 4, (TA*)vec_B);
__builtin_mma_xvf32gerpp(&acc_0, (vec_t)vec_A[0], vec_B[0]);
__builtin_mma_xvf32gerpp(&acc_1, (vec_t)vec_A[1], vec_B[0]);
__builtin_mma_xvf32gerpp(&acc_0, (vec_t)vec_A[2], vec_B[1]);
@@ -1387,8 +2077,8 @@ class tinyBLAS_PPC {
__builtin_mma_xxsetaccz(&acc_2);
__builtin_mma_xxsetaccz(&acc_3);
for (int l = 0; l < k; l+=8) {
READ_BLOCK(A+(ii*lda)+l, lda, 8, 8, (float*)vec_A);
READ_BLOCK(B+(jj*ldb)+l, ldb, 8, 8, (float*)vec_B);
packTranspose<vector float>(A+(ii*lda)+l, lda, 8, 8, (TA*)vec_A);
packTranspose<vector float>(B+(jj*ldb)+l, ldb, 8, 8, (TA*)vec_B);
for(int x = 0; x < 16; x+=2) {
__builtin_mma_xvf32gerpp(&acc_0, (vec_t)vec_A[x], vec_B[x]);
__builtin_mma_xvf32gerpp(&acc_1, (vec_t)vec_A[x], vec_B[x+1]);
@@ -1571,15 +2261,15 @@ class tinyBLAS_PPC {
vec_t vec_A[4], vec_B[4];
for (int l=0; l<k; l+=4) {
if (RN >= 4 && RM == 1) {
float* a = const_cast<float*>(A+(ii)*lda+l);
READ_BLOCK(B+(jj*ldb)+l, ldb, 4, 4, (float*)vec_B);
TA* a = const_cast<TA*>(A+(ii)*lda+l);
packTranspose<vector float>(B+(jj*ldb)+l, ldb, 4, 4, (TA*)vec_B);
vec_A[0] = (vec_t)vec_xl(0,a);
vec_A[1] = (vec_t)vec_splats(*((float*)&vec_A+1));
vec_A[2] = (vec_t)vec_splats(*((float*)&vec_A+2));
vec_A[3] = (vec_t)vec_splats(*((float*)&vec_A+3));
vec_A[1] = (vec_t)vec_splats(*((TA*)&vec_A+1));
vec_A[2] = (vec_t)vec_splats(*((TA*)&vec_A+2));
vec_A[3] = (vec_t)vec_splats(*((TA*)&vec_A+3));
} else {
READ_BLOCK(A+(ii*lda)+l, lda, RM, 4, (float*)vec_A);
READ_BLOCK(B+(jj*ldb)+l, ldb, RN, 4, (float*)vec_B);
packTranspose<vector float>(A+(ii*lda)+l, lda, RM, 4, (TA*)vec_A);
packTranspose<vector float>(B+(jj*ldb)+l, ldb, RN, 4, (TA*)vec_B);
}
__builtin_mma_xvf32gerpp(&acc_0, vec_A[0], vec_B[0]);
__builtin_mma_xvf32gerpp(&acc_0, vec_A[1], vec_B[1]);
@@ -1589,7 +2279,7 @@ class tinyBLAS_PPC {
__builtin_mma_disassemble_acc(vec_C, &acc_0);
for (int I = 0; I < RM; I++) {
for (int J = 0; J < RN; J++) {
*((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&vec_C[I]+J);
*((TC*)(C+ii+((jj+J)*ldc)+I)) = *((TC*)&vec_C[I]+J);
}
}
}
@@ -1812,6 +2502,20 @@ bool llamafile_sgemm(const struct ggml_compute_params * params, int64_t m, int64
params->ith, params->nth};
tb.matmul(m, n);
return true;
#elif defined(__MMA__)
if (n < 8 && n != 4)
return false;
if (m < 8 && m != 4)
return false;
tinyBLAS_Q0_PPC<block_q8_0, block_q8_0, float> tb{
k, (const block_q8_0 *)A, lda,
(const block_q8_0 *)B, ldb,
(float *)C, ldc,
params->ith, params->nth};
tb.matmul(m, n);
return true;
#else
return false;
#endif
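The `__MMA__` branch above only accepts shapes where each of `m` and `n` is either exactly 4 or at least 8, and returns `false` otherwise so the caller can fall back to another path. Restated as a standalone predicate (the function name is hypothetical):

```cpp
#include <cstdint>

// Mirrors the two early-return checks in the __MMA__ Q8_0 branch.
bool mma_q8_shape_supported(int64_t m, int64_t n) {
    if (n < 8 && n != 4) return false;
    if (m < 8 && m != 4) return false;
    return true;
}
```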
+1 -1
@@ -124,7 +124,7 @@ static __global__ void __launch_bounds__(CUDA_CONCAT_BLOCK_SIZE)
uint64_t nb1,
uint64_t nb2,
uint64_t nb3){
static_assert(dim >= 0 && dim <= 3);
static_assert(dim >= 0 && dim <= 3, "dim must be in [0, 3]");
const int64_t i3 = blockIdx.z;
const int64_t i2 = blockIdx.y;
+5
@@ -37,6 +37,7 @@
#include "ggml-cuda/unary.cuh"
#include "ggml-cuda/upscale.cuh"
#include "ggml-cuda/wkv6.cuh"
#include "ggml-cuda/gla.cuh"
#include <algorithm>
#include <array>
@@ -2167,6 +2168,9 @@ static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct gg
case GGML_OP_RWKV_WKV6:
ggml_cuda_op_rwkv_wkv6(ctx, dst);
break;
case GGML_OP_GATED_LINEAR_ATTN:
ggml_cuda_op_gated_linear_attn(ctx, dst);
break;
case GGML_OP_CROSS_ENTROPY_LOSS_BACK:
ggml_cuda_cross_entropy_loss_back(ctx, dst);
break;
@@ -3011,6 +3015,7 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_OP_TIMESTEP_EMBEDDING:
case GGML_OP_LEAKY_RELU:
case GGML_OP_RWKV_WKV6:
case GGML_OP_GATED_LINEAR_ATTN:
return true;
case GGML_OP_FLASH_ATTN_EXT: {
#ifndef FLASH_ATTN_AVAILABLE
+93
@@ -0,0 +1,93 @@
#include "common.cuh"
#include "gla.cuh"
template<int HEAD_SIZE>
static __global__ void gated_linear_attn_f32(const int B, const int T, const int C, const int H, const float scale,
const float * k, const float * v, const float * r, const float * td, const float * s, float * dst) {
const int tid = threadIdx.x;
const int bid = blockIdx.x;
const int head_size = HEAD_SIZE;
const int batch_i = bid / H;
const int head_i = bid % H;
const int state_size = C * head_size;
const int n_seq_tokens = T / B;
float state[head_size];
__shared__ float _k[head_size], _r[head_size], _td[head_size];
#pragma unroll
for (int i = 0; i < head_size; i++) {
state[i] = s[batch_i * state_size + head_i * head_size * head_size + i * head_size + tid];
}
for (int t = batch_i * n_seq_tokens * C + head_i * head_size + tid; t < (batch_i + 1) * n_seq_tokens * C + head_i * head_size + tid; t += C) {
__syncthreads();
_k[tid] = k[t];
_r[tid] = r[t];
_td[tid] = td[t];
__syncthreads();
const float _v = v[t];
float y = 0;
for (int j = 0; j < head_size; j += 4) {
const float4 & k = (float4 &)(_k[j]);
const float4 & r = (float4 &)(_r[j]);
const float4 & td = (float4 &)(_td[j]);
float4 & s = (float4 &)(state[j]);
float4 kv;
kv.x = k.x * _v;
kv.y = k.y * _v;
kv.z = k.z * _v;
kv.w = k.w * _v;
s.x = s.x * td.x + kv.x;
s.y = s.y * td.y + kv.y;
s.z = s.z * td.z + kv.z;
s.w = s.w * td.w + kv.w;
y += r.x * s.x;
y += r.y * s.y;
y += r.z * s.z;
y += r.w * s.w;
}
dst[t] = y * scale;
}
#pragma unroll
for (int i = 0; i < head_size; i++) {
dst[T * C + batch_i * state_size + head_i * head_size * head_size + i * head_size + tid] = state[i];
}
}
void ggml_cuda_op_gated_linear_attn(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
const float * k_d = (const float *)dst->src[0]->data;
const float * v_d = (const float *)dst->src[1]->data;
const float * r_d = (const float *)dst->src[2]->data;
const float * td_d = (const float *)dst->src[3]->data;
const float * s_d = (const float *)dst->src[4]->data;
const int64_t B = dst->src[4]->ne[1];
const int64_t T = dst->src[0]->ne[2];
const int64_t C = dst->ne[0];
const int64_t H = dst->src[0]->ne[1];
float scale;
memcpy(&scale, (float*)dst->op_params, sizeof(float));
float * dst_d = (float *)dst->data;
cudaStream_t stream = ctx.stream();
GGML_ASSERT(dst->src[4]->type == GGML_TYPE_F32);
GGML_ASSERT(C % H == 0);
GGML_ASSERT(C / H == 64 || C / H == 128);
if (C / H == 64) {
gated_linear_attn_f32<64><<<B * H, C / H, 0, stream>>>(B, T, C, H, scale, k_d, v_d, r_d, td_d, s_d, dst_d);
} else {
gated_linear_attn_f32<128><<<B * H, C / H, 0, stream>>>(B, T, C, H, scale, k_d, v_d, r_d, td_d, s_d, dst_d);
}
}
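For checking the CUDA kernel against a scalar baseline, the per-token update it performs for one head can be written as a plain CPU loop. This is a sketch derived from reading the kernel above, with a hypothetical helper name (`gla_step`); it handles a single head and a single token and is not part of ggml.

```cpp
#include <cstddef>
#include <vector>

// One token step of the gated linear attention recurrence for one head.
// `state` is row-major head_size x head_size and is updated in place;
// k, v, r, td each hold head_size values for the current token.
std::vector<float> gla_step(std::vector<float>& state,
                            const std::vector<float>& k,
                            const std::vector<float>& v,
                            const std::vector<float>& r,
                            const std::vector<float>& td,
                            float scale) {
    const std::size_t n = k.size();
    std::vector<float> out(n, 0.0f);
    for (std::size_t c = 0; c < n; ++c) {      // one CUDA thread per column c
        float y = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            float& s = state[j * n + c];
            s = s * td[j] + k[j] * v[c];       // decay, then add outer product k v^T
            y += r[j] * s;                     // read out with r
        }
        out[c] = y * scale;
    }
    return out;
}
```

Calling `gla_step` once per token, in order, reproduces the sequence loop in the kernel; the final `state` corresponds to the values the kernel writes after the `T * C` outputs in `dst`.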
+3
@@ -0,0 +1,3 @@
#include "common.cuh"
void ggml_cuda_op_gated_linear_attn(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
+2 -2
@@ -73,9 +73,9 @@ void ggml_cuda_op_rwkv_wkv6(ggml_backend_cuda_context & ctx, ggml_tensor * dst)
const float * s_d = (const float *)dst->src[5]->data;
const int64_t B = dst->src[5]->ne[1];
const int64_t T = dst->src[0]->ne[3];
const int64_t T = dst->src[0]->ne[2];
const int64_t C = dst->ne[0];
const int64_t H = dst->src[0]->ne[2];
const int64_t H = dst->src[0]->ne[1];
float * dst_d = (float *)dst->data;
+11 -16
@@ -3,6 +3,8 @@
// GGML internal header
#include "ggml.h"
#include "gguf.h"
#include <assert.h>
#include <math.h>
#include <stdlib.h> // load `stdlib.h` before other headers to work around MinGW bug: https://sourceforge.net/p/mingw-w64/bugs/192/
@@ -551,22 +553,15 @@ static inline ggml_bf16_t ggml_compute_fp32_to_bf16(float s) {
#define GGML_FP32_TO_BF16(x) ggml_compute_fp32_to_bf16(x)
#define GGML_BF16_TO_FP32(x) ggml_compute_bf16_to_fp32(x)
// expose GGUF internals for test code
GGML_API size_t gguf_type_size(enum gguf_type type);
GGML_API struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_params params);
struct gguf_buf {
void * data;
size_t size;
size_t offset;
};
GGML_API struct gguf_buf gguf_buf_init(size_t size);
GGML_API void gguf_buf_free(struct gguf_buf buf);
GGML_API void gguf_write_to_buf(const struct gguf_context * ctx, struct gguf_buf * buf, bool only_meta);
#ifdef __cplusplus
}
#endif
#ifdef __cplusplus
#include <vector>
// expose GGUF internals for test code
GGML_API size_t gguf_type_size(enum gguf_type type);
GGML_API struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_params params);
GGML_API void gguf_write_to_buf(const struct gguf_context * ctx, std::vector<int8_t> & buf, bool only_meta);
#endif // __cplusplus
+20 -29
@@ -27,15 +27,6 @@
#endif
#include <cstring>
#define UNUSED GGML_UNUSED
#define GGML_DEBUG 0
#if (GGML_DEBUG >= 1)
#define GGML_PRINT_DEBUG(...) printf(__VA_ARGS__)
#else
#define GGML_PRINT_DEBUG(...)
#endif
#ifdef _WIN32
typedef SOCKET sockfd_t;
using ssize_t = __int64;
@@ -411,7 +402,7 @@ static std::shared_ptr<socket_t> get_socket(const std::string & endpoint) {
initialized = true;
}
#else
UNUSED(initialized);
GGML_UNUSED(initialized);
#endif
auto sock = socket_connect(host.c_str(), port);
if (sock == nullptr) {
@@ -640,7 +631,7 @@ static void ggml_backend_rpc_free(ggml_backend_t backend) {
}
static void ggml_backend_rpc_synchronize(ggml_backend_t backend) {
UNUSED(backend);
GGML_UNUSED(backend);
// this is no-op because we don't have any async operations
}
@@ -850,7 +841,7 @@ void rpc_server::alloc_buffer(const rpc_msg_alloc_buffer_req & request, rpc_msg_
GGML_PRINT_DEBUG("[%s] size: %" PRIu64 " -> remote_ptr: %" PRIx64 ", remote_size: %" PRIu64 "\n", __func__, request.size, response.remote_ptr, response.remote_size);
buffers.insert(buffer);
} else {
GGML_PRINT_DEBUG("[%s] size: %" PRIu64 " -> failed\n", __func__, request.size);
GGML_LOG_ERROR("[%s] size: %" PRIu64 " -> failed\n", __func__, request.size);
}
}
@@ -872,7 +863,7 @@ bool rpc_server::buffer_get_base(const rpc_msg_buffer_get_base_req & request, rp
GGML_PRINT_DEBUG("[%s] remote_ptr: %" PRIx64 "\n", __func__, request.remote_ptr);
ggml_backend_buffer_t buffer = reinterpret_cast<ggml_backend_buffer_t>(request.remote_ptr);
if (buffers.find(buffer) == buffers.end()) {
GGML_PRINT_DEBUG("[%s] buffer not found\n", __func__);
GGML_LOG_ERROR("[%s] buffer not found\n", __func__);
return false;
}
void * base = ggml_backend_buffer_get_base(buffer);
@@ -884,7 +875,7 @@ bool rpc_server::free_buffer(const rpc_msg_free_buffer_req & request) {
GGML_PRINT_DEBUG("[%s] remote_ptr: %" PRIx64 "\n", __func__, request.remote_ptr);
ggml_backend_buffer_t buffer = reinterpret_cast<ggml_backend_buffer_t>(request.remote_ptr);
if (buffers.find(buffer) == buffers.end()) {
GGML_PRINT_DEBUG("[%s] buffer not found\n", __func__);
GGML_LOG_ERROR("[%s] buffer not found\n", __func__);
return false;
}
ggml_backend_buffer_free(buffer);
@@ -896,7 +887,7 @@ bool rpc_server::buffer_clear(const rpc_msg_buffer_clear_req & request) {
GGML_PRINT_DEBUG("[%s] remote_ptr: %" PRIx64 ", value: %u\n", __func__, request.remote_ptr, request.value);
ggml_backend_buffer_t buffer = reinterpret_cast<ggml_backend_buffer_t>(request.remote_ptr);
if (buffers.find(buffer) == buffers.end()) {
GGML_PRINT_DEBUG("[%s] buffer not found\n", __func__);
GGML_LOG_ERROR("[%s] buffer not found\n", __func__);
return false;
}
ggml_backend_buffer_clear(buffer, request.value);
@@ -952,7 +943,7 @@ bool rpc_server::set_tensor(const std::vector<uint8_t> & input) {
struct ggml_context * ctx = ggml_init(params);
ggml_tensor * tensor = deserialize_tensor(ctx, in_tensor);
if (tensor == nullptr) {
GGML_PRINT_DEBUG("[%s] error deserializing tensor\n", __func__);
GGML_LOG_ERROR("[%s] error deserializing tensor\n", __func__);
ggml_free(ctx);
return false;
}
@@ -1017,7 +1008,7 @@ bool rpc_server::get_tensor(const rpc_msg_get_tensor_req & request, std::vector<
struct ggml_context * ctx = ggml_init(params);
ggml_tensor * tensor = deserialize_tensor(ctx, &request.tensor);
if (tensor == nullptr) {
GGML_PRINT_DEBUG("[%s] error deserializing tensor\n", __func__);
GGML_LOG_ERROR("[%s] error deserializing tensor\n", __func__);
ggml_free(ctx);
return false;
}
@@ -1051,7 +1042,7 @@ bool rpc_server::copy_tensor(const rpc_msg_copy_tensor_req & request, rpc_msg_co
ggml_tensor * src = deserialize_tensor(ctx, &request.src);
ggml_tensor * dst = deserialize_tensor(ctx, &request.dst);
if (src == nullptr || dst == nullptr) {
GGML_PRINT_DEBUG("[%s] error deserializing tensors\n", __func__);
GGML_LOG_ERROR("[%s] error deserializing tensors\n", __func__);
ggml_free(ctx);
return false;
}
@@ -1385,14 +1376,14 @@ static void ggml_backend_rpc_device_get_memory(ggml_backend_dev_t dev, size_t *
ggml_backend_rpc_get_device_memory(ctx->endpoint.c_str(), free, total);
UNUSED(dev);
GGML_UNUSED(dev);
}
static enum ggml_backend_dev_type ggml_backend_rpc_device_get_type(ggml_backend_dev_t dev) {
// TODO: obtain value from the server
return GGML_BACKEND_DEVICE_TYPE_GPU;
UNUSED(dev);
GGML_UNUSED(dev);
}
static void ggml_backend_rpc_device_get_props(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props) {
@@ -1413,7 +1404,7 @@ static ggml_backend_t ggml_backend_rpc_device_init(ggml_backend_dev_t dev, const
return ggml_backend_rpc_init(ctx->endpoint.c_str());
UNUSED(params);
GGML_UNUSED(params);
}
static ggml_backend_buffer_type_t ggml_backend_rpc_device_get_buffer_type(ggml_backend_dev_t dev) {
@@ -1421,12 +1412,12 @@ static ggml_backend_buffer_type_t ggml_backend_rpc_device_get_buffer_type(ggml_b
return ggml_backend_rpc_buffer_type(ctx->endpoint.c_str());
UNUSED(dev);
GGML_UNUSED(dev);
}
static bool ggml_backend_rpc_device_supports_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
UNUSED(dev);
UNUSED(op);
GGML_UNUSED(dev);
GGML_UNUSED(op);
//TODO: call the remote backend and cache the results
return true;
}
@@ -1463,20 +1454,20 @@ static const struct ggml_backend_device_i ggml_backend_rpc_device_i = {
static const char * ggml_backend_rpc_reg_get_name(ggml_backend_reg_t reg) {
return "RPC";
UNUSED(reg);
GGML_UNUSED(reg);
}
static size_t ggml_backend_rpc_reg_get_device_count(ggml_backend_reg_t reg) {
return 0;
UNUSED(reg);
GGML_UNUSED(reg);
}
static ggml_backend_dev_t ggml_backend_rpc_reg_get_device(ggml_backend_reg_t reg, size_t index) {
GGML_ABORT("The RPC backend does not have enumerated devices - use ggml_backend_add_device instead");
UNUSED(reg);
UNUSED(index);
GGML_UNUSED(reg);
GGML_UNUSED(index);
}
static void * ggml_backend_rpc_get_proc_address(ggml_backend_reg_t reg, const char * name) {
@@ -1485,7 +1476,7 @@ static void * ggml_backend_rpc_get_proc_address(ggml_backend_reg_t reg, const ch
}
return NULL;
UNUSED(reg);
GGML_UNUSED(reg);
}
static const struct ggml_backend_reg_i ggml_backend_rpc_reg_i = {
+4
@@ -51,6 +51,10 @@ void ggml_sycl_host_free(void* ptr) try {
std::exit(1);
}
bool gpu_has_xmx(sycl::device &dev) {
return dev.has(sycl::aspect::ext_intel_matrix);
}
int64_t downsample_sycl_global_range(int64_t accumulate_block_num, int64_t block_size) {
const int64_t max_range = std::numeric_limits<int>::max();
int64_t sycl_down_blk_size = block_size;
+1
@@ -662,6 +662,7 @@ inline void ggml_sycl_op_bin_bcast(ggml_backend_sycl_context & ctx, const ggml_t
}
}
bool gpu_has_xmx(sycl::device &dev);
void ggml_sycl_op_flatten(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
const ggml_tensor *src1, ggml_tensor *dst,
+3 -2
@@ -158,8 +158,9 @@ static void concat_f32_sycl_non_cont(
});
}
void ggml_sycl_op_concat(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
const ggml_tensor *src1, ggml_tensor *dst) {
void ggml_sycl_op_concat(ggml_backend_sycl_context & ctx, ggml_tensor *dst) {
const ggml_tensor *src0 = dst->src[0];
const ggml_tensor *src1 = dst->src[1];
queue_ptr stream = ctx.stream();
const int32_t dim = ((int32_t *)dst->op_params)[0];
+1 -2
@@ -15,7 +15,6 @@
#include "common.hpp"
void ggml_sycl_op_concat(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
const ggml_tensor *src1, ggml_tensor *dst);
void ggml_sycl_op_concat(ggml_backend_sycl_context & ctx, ggml_tensor *dst);
#endif // GGML_SYCL_CONCAT_HPP
+3 -2
@@ -71,8 +71,9 @@ static void conv_transpose_1d_f32_f32_sycl(
});
}
void ggml_sycl_op_conv_transpose_1d(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
const ggml_tensor *src1, ggml_tensor *dst) {
void ggml_sycl_op_conv_transpose_1d(ggml_backend_sycl_context & ctx, ggml_tensor *dst) {
const ggml_tensor *src0 = dst->src[0];
const ggml_tensor *src1 = dst->src[1];
const float * src0_d = (const float *)src0->data;
const float * src1_d = (const float *)src1->data;
+1 -2
@@ -15,7 +15,6 @@
#include "common.hpp"
void ggml_sycl_op_conv_transpose_1d(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
const ggml_tensor *src1, ggml_tensor *dst);
void ggml_sycl_op_conv_transpose_1d(ggml_backend_sycl_context & ctx, ggml_tensor *dst);
#endif // GGML_SYCL_CONV_HPP
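The SYCL refactor in this comparison follows one pattern: each op entry point drops its `src0`/`src1` parameters and reads the operands from `dst->src[0]` and `dst->src[1]` instead. The sketch below shows that shape with mock types; `Tensor`, `op_flatten`, `op_sqrt_impl` and `op_sqrt` are hypothetical stand-ins, not the ggml or SYCL API.

```cpp
#include <cmath>

// Hypothetical stand-in for ggml_tensor: a value plus links to its operands.
struct Tensor {
    float value;
    const Tensor* src[2];
};

using op_fn = void (*)(const Tensor* src0, const Tensor* src1, Tensor* dst);

// Stand-in for the shared helper that still takes explicit operands.
void op_flatten(const Tensor* src0, const Tensor* src1, Tensor* dst, op_fn op) {
    op(src0, src1, dst);
}

void op_sqrt_impl(const Tensor* src0, const Tensor* /*src1*/, Tensor* dst) {
    dst->value = std::sqrt(src0->value);
}

// New-style entry point: callers pass only dst; operands come from dst->src.
void op_sqrt(Tensor* dst) {
    op_flatten(dst->src[0], dst->src[1], dst, op_sqrt_impl);
}
```

Keeping the operands on the destination tensor gives every op the same `(ctx, dst)` signature, which is what lets a dispatcher call them uniformly.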
+48 -48
@@ -882,149 +882,149 @@ inline void ggml_sycl_op_div(ggml_backend_sycl_context & ctx, const ggml_tensor
}
void ggml_sycl_sqrt(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_sqrt(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_sqrt);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_sqrt);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_sin(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_sin(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_sin);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_sin);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_cos(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_cos(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_cos);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_cos);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_acc(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_acc(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_acc);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_acc);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_gelu(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_gelu(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_gelu);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_gelu);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_silu(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_silu(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_silu);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_silu);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_gelu_quick(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_gelu_quick(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_gelu_quick);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_gelu_quick);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_tanh(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_tanh(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_tanh);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_tanh);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_relu(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_relu(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_relu);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_relu);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_sigmoid(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_sigmoid(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_sigmoid);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_sigmoid);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_hardsigmoid(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_hardsigmoid(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_hardsigmoid);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_hardsigmoid);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_hardswish(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_hardswish(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_hardswish);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_hardswish);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_exp(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_exp(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_exp);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_exp);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_log(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_log(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_log);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_log);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_neg(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_neg(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_neg);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_neg);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_step(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_step(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_step);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_step);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_leaky_relu(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_leaky_relu(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_leaky_relu);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_leaky_relu);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_sqr(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_sqr(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_sqr);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_sqr);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_upscale(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_upscale(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_upscale);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_upscale);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_pad(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_pad(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_pad);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_pad);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_add(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_add(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_add);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_add);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_sub(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_sub(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_sub);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_sub);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_mul(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_mul(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_mul);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_mul);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
void ggml_sycl_div(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
void ggml_sycl_div(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_div);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_div);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
+24 -24
@@ -25,52 +25,52 @@ static __dpct_inline__ float op_div(const float a, const float b) {
}
void ggml_sycl_sqrt(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_sqrt(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_sin(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_sin(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_cos(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_cos(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_acc(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_acc(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_gelu(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_gelu(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_silu(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_silu(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_gelu_quick(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_gelu_quick(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_tanh(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_tanh(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_relu(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_relu(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_sigmoid(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_sigmoid(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_hardsigmoid(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_hardsigmoid(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_hardswish(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_hardswish(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_exp(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_exp(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_log(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_log(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_neg(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_neg(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_step(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_step(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_leaky_relu(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_leaky_relu(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_sqr(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_sqr(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_upscale(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_upscale(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_pad(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_pad(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_add(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_add(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_sub(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_sub(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_mul(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_mul(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_div(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
void ggml_sycl_div(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
#endif // GGML_SYCL_ELEMENTWISE_HPP
+125 -137
@@ -54,18 +54,12 @@ static ggml_sycl_device_info ggml_sycl_init() {
GGML_ASSERT(info.device_count <= GGML_SYCL_MAX_DEVICES);
int64_t total_vram = 0;
#if defined(GGML_SYCL_FORCE_MMQ)
GGML_LOG_INFO("%s: GGML_SYCL_FORCE_MMQ: yes\n", __func__);
#else
GGML_LOG_INFO("%s: GGML_SYCL_FORCE_MMQ: no\n", __func__);
#endif
#if defined(SYCL_USE_XMX)
GGML_LOG_INFO("%s: SYCL_USE_XMX: yes\n", __func__);
#else
GGML_LOG_INFO("%s: SYCL_USE_XMX: no\n", __func__);
#endif
GGML_LOG_INFO("%s: found %d %s devices:\n", __func__, info.device_count, GGML_SYCL_NAME);
/* This is a bit misleading; reserved for later */
// #if defined(SYCL_USE_XMX)
// GGML_LOG_INFO("%s: SYCL_USE_XMX: yes\n", __func__);
// #else
// GGML_LOG_INFO("%s: SYCL_USE_XMX: no\n", __func__);
// #endif
for (int i = 0; i < info.device_count; ++i) {
info.devices[i].vmm = 0;
dpct::device_info prop;
@@ -109,11 +103,11 @@ void print_device_detail(int id, sycl::device &device, std::string device_type)
name = std::regex_replace(name, std::regex("\\(TM\\)"), "");
auto global_mem_size = prop.get_global_mem_size()/1000000;
GGML_LOG_INFO("|%2d|%19s|%39s|%7s|%7d|%8d|%5d|%6luM|%21s|\n", id, device_type.c_str(),
std::string xmx = gpu_has_xmx(device) ? "yes" : "no";
GGML_LOG_INFO("|%2d|%19s|%39s|%7s|%7d|%8d|%5d|%6luM|%21s|%14s|\n", id, device_type.c_str(),
name.c_str(), version.c_str(), prop.get_max_compute_units(),
prop.get_max_work_group_size(), prop.get_max_sub_group_size(),
global_mem_size, device.get_info<sycl::info::device::driver_version>().c_str());
global_mem_size, device.get_info<sycl::info::device::driver_version>().c_str(), xmx.c_str());
}
void ggml_backend_sycl_print_sycl_devices() {
@@ -124,16 +118,16 @@ void ggml_backend_sycl_print_sycl_devices() {
GGML_LOG_INFO(
"| | | | "
" |Max | |Max |Global | |\n");
" |Max | |Max |Global | | XMX |\n");
GGML_LOG_INFO(
"| | | | "
" |compute|Max work|sub |mem | |\n");
" |compute|Max work|sub |mem | | or |\n");
GGML_LOG_INFO(
"|ID| Device Type| "
"Name|Version|units |group |group|size | Driver version|\n");
"Name|Version|units |group |group|size | Driver version| Tensor Cores |\n");
GGML_LOG_INFO(
"|--|-------------------|---------------------------------------|------"
"-|-------|--------|-----|-------|---------------------|\n");
"-|-------|--------|-----|-------|---------------------|--------------|\n");
for (int id = 0; id < device_count; ++id) {
sycl::device device = dpct::dev_mgr::instance().get_device(id);
@@ -164,14 +158,18 @@ static void ggml_check_sycl() try {
static bool initialized = false;
if (!initialized) {
GGML_LOG_INFO("[SYCL] call ggml_check_sycl\n");
GGML_SYCL_DEBUG("[SYCL] call ggml_check_sycl\n");
g_ggml_sycl_debug = get_sycl_env("GGML_SYCL_DEBUG", 0);
GGML_LOG_INFO("%s: GGML_SYCL_DEBUG: %d\n", __func__, g_ggml_sycl_debug);
#if defined(GGML_SYCL_F16)
GGML_LOG_INFO("%s: GGML_SYCL_F16: yes\n", __func__);
GGML_LOG_INFO("GGML_SYCL_DEBUG: %d\n", g_ggml_sycl_debug);
#if defined(GGML_SYCL_FORCE_MMQ)
GGML_LOG_INFO("GGML_SYCL_FORCE_MMQ: yes\n");
#else
GGML_LOG_INFO("%s: GGML_SYCL_F16: no\n", __func__);
GGML_LOG_INFO("GGML_SYCL_FORCE_MMQ: no\n");
#endif
#if defined(GGML_SYCL_F16)
GGML_LOG_INFO("GGML_SYCL_F16: yes\n");
#else
GGML_LOG_INFO("GGML_SYCL_F16: no\n");
#endif
/* NOT REMOVE, keep it for next optimize for XMX.
@@ -1189,7 +1187,6 @@ std::unique_ptr<ggml_sycl_pool> ggml_backend_sycl_context::new_pool_for_device(q
/// kernels
typedef void (*cpy_kernel_t)(const char * cx, char * cdst);
typedef void (*ggml_sycl_func_t)(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
typedef void (*ggml_sycl_op_mul_mat_t)(
ggml_backend_sycl_context & ctx,
const ggml_tensor *src0, const ggml_tensor *src1, ggml_tensor *dst,
@@ -3171,33 +3168,33 @@ catch (sycl::exception const &exc) {
}
static void ggml_sycl_repeat(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
static void ggml_sycl_repeat(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_repeat);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_repeat);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
static void ggml_sycl_get_rows(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
static void ggml_sycl_get_rows(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_get_rows);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_get_rows);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
static void ggml_sycl_norm(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
static void ggml_sycl_norm(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_norm);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_norm);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
static void ggml_sycl_rms_norm(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
static void ggml_sycl_rms_norm(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_rms_norm);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_rms_norm);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
static void ggml_sycl_group_norm(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
static void ggml_sycl_group_norm(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_SYCL_DEBUG("call %s\n", __func__);
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_group_norm);
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_group_norm);
GGML_SYCL_DEBUG("call %s done\n", __func__);
}
@@ -3572,9 +3569,10 @@ __dpct_inline__ static void k_copy_dst_from_contiguous(
}
}
static void ggml_sycl_mul_mat_id(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
const ggml_tensor *src1,
static void ggml_sycl_mul_mat_id(ggml_backend_sycl_context & ctx,
ggml_tensor *dst) try {
const ggml_tensor *src0 = dst->src[0];
const ggml_tensor *src1 = dst->src[1];
GGML_ASSERT(!ggml_backend_buffer_is_sycl_split(src0->buffer) && "mul_mat_id does not support split buffers");
const ggml_tensor *ids = dst->src[2];
@@ -3740,12 +3738,12 @@ catch (sycl::exception const &exc) {
std::exit(1);
}
static void ggml_sycl_scale(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_scale);
static void ggml_sycl_scale(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_scale);
}
static void ggml_sycl_clamp(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_clamp);
static void ggml_sycl_clamp(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_clamp);
}
static void ggml_sycl_cpy(ggml_backend_sycl_context & ctx, const ggml_tensor *src0, const ggml_tensor *src1,
@@ -3787,7 +3785,6 @@ static void ggml_sycl_cpy(ggml_backend_sycl_context & ctx, const ggml_tensor *sr
ggml_type_name(src0->type), ggml_type_name(src1->type));
GGML_ABORT("fatal error");
}
GGML_UNUSED(dst);
}
catch (sycl::exception const &exc) {
@@ -3796,59 +3793,52 @@ catch (sycl::exception const &exc) {
std::exit(1);
}
static void ggml_sycl_dup(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
static void ggml_sycl_dup(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
// TODO: why do we pass dst as src1 here?
ggml_sycl_cpy(ctx, src0, dst, nullptr);
GGML_UNUSED(src1);
ggml_sycl_cpy(ctx, dst->src[0], dst, nullptr);
}
static void ggml_sycl_diag_mask_inf(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_diag_mask_inf);
static void ggml_sycl_diag_mask_inf(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_diag_mask_inf);
}
static void ggml_sycl_soft_max(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_soft_max);
static void ggml_sycl_soft_max(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_soft_max);
}
static void ggml_sycl_rope(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_ASSERT(ggml_is_contiguous(src0)); // TODO: this restriction is temporary until non-cont support is implemented
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_rope);
static void ggml_sycl_rope(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_ASSERT(ggml_is_contiguous(dst->src[0])); // TODO: this restriction is temporary until non-cont support is implemented
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_rope);
}
static void ggml_sycl_pool2d(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_pool2d);
static void ggml_sycl_pool2d(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_pool2d);
}
static void ggml_sycl_im2col(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_im2col);
static void ggml_sycl_im2col(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_im2col);
}
static void ggml_sycl_sum(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_ASSERT(ggml_is_contiguous(src0));
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_sum);
static void ggml_sycl_sum(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_ASSERT(ggml_is_contiguous(dst->src[0]));
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_sum);
}
static void ggml_sycl_sum_rows(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_ASSERT(ggml_is_contiguous(src0));
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_sum_rows);
static void ggml_sycl_sum_rows(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_ASSERT(ggml_is_contiguous(dst->src[0]));
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_sum_rows);
}
static void ggml_sycl_argsort(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_ASSERT(ggml_is_contiguous(src0));
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_argsort);
static void ggml_sycl_argsort(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_ASSERT(ggml_is_contiguous(dst->src[0]));
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_argsort);
}
static void ggml_sycl_argmax(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_ASSERT(ggml_is_contiguous(src0));
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_argmax);
static void ggml_sycl_argmax(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
GGML_ASSERT(ggml_is_contiguous(dst->src[0]));
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_argmax);
}
static void ggml_sycl_nop(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_UNUSED(src0);
GGML_UNUSED(src1);
GGML_UNUSED(dst);
GGML_UNUSED(ctx);
}
void ggml_sycl_set_main_device(const int main_device) try {
if (dpct::get_current_device_id() == static_cast<unsigned int> (main_device)) {
@@ -3871,191 +3861,189 @@ catch (sycl::exception const &exc) {
std::exit(1);
}
bool ggml_sycl_compute_forward(ggml_backend_sycl_context & ctx, struct ggml_tensor * tensor) {
bool ggml_sycl_compute_forward(ggml_backend_sycl_context & ctx, struct ggml_tensor * dst) {
if (!g_sycl_loaded) return false;
ggml_sycl_func_t func;
if (dst->src[0] != nullptr && ggml_backend_buffer_is_sycl_split(dst->src[0]->buffer)) {
ggml_sycl_set_peer_access(dst->src[1]->ne[1], ctx.device);
}
switch (tensor->op) {
switch (dst->op) {
case GGML_OP_ARGMAX:
func = ggml_sycl_argmax;
ggml_sycl_argmax(ctx, dst);
break;
case GGML_OP_CONV_TRANSPOSE_1D:
func = ggml_sycl_op_conv_transpose_1d;
ggml_sycl_op_conv_transpose_1d(ctx, dst);
break;
case GGML_OP_REPEAT:
func = ggml_sycl_repeat;
ggml_sycl_repeat(ctx, dst);
break;
case GGML_OP_GET_ROWS:
func = ggml_sycl_get_rows;
ggml_sycl_get_rows(ctx, dst);
break;
case GGML_OP_DUP:
func = ggml_sycl_dup;
ggml_sycl_dup(ctx, dst);
break;
case GGML_OP_ADD:
case GGML_OP_ADD1: // TODO: more efficient implementation
func = ggml_sycl_add;
ggml_sycl_add(ctx, dst);
break;
case GGML_OP_SUB:
func = ggml_sycl_sub;
ggml_sycl_sub(ctx, dst);
break;
case GGML_OP_ACC:
func = ggml_sycl_acc;
ggml_sycl_acc(ctx, dst);
break;
case GGML_OP_MUL:
func = ggml_sycl_mul;
ggml_sycl_mul(ctx, dst);
break;
case GGML_OP_LOG:
func = ggml_sycl_log;
ggml_sycl_log(ctx, dst);
break;
case GGML_OP_DIV:
func = ggml_sycl_div;
ggml_sycl_div(ctx, dst);
break;
case GGML_OP_UNARY:
switch (ggml_get_unary_op(tensor)) {
switch (ggml_get_unary_op(dst)) {
case GGML_UNARY_OP_NEG:
func = ggml_sycl_neg;
ggml_sycl_neg(ctx, dst);
break;
case GGML_UNARY_OP_STEP:
func = ggml_sycl_step;
ggml_sycl_step(ctx, dst);
break;
case GGML_UNARY_OP_GELU:
func = ggml_sycl_gelu;
ggml_sycl_gelu(ctx, dst);
break;
case GGML_UNARY_OP_SILU:
func = ggml_sycl_silu;
ggml_sycl_silu(ctx, dst);
break;
case GGML_UNARY_OP_GELU_QUICK:
func = ggml_sycl_gelu_quick;
ggml_sycl_gelu_quick(ctx, dst);
break;
case GGML_UNARY_OP_TANH:
func = ggml_sycl_tanh;
ggml_sycl_tanh(ctx, dst);
break;
case GGML_UNARY_OP_RELU:
func = ggml_sycl_relu;
ggml_sycl_relu(ctx, dst);
break;
case GGML_UNARY_OP_SIGMOID:
func = ggml_sycl_sigmoid;
ggml_sycl_sigmoid(ctx, dst);
break;
case GGML_UNARY_OP_HARDSIGMOID:
func = ggml_sycl_hardsigmoid;
ggml_sycl_hardsigmoid(ctx, dst);
break;
case GGML_UNARY_OP_HARDSWISH:
func = ggml_sycl_hardswish;
ggml_sycl_hardswish(ctx, dst);
break;
case GGML_UNARY_OP_EXP:
func = ggml_sycl_exp;
ggml_sycl_exp(ctx, dst);
break;
default:
return false;
}
break;
case GGML_OP_NORM:
func = ggml_sycl_norm;
ggml_sycl_norm(ctx, dst);
break;
case GGML_OP_GROUP_NORM:
func = ggml_sycl_group_norm;
ggml_sycl_group_norm(ctx, dst);
break;
case GGML_OP_CONCAT:
func = ggml_sycl_op_concat;
ggml_sycl_op_concat(ctx, dst);
break;
case GGML_OP_UPSCALE:
func = ggml_sycl_upscale;
ggml_sycl_upscale(ctx, dst);
break;
case GGML_OP_PAD:
func = ggml_sycl_pad;
ggml_sycl_pad(ctx, dst);
break;
case GGML_OP_LEAKY_RELU:
func = ggml_sycl_leaky_relu;
ggml_sycl_leaky_relu(ctx, dst);
break;
case GGML_OP_RMS_NORM:
func = ggml_sycl_rms_norm;
ggml_sycl_rms_norm(ctx, dst);
break;
case GGML_OP_MUL_MAT:
if (tensor->src[0]->ne[3] != tensor->src[1]->ne[3]) {
if (dst->src[0]->ne[3] != dst->src[1]->ne[3]) {
return false;
}
func = ggml_sycl_mul_mat;
/* ggml_sycl_mul_mat_id is dependent on ggml_sycl_mul_mat */
ggml_sycl_mul_mat(ctx, dst->src[0], dst->src[1], dst);
break;
case GGML_OP_MUL_MAT_ID:
if (tensor->src[0]->ne[3] != tensor->src[1]->ne[3]) {
if (dst->src[0]->ne[3] != dst->src[1]->ne[3]) {
return false;
}
func = ggml_sycl_mul_mat_id;
ggml_sycl_mul_mat_id(ctx, dst);
break;
case GGML_OP_OUT_PROD:
func = ggml_sycl_op_out_prod;
ggml_sycl_op_out_prod(ctx, dst);
break;
case GGML_OP_SCALE:
func = ggml_sycl_scale;
ggml_sycl_scale(ctx, dst);
break;
case GGML_OP_SQR:
func = ggml_sycl_sqr;
ggml_sycl_sqr(ctx, dst);
break;
case GGML_OP_SQRT:
func = ggml_sycl_sqrt;
ggml_sycl_sqrt(ctx, dst);
break;
case GGML_OP_SIN:
func = ggml_sycl_sin;
ggml_sycl_sin(ctx, dst);
break;
case GGML_OP_COS:
func = ggml_sycl_cos;
ggml_sycl_cos(ctx, dst);
break;
case GGML_OP_CLAMP:
func = ggml_sycl_clamp;
ggml_sycl_clamp(ctx, dst);
break;
case GGML_OP_CPY:
func = ggml_sycl_cpy;
ggml_sycl_cpy(ctx, dst->src[0], dst->src[1], dst);
break;
case GGML_OP_CONT:
func = ggml_sycl_dup;
ggml_sycl_dup(ctx, dst);
break;
case GGML_OP_NONE:
case GGML_OP_RESHAPE:
case GGML_OP_VIEW:
case GGML_OP_PERMUTE:
case GGML_OP_TRANSPOSE:
func = ggml_sycl_nop;
GGML_SYCL_DEBUG("%s: Tensor NO-OP\n", __func__);
break;
case GGML_OP_DIAG_MASK_INF:
func = ggml_sycl_diag_mask_inf;
ggml_sycl_diag_mask_inf(ctx, dst);
break;
case GGML_OP_SOFT_MAX:
func = ggml_sycl_soft_max;
ggml_sycl_soft_max(ctx, dst);
break;
case GGML_OP_ROPE:
func = ggml_sycl_rope;
ggml_sycl_rope(ctx, dst);
break;
case GGML_OP_IM2COL:
func = ggml_sycl_im2col;
ggml_sycl_im2col(ctx, dst);
break;
case GGML_OP_POOL_2D:
func = ggml_sycl_pool2d;
ggml_sycl_pool2d(ctx, dst);
break;
case GGML_OP_SUM:
func = ggml_sycl_sum;
ggml_sycl_sum(ctx, dst);
break;
case GGML_OP_SUM_ROWS:
func = ggml_sycl_sum_rows;
ggml_sycl_sum_rows(ctx, dst);
break;
case GGML_OP_ARGSORT:
func = ggml_sycl_argsort;
ggml_sycl_argsort(ctx, dst);
break;
case GGML_OP_TIMESTEP_EMBEDDING:
func = ggml_sycl_op_timestep_embedding;
ggml_sycl_op_timestep_embedding(ctx, dst);
break;
case GGML_OP_RWKV_WKV6:
func = ggml_sycl_op_rwkv_wkv6;
ggml_sycl_op_rwkv_wkv6(ctx, dst);
break;
default:
return false;
}
if (tensor->src[0] != nullptr && ggml_backend_buffer_is_sycl_split(tensor->src[0]->buffer)) {
ggml_sycl_set_peer_access(tensor->src[1]->ne[1], ctx.device);
}
func(ctx, tensor->src[0], tensor->src[1], tensor);
return true;
}
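A minimal sketch of the dispatch change in the hunk above, using hypothetical toy types (toy_tensor, toy_op, toy_compute_forward) rather than the real ggml definitions. It assumes only what the diff shows: each handler takes dst alone and reads its operands from dst->src[], and the switch calls the handler directly instead of storing a function pointer and invoking it after the switch.

```cpp
#include <cassert>

// Hypothetical stand-ins for the ggml types; not the real definitions.
enum toy_op { TOY_OP_NONE, TOY_OP_NEG, TOY_OP_ADD, TOY_OP_UNKNOWN };

struct toy_tensor {
    toy_op             op;
    float              value;
    const toy_tensor * src[2];
};

// Each handler receives only dst and looks up its own operands.
static void toy_neg(toy_tensor * dst) {
    dst->value = -dst->src[0]->value;
}

static void toy_add(toy_tensor * dst) {
    dst->value = dst->src[0]->value + dst->src[1]->value;
}

// Direct-call dispatch; returns false for ops this backend does not handle.
static bool toy_compute_forward(toy_tensor * dst) {
    switch (dst->op) {
        case TOY_OP_NEG:
            toy_neg(dst);
            break;
        case TOY_OP_ADD:
            toy_add(dst);
            break;
        case TOY_OP_NONE:
            // No-op: nothing to compute, but the op counts as handled.
            break;
        default:
            return false;
    }
    return true;
}
```

With this shape, ops that need a different signature (as ggml_sycl_mul_mat and ggml_sycl_cpy do in the diff) can be called with explicit arguments inside their own case, which a single shared function-pointer type would not allow.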
+3 -3
@@ -3,9 +3,9 @@
#include "outprod.hpp"
void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, const ggml_tensor* src0,
const ggml_tensor* src1, ggml_tensor* dst) {
void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, ggml_tensor* dst) {
const ggml_tensor *src0 = dst->src[0];
const ggml_tensor *src1 = dst->src[1];
GGML_ASSERT(src0->type == GGML_TYPE_F32);
GGML_ASSERT(src1->type == GGML_TYPE_F32);
+1 -2
View File
@@ -3,8 +3,7 @@
#include "common.hpp"
void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, const ggml_tensor* src0,
const ggml_tensor* src1, ggml_tensor* dst);
void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, ggml_tensor* dst);
#endif // GGML_SYCL_OUTPROD_HPP
+3 -2
View File
@@ -55,8 +55,9 @@ static void timestep_embedding_f32_sycl(
});
}
void ggml_sycl_op_timestep_embedding(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
const ggml_tensor *src1, ggml_tensor * dst) {
void ggml_sycl_op_timestep_embedding(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
const ggml_tensor *src0 = dst->src[0];
const ggml_tensor *src1 = dst->src[1];
const float * src0_d = (const float *)src0->data;
float * dst_d = (float *)dst->data;
dpct::queue_ptr stream = ctx.stream();
+1 -2
View File
@@ -15,7 +15,6 @@
#include "common.hpp"
void ggml_sycl_op_timestep_embedding(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
const ggml_tensor *src1, ggml_tensor * dst);
void ggml_sycl_op_timestep_embedding(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
#endif // GGML_SYCL_TSEMBD_HPP
+6 -4
View File
@@ -95,8 +95,10 @@ static void rwkv_wkv_f32_kernel(
}
}
void ggml_sycl_op_rwkv_wkv6(ggml_backend_sycl_context& ctx, const ggml_tensor* src0,
const ggml_tensor* src1, ggml_tensor* dst) {
void ggml_sycl_op_rwkv_wkv6(ggml_backend_sycl_context& ctx, ggml_tensor* dst) {
const ggml_tensor *src0 = dst->src[0];
const ggml_tensor *src1 = dst->src[1];
const float* k_d = (const float*)dst->src[0]->data;
const float* v_d = (const float*)dst->src[1]->data;
@@ -107,9 +109,9 @@ void ggml_sycl_op_rwkv_wkv6(ggml_backend_sycl_context& ctx, const ggml_tensor* s
float* dst_d = (float*)dst->data;
const int64_t B = dst->src[5]->ne[1];
const int64_t T = dst->src[0]->ne[3];
const int64_t T = dst->src[0]->ne[2];
const int64_t C = dst->ne[0];
const int64_t H = dst->src[0]->ne[2];
const int64_t H = dst->src[0]->ne[1];
GGML_ASSERT(dst->src[5]->type == GGML_TYPE_F32);
GGML_ASSERT(C % H == 0);
+1 -2
View File
@@ -3,8 +3,7 @@
#include "common.hpp"
void ggml_sycl_op_rwkv_wkv6(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
const ggml_tensor *src1, ggml_tensor * dst);
void ggml_sycl_op_rwkv_wkv6(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
#endif // GGML_SYCL_WKV6_HPP
+19 -1
View File
@@ -8,6 +8,20 @@ if (Vulkan_FOUND)
../../include/ggml-vulkan.h
)
# Compile a test shader to determine whether GL_KHR_cooperative_matrix is supported.
# If it's not, there will be an error to stderr.
# If it's supported, set a define to indicate that we should compile those shaders
execute_process(COMMAND ${Vulkan_GLSLC_EXECUTABLE} -o - -fshader-stage=compute --target-env=vulkan1.3 "${CMAKE_CURRENT_SOURCE_DIR}/vulkan-shaders/test_coopmat_support.comp"
OUTPUT_VARIABLE glslc_output
ERROR_VARIABLE glslc_error)
if (${glslc_error} MATCHES ".*extension not supported: GL_KHR_cooperative_matrix.*")
message(STATUS "GL_KHR_cooperative_matrix not supported by glslc")
else()
message(STATUS "GL_KHR_cooperative_matrix supported by glslc")
add_compile_definitions(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
endif()
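A sketch of how a compile definition like the one set above can gate one branch of an if/else chain on the C++ side. The macro TOY_COOPMAT_GLSLC_SUPPORT and the function toy_select_matmul_path are hypothetical stand-ins, and the final fp32 fallback is an assumption, since the diff does not show the last branch.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for GGML_VULKAN_COOPMAT_GLSLC_SUPPORT. Comment it out
// to simulate a glslc that cannot compile cooperative-matrix shaders.
#define TOY_COOPMAT_GLSLC_SUPPORT

static std::string toy_select_matmul_path(bool coopmat_support, bool fp16) {
    (void) coopmat_support; // unused when the guarded branch is compiled out
#if defined(TOY_COOPMAT_GLSLC_SUPPORT)
    if (coopmat_support) {
        return "coopmat";
    } else
#endif
    // The trailing "else" above binds to this "if" only when the guard is
    // active; otherwise this "if" simply starts the chain.
    if (fp16) {
        return "fp16";
    } else {
        return "fp32"; // assumed fallback, not shown in the diff
    }
}
```

Placing the #endif after the dangling else keeps the remaining chain identical in both configurations, so the cooperative-matrix branch disappears at compile time without duplicating the fallback code.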
# Compile a test shader to determine whether GL_NV_cooperative_matrix2 is supported.
# If it's not, there will be an error to stderr.
# If it's supported, set a define to indicate that we should compile those shaders
@@ -69,11 +83,15 @@ if (Vulkan_FOUND)
file(GLOB _ggml_vk_shader_deps "${_ggml_vk_input_dir}/*.comp")
if (NOT CMAKE_CROSSCOMPILING)
set(_ggml_vk_genshaders_cmd "$<TARGET_FILE_DIR:vulkan-shaders-gen>/${_ggml_vk_genshaders_cmd}")
endif ()
add_custom_command(
OUTPUT ${_ggml_vk_header}
${_ggml_vk_source}
COMMAND "$<TARGET_FILE_DIR:vulkan-shaders-gen>/${_ggml_vk_genshaders_cmd}"
COMMAND ${_ggml_vk_genshaders_cmd}
--glslc ${Vulkan_GLSLC_EXECUTABLE}
--input-dir ${_ggml_vk_input_dir}
--output-dir ${_ggml_vk_output_dir}
+18 -6
View File
@@ -1645,6 +1645,7 @@ static void ggml_vk_load_shaders(vk_device& device) {
#undef CREATE_MM2
} else
#endif // defined(VK_NV_cooperative_matrix2) && defined(GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
#if defined(VK_KHR_cooperative_matrix) && defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
if (device->coopmat_support) {
// Create 6 variants, {s,m,l}x{unaligned,aligned}
#define CREATE_MM(PIPELINE_NAME, NAMELC, F16ACC, WG_DENOMS, WARPTILE, PUSHCONST, PARAMCOUNT, ID) \
@@ -1739,7 +1740,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
}
#undef CREATE_MM2
#undef CREATE_MM
} else if (device->fp16) {
} else
#endif // defined(VK_KHR_cooperative_matrix) && defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
if (device->fp16) {
// Create 6 variants, {s,m,l}x{unaligned,aligned}
#define CREATE_MM(PIPELINE_NAME, NAMELC, F16ACC, WG_DENOMS, WARPTILE, PUSHCONST, PARAMCOUNT, ID) \
if (device->mul_mat ## ID ## _l) \
@@ -2242,6 +2245,7 @@ static vk_device ggml_vk_get_device(size_t idx) {
last_struct = (VkBaseOutStructure *)&subgroup_size_control_features;
}
#if defined(VK_KHR_cooperative_matrix)
VkPhysicalDeviceCooperativeMatrixFeaturesKHR coopmat_features;
coopmat_features.pNext = nullptr;
coopmat_features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_FEATURES_KHR;
@@ -2251,6 +2255,7 @@ static vk_device ggml_vk_get_device(size_t idx) {
last_struct->pNext = (VkBaseOutStructure *)&coopmat_features;
last_struct = (VkBaseOutStructure *)&coopmat_features;
}
#endif
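The last_struct bookkeeping in this hunk follows the usual pattern for building a pNext chain: keep a pointer to the tail, link each optional feature struct behind it, then advance the tail. Below is a minimal sketch with hypothetical toy types (toy_base, toy_coopmat_features, toy_chain_append). The real code uses unrelated C structs and casts them to VkBaseOutStructure *, whereas this sketch uses inheritance to avoid the casts.

```cpp
#include <cassert>

// Hypothetical miniature of a Vulkan-style extensible struct header.
struct toy_base {
    int        sType;
    toy_base * pNext;
};

// Hypothetical feature struct; it inherits the header so no casts are needed.
struct toy_coopmat_features : toy_base {
    bool cooperativeMatrix = false;
};

// Link node after last and return the new tail of the chain.
static toy_base * toy_chain_append(toy_base * last, toy_base * node) {
    node->pNext = nullptr;
    last->pNext = node;
    return node;
}

// Count the structs reachable from head, head included.
static int toy_chain_length(const toy_base * head) {
    int n = 0;
    for (const toy_base * p = head; p != nullptr; p = p->pNext) {
        ++n;
    }
    return n;
}
```

Because each optional struct is appended only when its feature is detected, wrapping one append in a preprocessor guard (as the diff does for VK_KHR_cooperative_matrix) leaves the rest of the chain untouched.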
#if defined(VK_NV_cooperative_matrix2)
VkPhysicalDeviceCooperativeMatrix2FeaturesNV coopmat2_features {};
@@ -2272,6 +2277,7 @@ static vk_device ggml_vk_get_device(size_t idx) {
if (device->subgroup_size_control) {
device->subgroup_min_size = subgroup_size_control_props.minSubgroupSize;
device->subgroup_max_size = subgroup_size_control_props.maxSubgroupSize;
device_extensions.push_back("VK_EXT_subgroup_size_control");
}
device->subgroup_size_control = device->subgroup_size_control &&
@@ -2280,10 +2286,11 @@ static vk_device ggml_vk_get_device(size_t idx) {
if (device->subgroup_size_control) {
device->subgroup_require_full_support = subgroup_size_control_features.computeFullSubgroups;
device_extensions.push_back("VK_EXT_subgroup_size_control");
}
#if defined(VK_KHR_cooperative_matrix)
device->coopmat_support = device->coopmat_support && coopmat_features.cooperativeMatrix;
#endif
if (coopmat2_support) {
#if defined(VK_NV_cooperative_matrix2) && defined(GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
@@ -2376,6 +2383,7 @@ static vk_device ggml_vk_get_device(size_t idx) {
device_extensions.push_back("VK_KHR_shader_float16_int8");
}
#if defined(VK_KHR_cooperative_matrix)
if (device->coopmat_support) {
// Query supported shapes
std::vector<VkCooperativeMatrixPropertiesKHR> cm_props;
@@ -2442,7 +2450,7 @@ static vk_device ggml_vk_get_device(size_t idx) {
if (device->coopmat_support) {
device_extensions.push_back("VK_KHR_cooperative_matrix");
}
#endif
device->name = GGML_VK_NAME + std::to_string(idx);
device_create_info = {
@@ -2553,9 +2561,11 @@ static void ggml_vk_print_gpu_info(size_t idx) {
fp16_storage = true;
} else if (strcmp("VK_KHR_shader_float16_int8", properties.extensionName) == 0) {
fp16_compute = true;
} else if (strcmp("VK_KHR_cooperative_matrix", properties.extensionName) == 0 &&
#if defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
} else if (strcmp("VK_KHR_cooperative_matrix", properties.extensionName) == 0 &&
!getenv("GGML_VK_DISABLE_COOPMAT")) {
coopmat_support = true;
#endif
#if defined(GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
} else if (strcmp("VK_NV_cooperative_matrix2", properties.extensionName) == 0 &&
!getenv("GGML_VK_DISABLE_COOPMAT2")) {
@@ -2593,6 +2603,7 @@ static void ggml_vk_print_gpu_info(size_t idx) {
// Pointer to the last chain element
VkBaseOutStructure * last_struct = (VkBaseOutStructure *)&vk12_features;
#if defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
VkPhysicalDeviceCooperativeMatrixFeaturesKHR coopmat_features;
coopmat_features.pNext = nullptr;
coopmat_features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_FEATURES_KHR;
@@ -2608,6 +2619,7 @@ static void ggml_vk_print_gpu_info(size_t idx) {
fp16 = fp16 && vk12_features.shaderFloat16;
coopmat_support = coopmat_support && coopmat_features.cooperativeMatrix;
#endif
std::string matrix_cores = coopmat2_support ? "NV_coopmat2" : coopmat_support ? "KHR_coopmat" : "none";
@@ -5621,9 +5633,9 @@ static void ggml_vk_op_f32_rwkv6(ggml_backend_vk_context * ctx, vk_context& subc
}
static void ggml_vk_rwkv_wkv6(ggml_backend_vk_context * ctx, vk_context& subctx, ggml_tensor * dst, bool dryrun = false) {
const size_t seq_length = dst->src[0]->ne[3];
const size_t seq_length = dst->src[0]->ne[2];
const size_t n_embed = dst->ne[0];
const size_t n_heads = dst->src[0]->ne[2];
const size_t n_heads = dst->src[0]->ne[1];
const size_t n_seqs = dst->src[5]->ne[1];
ggml_vk_op_f32_rwkv6(
@@ -1,9 +1,6 @@
#version 450
#ifdef FLOAT16
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require
#endif
#extension GL_EXT_shader_explicit_arithmetic_types : require
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#include "mul_mat_vec_base.comp"
@@ -27,8 +24,8 @@ void iter(inout FLOAT_TYPE temp[NUM_COLS][NUM_ROWS], const uint first_row, const
#if K_PER_ITER == 8
#if QUANT_R == 2
const B_TYPE_VEC4 bv02 = data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs) / 4];
const B_TYPE_VEC4 bv13 = data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs + y_offset) / 4];
const vec4 bv02 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs) / 4]);
const vec4 bv13 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs + y_offset) / 4]);
const vec4 bv0 = vec4(bv02.x, bv13.x, bv02.y, bv13.y);
const vec4 bv1 = vec4(bv02.z, bv13.z, bv02.w, bv13.w);
#else
@@ -1,5 +1,5 @@
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types : require
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#include "mul_mat_vec_base.comp"
@@ -40,9 +40,9 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
f16vec2 d = data_a[ib0 + i].d;
const FLOAT_TYPE dall = d.x;
const FLOAT_TYPE dmin = d.y;
vec2 d = vec2(data_a[ib0 + i].d);
const FLOAT_TYPE dall = FLOAT_TYPE(d.x);
const FLOAT_TYPE dmin = FLOAT_TYPE(d.y);
uint32_t s0_u32 = data_a_packed32[ib0 + i].scales[s_offset / 4 + 0];
uint32_t s4_u32 = data_a_packed32[ib0 + i].scales[s_offset / 4 + 1];
@@ -63,14 +63,14 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
uvec2 qs16 = uvec2(unpack8(qs16_u16));
[[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
B_TYPE_VEC2 b0 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 0];
B_TYPE_VEC2 b16 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 8];
B_TYPE_VEC2 b32 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 16];
B_TYPE_VEC2 b48 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 24];
B_TYPE_VEC2 b64 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 32];
B_TYPE_VEC2 b80 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 40];
B_TYPE_VEC2 b96 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 48];
B_TYPE_VEC2 b112 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 56];
vec2 b0 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 0]);
vec2 b16 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 8]);
vec2 b32 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 16]);
vec2 b48 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 24]);
vec2 b64 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 32]);
vec2 b80 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 40]);
vec2 b96 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 48]);
vec2 b112 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 56]);
FLOAT_TYPE sum1 = FLOAT_TYPE(0.0);
FLOAT_TYPE sum2 = FLOAT_TYPE(0.0);
@@ -1,5 +1,5 @@
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types : require
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#include "mul_mat_vec_base.comp"
@@ -60,14 +60,14 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
[[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
B_TYPE_VEC2 b0 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 0];
B_TYPE_VEC2 b16 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 8];
B_TYPE_VEC2 b32 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 16];
B_TYPE_VEC2 b48 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 24];
B_TYPE_VEC2 b64 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 32];
B_TYPE_VEC2 b80 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 40];
B_TYPE_VEC2 b96 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 48];
B_TYPE_VEC2 b112 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 56];
vec2 b0 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 0]);
vec2 b16 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 8]);
vec2 b32 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 16]);
vec2 b48 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 24]);
vec2 b64 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 32]);
vec2 b80 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 40]);
vec2 b96 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 48]);
vec2 b112 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 56]);
FLOAT_TYPE sum = FLOAT_TYPE(0.0);
[[unroll]] for (int l = 0; l < 2; ++l) {
@@ -1,6 +1,6 @@
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types : require
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#include "mul_mat_vec_base.comp"
@@ -45,7 +45,7 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
f16vec2 d = data_a[ib0 + i].d;
vec2 d = vec2(data_a[ib0 + i].d);
const FLOAT_TYPE dall = FLOAT_TYPE(d.x);
const FLOAT_TYPE dmin = FLOAT_TYPE(d.y);
@@ -96,10 +96,10 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
const uint32_t q4_15 = qs64_hi4.w;
[[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
B_TYPE_VEC4 by10 = data_b_v4[(j*p.batch_stride_b + b_offset + y1_idx) / 4];
B_TYPE_VEC4 by132 = data_b_v4[(j*p.batch_stride_b + b_offset + y1_idx) / 4 + 8];
B_TYPE_VEC4 by20 = data_b_v4[(j*p.batch_stride_b + b_offset + y2_idx) / 4];
B_TYPE_VEC4 by232 = data_b_v4[(j*p.batch_stride_b + b_offset + y2_idx) / 4 + 8];
vec4 by10 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y1_idx) / 4 ]);
vec4 by132 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y1_idx) / 4 + 8]);
vec4 by20 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y2_idx) / 4 ]);
vec4 by232 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y2_idx) / 4 + 8]);
const FLOAT_TYPE sx = fma(FLOAT_TYPE(by10.x), q4_0, fma(FLOAT_TYPE(by10.y), q4_1, fma(FLOAT_TYPE(by10.z), q4_2, FLOAT_TYPE(by10.w) * q4_3)));
const FLOAT_TYPE sy = fma(FLOAT_TYPE(by132.x), q4_4, fma(FLOAT_TYPE(by132.y), q4_5, fma(FLOAT_TYPE(by132.z), q4_6, FLOAT_TYPE(by132.w) * q4_7)));
@@ -1,6 +1,6 @@
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types : require
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#include "mul_mat_vec_base.comp"
@@ -42,7 +42,7 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
f16vec2 d = data_a[ib0 + i].d;
vec2 d = vec2(data_a[ib0 + i].d);
const FLOAT_TYPE dall = FLOAT_TYPE(d.x);
const FLOAT_TYPE dmin = FLOAT_TYPE(d.y);
@@ -105,14 +105,14 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
const uint32_t q4_15 = qs64_80_hi4.w;
[[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
B_TYPE_VEC2 by10 = data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2];
B_TYPE_VEC2 by116 = data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 8];
B_TYPE_VEC2 by132 = data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 16];
B_TYPE_VEC2 by148 = data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 24];
B_TYPE_VEC2 by20 = data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2];
B_TYPE_VEC2 by216 = data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 8];
B_TYPE_VEC2 by232 = data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 16];
B_TYPE_VEC2 by248 = data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 24];
vec2 by10 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 ]);
vec2 by116 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 8]);
vec2 by132 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 16]);
vec2 by148 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 24]);
vec2 by20 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 ]);
vec2 by216 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 8]);
vec2 by232 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 16]);
vec2 by248 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 24]);
const FLOAT_TYPE sx =
fma(FLOAT_TYPE(by10.x), q4_0,
@@ -1,6 +1,6 @@
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types : require
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#include "mul_mat_vec_base.comp"
@@ -77,10 +77,10 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
uvec4 q3 = uvec4(unpack8(q3_u32));
[[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
B_TYPE_VEC4 by0 = data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4];
B_TYPE_VEC4 by32 = data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 8];
B_TYPE_VEC4 by64 = data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 16];
B_TYPE_VEC4 by96 = data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 24];
vec4 by0 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 ]);
vec4 by32 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 8]);
vec4 by64 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 16]);
vec4 by96 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 24]);
FLOAT_TYPE sum = FLOAT_TYPE(0.0);
[[unroll]] for (int l = 0; l < 4; ++l) {
@@ -1,6 +1,5 @@
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require
#extension GL_EXT_control_flow_attributes : enable
layout (push_constant) uniform parameter
@@ -0,0 +1,7 @@
#version 460
#extension GL_KHR_cooperative_matrix : require
void main()
{
}
@@ -2,7 +2,10 @@
#if !defined(GGML_TYPES_COMP)
#define GGML_TYPES_COMP
#extension GL_EXT_shader_explicit_arithmetic_types : require
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int16 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require
#extension GL_EXT_shader_16bit_storage : require
#if defined(DATA_A_F32)
#define QUANT_K 1
@@ -342,9 +342,11 @@ void process_shaders() {
matmul_shaders(true, matmul_id, false, false, false);
matmul_shaders(true, matmul_id, false, false, true);
#if defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
// Coopmat, fp32acc and fp16acc
matmul_shaders(true, matmul_id, true, false, false);
matmul_shaders(true, matmul_id, true, false, true);
#endif
#if defined(GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
// Coopmat2, fp32acc and fp16acc
+52 -1285
File diff suppressed because it is too large
+1325
File diff suppressed because it is too large
+4 -4
@@ -15,13 +15,13 @@ pip install gguf
[examples/writer.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/examples/writer.py) — Generates `example.gguf` in the current directory to demonstrate generating a GGUF file. Note that this file cannot be used as a model.
[scripts/gguf_dump.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf_dump.py) — Dumps a GGUF file's metadata to the console.
[gguf/scripts/gguf_dump.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/scripts/gguf_dump.py) — Dumps a GGUF file's metadata to the console.
[scripts/gguf_set_metadata.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf_set_metadata.py) — Allows changing simple metadata values in a GGUF file by key.
[gguf/scripts/gguf_set_metadata.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/scripts/gguf_set_metadata.py) — Allows changing simple metadata values in a GGUF file by key.
[scripts/gguf_convert_endian.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf_convert_endian.py) — Allows converting the endianness of GGUF files.
[gguf/scripts/gguf_convert_endian.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/scripts/gguf_convert_endian.py) — Allows converting the endianness of GGUF files.
[scripts/gguf_new_metadata.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf_new_metadata.py) — Copies a GGUF file with added/modified/removed metadata values.
[gguf/scripts/gguf_new_metadata.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/scripts/gguf_new_metadata.py) — Copies a GGUF file with added/modified/removed metadata values.
## Development
Maintainers who participate in development of this package are advised to install it in editable mode:
+55
@@ -115,6 +115,7 @@ class Keys:
TIME_DECAY_EXTRA_DIM = "{arch}.time_decay_extra_dim"
RESIDUAL_SCALE = "{arch}.residual_scale"
EMBEDDING_SCALE = "{arch}.embedding_scale"
TOKEN_SHIFT_COUNT = "{arch}.token_shift_count"
class Attention:
HEAD_COUNT = "{arch}.attention.head_count"
@@ -244,6 +245,7 @@ class MODEL_ARCH(IntEnum):
QWEN2VL = auto()
PHI2 = auto()
PHI3 = auto()
PHIMOE = auto()
PLAMO = auto()
CODESHELL = auto()
ORION = auto()
@@ -254,6 +256,7 @@ class MODEL_ARCH(IntEnum):
GEMMA2 = auto()
STARCODER2 = auto()
RWKV6 = auto()
RWKV6QWEN2 = auto()
MAMBA = auto()
XVERSE = auto()
COMMAND_R = auto()
@@ -333,6 +336,7 @@ class MODEL_TENSOR(IntEnum):
TIME_MIX_LERP_V = auto()
TIME_MIX_LERP_R = auto()
TIME_MIX_LERP_G = auto()
TIME_MIX_LERP_FUSED = auto()
TIME_MIX_LERP_W = auto()
TIME_MIX_FIRST = auto()
TIME_MIX_DECAY = auto()
@@ -428,6 +432,7 @@ MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
MODEL_ARCH.QWEN2VL: "qwen2vl",
MODEL_ARCH.PHI2: "phi2",
MODEL_ARCH.PHI3: "phi3",
MODEL_ARCH.PHIMOE: "phimoe",
MODEL_ARCH.PLAMO: "plamo",
MODEL_ARCH.CODESHELL: "codeshell",
MODEL_ARCH.ORION: "orion",
@@ -438,6 +443,7 @@ MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
MODEL_ARCH.GEMMA2: "gemma2",
MODEL_ARCH.STARCODER2: "starcoder2",
MODEL_ARCH.RWKV6: "rwkv6",
MODEL_ARCH.RWKV6QWEN2: "rwkv6qwen2",
MODEL_ARCH.MAMBA: "mamba",
MODEL_ARCH.XVERSE: "xverse",
MODEL_ARCH.COMMAND_R: "command-r",
@@ -517,6 +523,7 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
MODEL_TENSOR.TIME_MIX_LERP_V: "blk.{bid}.time_mix_lerp_v",
MODEL_TENSOR.TIME_MIX_LERP_R: "blk.{bid}.time_mix_lerp_r",
MODEL_TENSOR.TIME_MIX_LERP_G: "blk.{bid}.time_mix_lerp_g",
MODEL_TENSOR.TIME_MIX_LERP_FUSED: "blk.{bid}.time_mix_lerp_fused",
MODEL_TENSOR.TIME_MIX_LERP_W: "blk.{bid}.time_mix_lerp_w",
MODEL_TENSOR.TIME_MIX_FIRST: "blk.{bid}.time_mix_first",
MODEL_TENSOR.TIME_MIX_DECAY: "blk.{bid}.time_mix_decay",
@@ -940,6 +947,24 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.PHIMOE: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ROPE_FACTORS_LONG,
MODEL_TENSOR.ROPE_FACTORS_SHORT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE_INP,
MODEL_TENSOR.FFN_GATE_EXP,
MODEL_TENSOR.FFN_DOWN_EXP,
MODEL_TENSOR.FFN_UP_EXP,
],
MODEL_ARCH.CODESHELL: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.POS_EMBD,
@@ -1083,6 +1108,7 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.TIME_MIX_LERP_R,
MODEL_TENSOR.TIME_MIX_LERP_G,
MODEL_TENSOR.TIME_MIX_LERP_W,
MODEL_TENSOR.TIME_MIX_LERP_FUSED,
MODEL_TENSOR.TIME_MIX_FIRST,
MODEL_TENSOR.TIME_MIX_DECAY,
MODEL_TENSOR.TIME_MIX_DECAY_W1,
@@ -1099,6 +1125,35 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.CHANNEL_MIX_RECEPTANCE,
MODEL_TENSOR.CHANNEL_MIX_VALUE,
],
MODEL_ARCH.RWKV6QWEN2: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.TIME_MIX_W1,
MODEL_TENSOR.TIME_MIX_W2,
MODEL_TENSOR.TIME_MIX_LERP_X,
MODEL_TENSOR.TIME_MIX_LERP_K,
MODEL_TENSOR.TIME_MIX_LERP_V,
MODEL_TENSOR.TIME_MIX_LERP_R,
MODEL_TENSOR.TIME_MIX_LERP_G,
MODEL_TENSOR.TIME_MIX_LERP_W,
MODEL_TENSOR.TIME_MIX_LERP_FUSED,
MODEL_TENSOR.TIME_MIX_FIRST,
MODEL_TENSOR.TIME_MIX_DECAY,
MODEL_TENSOR.TIME_MIX_DECAY_W1,
MODEL_TENSOR.TIME_MIX_DECAY_W2,
MODEL_TENSOR.TIME_MIX_KEY,
MODEL_TENSOR.TIME_MIX_VALUE,
MODEL_TENSOR.TIME_MIX_RECEPTANCE,
MODEL_TENSOR.TIME_MIX_GATE,
MODEL_TENSOR.TIME_MIX_LN,
MODEL_TENSOR.TIME_MIX_OUTPUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.MAMBA: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
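The hunks above register a new tensor enum member, a name template for it, and its membership in the per-architecture tensor lists. As a rough illustration of how a `{bid}` template expands into concrete per-block tensor names, here is a minimal sketch. `ModelTensor` and `block_tensor_names` are hypothetical stand-ins invented for this example and are not part of gguf-py; only the template strings are copied from the diff.

```python
from enum import IntEnum, auto


class ModelTensor(IntEnum):
    # Hypothetical stand-in for MODEL_TENSOR; only three members are shown.
    TIME_MIX_LERP_W = auto()
    TIME_MIX_LERP_FUSED = auto()
    TIME_MIX_FIRST = auto()


# Name templates copied from the TENSOR_NAMES entries in the diff.
TENSOR_NAMES = {
    ModelTensor.TIME_MIX_LERP_W: "blk.{bid}.time_mix_lerp_w",
    ModelTensor.TIME_MIX_LERP_FUSED: "blk.{bid}.time_mix_lerp_fused",
    ModelTensor.TIME_MIX_FIRST: "blk.{bid}.time_mix_first",
}


def block_tensor_names(tensors, n_blocks):
    """Expand each template once per block index (hypothetical helper)."""
    return [
        TENSOR_NAMES[tensor].format(bid=bid)
        for bid in range(n_blocks)
        for tensor in tensors
    ]


names = block_tensor_names([ModelTensor.TIME_MIX_LERP_FUSED], n_blocks=2)
print(names)
```

A tensor that is missing from an architecture's list would simply never be expanded, which is presumably why the new fused tensor is added to both the RWKV6 and RWKV6QWEN2 lists.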
+3
@@ -743,6 +743,9 @@ class GGUFWriter:
def add_wkv_head_size(self, size: int) -> None:
self.add_uint32(Keys.WKV.HEAD_SIZE.format(arch=self.arch), size)
def add_token_shift_count(self, count: int) -> None:
self.add_uint32(Keys.LLM.TOKEN_SHIFT_COUNT.format(arch=self.arch), count)
def add_layer_norm_eps(self, value: float) -> None:
self.add_float32(Keys.Attention.LAYERNORM_EPS.format(arch=self.arch), value)
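The new writer method formats the `{arch}.token_shift_count` template with the writer's architecture and stores the value as an unsigned 32-bit integer. The sketch below mimics that behaviour with a dictionary-backed class. `TinyWriter` is a hypothetical stand-in, not the real `GGUFWriter`, and the value `1` is an arbitrary example rather than a setting taken from the commit.

```python
# Key template copied from the Keys.LLM entry added in the diff.
TOKEN_SHIFT_COUNT = "{arch}.token_shift_count"


class TinyWriter:
    """Hypothetical stand-in for GGUFWriter that keeps metadata in a dict."""

    def __init__(self, arch):
        self.arch = arch
        self.kv = {}

    def add_uint32(self, key, value):
        # Reject values that do not fit in an unsigned 32-bit field.
        if not 0 <= value < 2**32:
            raise ValueError(f"{key}: {value} does not fit in uint32")
        self.kv[key] = value

    def add_token_shift_count(self, count):
        # Same shape as the method in the diff: format the key, then store.
        self.add_uint32(TOKEN_SHIFT_COUNT.format(arch=self.arch), count)


writer = TinyWriter("rwkv6qwen2")
writer.add_token_shift_count(1)  # arbitrary example value
print(writer.kv)
```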
+41 -22
@@ -13,7 +13,7 @@ class TensorNameMap:
"transformer.wte", # gpt2 gpt-j mpt refact qwen dbrx jais exaone
"transformer.word_embeddings", # falcon
"word_embeddings", # bloom
"model.embed_tokens", # llama-hf nemotron olmoe olmo2
"model.embed_tokens", # llama-hf nemotron olmoe olmo2 rwkv6qwen2
"tok_embeddings", # llama-pth
"embeddings.word_embeddings", # bert nomic-bert
"language_model.embedding.word_embeddings", # persimmon
@@ -55,7 +55,7 @@ class TensorNameMap:
# Output
MODEL_TENSOR.OUTPUT: (
"embed_out", # gptneox
"lm_head", # gpt2 mpt falcon llama-hf baichuan qwen mamba dbrx jais nemotron exaone olmoe olmo2
"lm_head", # gpt2 mpt falcon llama-hf baichuan qwen mamba dbrx jais nemotron exaone olmoe olmo2 phimoe
"output", # llama-pth bloom internlm2
"word_embeddings_for_head", # persimmon
"lm_head.linear", # phi2
@@ -68,7 +68,7 @@ class TensorNameMap:
MODEL_TENSOR.OUTPUT_NORM: (
"gpt_neox.final_layer_norm", # gptneox
"transformer.ln_f", # gpt2 gpt-j falcon jais exaone
"model.norm", # llama-hf baichuan internlm2 olmoe olmo2
"model.norm", # llama-hf baichuan internlm2 olmoe olmo2 phimoe
"norm", # llama-pth
"transformer.norm_f", # mpt dbrx
"ln_f", # refact bloom qwen gpt2
@@ -108,7 +108,7 @@ class TensorNameMap:
"transformer.h.{bid}.input_layernorm", # falcon7b
"h.{bid}.input_layernorm", # bloom
"transformer.h.{bid}.ln_mlp", # falcon40b
"model.layers.{bid}.input_layernorm", # llama-hf nemotron olmoe
"model.layers.{bid}.input_layernorm", # llama-hf nemotron olmoe phimoe
"layers.{bid}.attention_norm", # llama-pth
"language_model.encoder.layers.{bid}.input_layernorm", # persimmon
"model.layers.{bid}.ln1", # yi
@@ -152,7 +152,7 @@ class TensorNameMap:
# Attention query
MODEL_TENSOR.ATTN_Q: (
"model.layers.{bid}.self_attn.q_proj", # llama-hf nemotron olmoe olmo2
"model.layers.{bid}.self_attn.q_proj", # llama-hf nemotron olmoe olmo2 phimoe
"model.layers.{bid}.self_attn.q_proj_no_perm", # llama-custom
"layers.{bid}.attention.wq", # llama-pth
"encoder.layer.{bid}.attention.self.query", # bert
@@ -165,7 +165,7 @@ class TensorNameMap:
# Attention key
MODEL_TENSOR.ATTN_K: (
"model.layers.{bid}.self_attn.k_proj", # llama-hf nemotron olmoe olmo2
"model.layers.{bid}.self_attn.k_proj", # llama-hf nemotron olmoe olmo2 phimoe
"model.layers.{bid}.self_attn.k_proj_no_perm", # llama-custom
"layers.{bid}.attention.wk", # llama-pth
"encoder.layer.{bid}.attention.self.key", # bert
@@ -179,7 +179,7 @@ class TensorNameMap:
# Attention value
MODEL_TENSOR.ATTN_V: (
"model.layers.{bid}.self_attn.v_proj", # llama-hf nemotron olmoe olmo2
"model.layers.{bid}.self_attn.v_proj", # llama-hf nemotron olmoe olmo2 phimoe
"layers.{bid}.attention.wv", # llama-pth
"encoder.layer.{bid}.attention.self.value", # bert
"transformer.h.{bid}.attn.v_proj", # gpt-j
@@ -197,7 +197,7 @@ class TensorNameMap:
"transformer.blocks.{bid}.attn.out_proj", # mpt
"transformer.h.{bid}.self_attention.dense", # falcon
"h.{bid}.self_attention.dense", # bloom
"model.layers.{bid}.self_attn.o_proj", # llama-hf nemotron olmoe olmo2
"model.layers.{bid}.self_attn.o_proj", # llama-hf nemotron olmoe olmo2 phimoe
"model.layers.{bid}.self_attn.linear_attn", # deci
"layers.{bid}.attention.wo", # llama-pth
"encoder.layer.{bid}.attention.output.dense", # bert
@@ -242,7 +242,7 @@ class TensorNameMap:
"transformer.h.{bid}.ln_2", # gpt2 refact qwen jais exaone
"h.{bid}.post_attention_layernorm", # bloom
"transformer.blocks.{bid}.norm_2", # mpt
"model.layers.{bid}.post_attention_layernorm", # llama-hf nemotron olmoe
"model.layers.{bid}.post_attention_layernorm", # llama-hf nemotron olmoe phimoe
"layers.{bid}.ffn_norm", # llama-pth
"language_model.encoder.layers.{bid}.post_attention_layernorm", # persimmon
"model.layers.{bid}.ln2", # yi
@@ -265,7 +265,7 @@ class TensorNameMap:
MODEL_TENSOR.FFN_GATE_INP: (
"layers.{bid}.feed_forward.gate", # mixtral
"model.layers.{bid}.block_sparse_moe.gate", # mixtral
"model.layers.{bid}.block_sparse_moe.gate", # mixtral phimoe
"model.layers.{bid}.mlp.gate", # qwen2moe olmoe
"transformer.decoder_layer.{bid}.router", # Grok
"transformer.blocks.{bid}.ffn.router.layer", # dbrx
@@ -310,10 +310,11 @@ class TensorNameMap:
),
MODEL_TENSOR.FFN_UP_EXP: (
"layers.{bid}.feed_forward.experts.w3", # mixtral (merged)
"transformer.decoder_layer.{bid}.moe.linear_v", # Grok (merged)
"transformer.blocks.{bid}.ffn.experts.mlp.v1", # dbrx
"model.layers.{bid}.mlp.experts.up_proj", # qwen2moe olmoe (merged)
"layers.{bid}.feed_forward.experts.w3", # mixtral (merged)
"transformer.decoder_layer.{bid}.moe.linear_v", # Grok (merged)
"transformer.blocks.{bid}.ffn.experts.mlp.v1", # dbrx
"model.layers.{bid}.mlp.experts.up_proj", # qwen2moe olmoe (merged)
"model.layers.{bid}.block_sparse_moe.experts.w3", # phimoe (merged)
),
MODEL_TENSOR.FFN_UP_SHEXP: (
@@ -342,10 +343,11 @@ class TensorNameMap:
),
MODEL_TENSOR.FFN_GATE_EXP: (
"layers.{bid}.feed_forward.experts.w1", # mixtral (merged)
"transformer.decoder_layer.{bid}.moe.linear", # Grok (merged)
"transformer.blocks.{bid}.ffn.experts.mlp.w1", # dbrx
"model.layers.{bid}.mlp.experts.gate_proj", # qwen2moe olmoe (merged)
"layers.{bid}.feed_forward.experts.w1", # mixtral (merged)
"transformer.decoder_layer.{bid}.moe.linear", # Grok (merged)
"transformer.blocks.{bid}.ffn.experts.mlp.w1", # dbrx
"model.layers.{bid}.mlp.experts.gate_proj", # qwen2moe olmoe (merged)
"model.layers.{bid}.block_sparse_moe.experts.w1", # phimoe (merged)
),
MODEL_TENSOR.FFN_GATE_SHEXP: (
@@ -387,6 +389,7 @@ class TensorNameMap:
"transformer.blocks.{bid}.ffn.experts.mlp.w2", # dbrx
"model.layers.{bid}.mlp.experts.down_proj", # qwen2moe olmoe (merged)
"model.layers.{bid}.block_sparse_moe.output_linear", # granitemoe
"model.layers.{bid}.block_sparse_moe.experts.w2", # phimoe (merged)
),
MODEL_TENSOR.FFN_DOWN_SHEXP: (
@@ -461,34 +464,42 @@ class TensorNameMap:
MODEL_TENSOR.TIME_MIX_W1: (
"rwkv.blocks.{bid}.attention.time_maa_w1", # rwkv v6
"model.layers.{bid}.self_attn.time_maa_w1", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_W2: (
"rwkv.blocks.{bid}.attention.time_maa_w2", # rwkv v6
"model.layers.{bid}.self_attn.time_maa_w2", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_LERP_X: (
"rwkv.blocks.{bid}.attention.time_maa_x", # rwkv v6
"model.layers.{bid}.self_attn.time_maa_x", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_LERP_K: (
"rwkv.blocks.{bid}.attention.time_maa_k", # rwkv v6
"model.layers.{bid}.self_attn.time_maa_k", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_LERP_V: (
"rwkv.blocks.{bid}.attention.time_maa_v", # rwkv v6
"model.layers.{bid}.self_attn.time_maa_v", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_LERP_R: (
"rwkv.blocks.{bid}.attention.time_maa_r", # rwkv v6
"model.layers.{bid}.self_attn.time_maa_r", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_LERP_G: (
"rwkv.blocks.{bid}.attention.time_maa_g", # rwkv v6
"model.layers.{bid}.self_attn.time_maa_g", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_LERP_W: (
"rwkv.blocks.{bid}.attention.time_maa_w", # rwkv v6
"model.layers.{bid}.self_attn.time_maa_w", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_FIRST: (
@@ -497,30 +508,37 @@ class TensorNameMap:
MODEL_TENSOR.TIME_MIX_DECAY: (
"rwkv.blocks.{bid}.attention.time_decay", # rwkv v6
"model.layers.{bid}.self_attn.time_decay", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_DECAY_W1: (
"rwkv.blocks.{bid}.attention.time_decay_w1", # rwkv v6
"model.layers.{bid}.self_attn.time_decay_w1", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_DECAY_W2: (
"rwkv.blocks.{bid}.attention.time_decay_w2", # rwkv v6
"model.layers.{bid}.self_attn.time_decay_w2", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_KEY: (
"rwkv.blocks.{bid}.attention.key", # rwkv
"rwkv.blocks.{bid}.attention.key", # rwkv
"model.layers.{bid}.self_attn.k_proj", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_VALUE: (
"rwkv.blocks.{bid}.attention.value", # rwkv
"rwkv.blocks.{bid}.attention.value", # rwkv
"model.layers.{bid}.self_attn.v_proj", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_RECEPTANCE: (
"rwkv.blocks.{bid}.attention.receptance", # rwkv
"model.layers.{bid}.self_attn.q_proj", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_GATE: (
"rwkv.blocks.{bid}.attention.gate", # rwkv
"rwkv.blocks.{bid}.attention.gate", # rwkv
"model.layers.{bid}.self_attn.gate", # rwkv6qwen2
),
MODEL_TENSOR.TIME_MIX_LN: (
@@ -528,7 +546,8 @@ class TensorNameMap:
),
MODEL_TENSOR.TIME_MIX_OUTPUT: (
"rwkv.blocks.{bid}.attention.output", # rwkv
"rwkv.blocks.{bid}.attention.output", # rwkv
"model.layers.{bid}.self_attn.o_proj", # rwkv6qwen2
),
MODEL_TENSOR.CHANNEL_MIX_LERP_K: (
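After this change the same checkpoint name, `model.layers.{bid}.self_attn.k_proj`, appears under both `ATTN_K` and `TIME_MIX_KEY`, so the lookup has to be restricted to the tensors the target architecture actually declares. The following is a minimal sketch of that idea under stated assumptions: the tables are trimmed-down and hypothetical, `build_name_map` is not a gguf-py function, and the GGUF name `blk.{bid}.time_mix_key` is assumed rather than shown in this diff.

```python
# Hypothetical, trimmed-down tables; the real ones live in gguf-py.
MAPPINGS = {
    "ATTN_K": ("model.layers.{bid}.self_attn.k_proj",),
    "TIME_MIX_KEY": (
        "rwkv.blocks.{bid}.attention.key",
        "model.layers.{bid}.self_attn.k_proj",
    ),
}
ARCH_TENSORS = {
    "phimoe": ["ATTN_K"],
    "rwkv6qwen2": ["TIME_MIX_KEY"],
}
GGUF_NAMES = {
    "ATTN_K": "blk.{bid}.attn_k",
    "TIME_MIX_KEY": "blk.{bid}.time_mix_key",  # assumed name
}


def build_name_map(arch, n_blocks):
    """Map checkpoint tensor names to GGUF names for a single architecture."""
    table = {}
    for tensor in ARCH_TENSORS[arch]:  # only tensors this arch declares
        for template in MAPPINGS[tensor]:
            for bid in range(n_blocks):
                table[template.format(bid=bid)] = GGUF_NAMES[tensor].format(bid=bid)
    return table


source = "model.layers.0.self_attn.k_proj"
print(build_name_map("phimoe", 1)[source])
print(build_name_map("rwkv6qwen2", 1)[source])
```

Filtering by architecture first means the shared source name cannot collide, because only one of the two candidate tensors is ever registered for a given model.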
+5 -6
@@ -1,12 +1,11 @@
[tool.poetry]
name = "gguf"
version = "0.13.0"
version = "0.14.0"
description = "Read and write ML models in GGUF for GGML"
authors = ["GGML <ggml@ggml.ai>"]
packages = [
{include = "gguf"},
{include = "gguf/py.typed"},
{include = "scripts"},
]
readme = "README.md"
homepage = "https://ggml.ai"
@@ -33,7 +32,7 @@ requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
[tool.poetry.scripts]
gguf-convert-endian = "scripts:gguf_convert_endian_entrypoint"
gguf-dump = "scripts:gguf_dump_entrypoint"
gguf-set-metadata = "scripts:gguf_set_metadata_entrypoint"
gguf-new-metadata = "scripts:gguf_new_metadata_entrypoint"
gguf-convert-endian = "gguf.scripts:gguf_convert_endian_entrypoint"
gguf-dump = "gguf.scripts:gguf_dump_entrypoint"
gguf-set-metadata = "gguf.scripts:gguf_set_metadata_entrypoint"
gguf-new-metadata = "gguf.scripts:gguf_new_metadata_entrypoint"
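The script entries use the `module:attribute` form, so moving the scripts under the `gguf` package only requires changing the module half of each spec. As a simplified sketch of how such a spec could be resolved to a callable, assuming a plain attribute name with no dots after the colon, the example below uses `importlib`. `resolve_entry_point` is a hypothetical helper, and a standard-library function stands in for the real entry points so the snippet runs without gguf installed.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'package.module:attribute' spec to the object it names.

    Simplified, hypothetical helper: assumes exactly one colon and a plain
    attribute name on the right-hand side.
    """
    module_name, _, attribute = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Stand-in spec using the standard library instead of gguf.scripts.
dumps = resolve_entry_point("json:dumps")
print(dumps([1, 2]))
```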
Binary file not shown.
+1 -1
@@ -1 +1 @@
a2af72be7baf5b1f4a33d34e77e509e5e85b7cd7
c8bd0fee71dc8328d93be301bbee06bc10d30429
+18 -6
@@ -242,6 +242,10 @@ static void llama_lora_adapter_init_impl(struct llama_model & model, const char
} else {
ab_map[name].b = cur;
}
} else if (str_endswith(name, "_norm.weight")) {
// TODO: add support for norm vector
// for now, we don't really care because most adapters still work fine without it
continue;
} else {
throw std::runtime_error("LoRA tensor '" + name + "' has unexpected suffix");
}
@@ -251,6 +255,7 @@ static void llama_lora_adapter_init_impl(struct llama_model & model, const char
for (auto & it : ab_map) {
const std::string & name = it.first;
llama_lora_weight & w = it.second;
bool is_token_embd = str_endswith(name, "token_embd.weight");
if (!w.a || !w.b) {
throw std::runtime_error("LoRA tensor pair for '" + name + "' is missing one component");
@@ -259,16 +264,23 @@ static void llama_lora_adapter_init_impl(struct llama_model & model, const char
// device buft and device ctx
auto * model_tensor = llama_model_get_tensor(model, name.c_str());
if (!model_tensor) {
throw std::runtime_error("LoRA tensor '" + name + "' does not exist in base model");
throw std::runtime_error("LoRA tensor '" + name + "' does not exist in base model (hint: maybe wrong base model?)");
}
struct ggml_context * dev_ctx = ctx_for_buft(ggml_backend_buffer_get_type(model_tensor->buffer));
// validate tensor shape
if (model_tensor->ne[0] != w.a->ne[0] || model_tensor->ne[1] != w.b->ne[1]) {
throw std::runtime_error("tensor '" + name + "' has incorrect shape");
}
if (w.a->ne[1] != w.b->ne[0]) {
throw std::runtime_error("lora_a tensor is not transposed (hint: adapter from \"finetune\" example is no longer supported)");
if (is_token_embd) {
// expect B to be non-transposed, A and B are flipped; see llm_build_inp_embd()
if (model_tensor->ne[0] != w.b->ne[1] || model_tensor->ne[1] != w.a->ne[1]) {
throw std::runtime_error("tensor '" + name + "' has incorrect shape (hint: maybe wrong base model?)");
}
} else {
if (model_tensor->ne[0] != w.a->ne[0] || model_tensor->ne[1] != w.b->ne[1]) {
throw std::runtime_error("tensor '" + name + "' has incorrect shape (hint: maybe wrong base model?)");
}
if (w.a->ne[1] != w.b->ne[0]) {
throw std::runtime_error("lora_a tensor is not transposed (hint: adapter from \"finetune\" example is no longer supported)");
}
}
// save tensor to adapter
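The shape validation now branches on whether the tensor is the token embedding, where A and B are flipped and B is not transposed. The checks from the hunk above can be sketched as a plain function over shape tuples. This is an illustration only: `validate_lora_shapes` is a hypothetical name, the tuples follow ggml's `ne` ordering as used in the diff, and the example dimensions are made up.

```python
def validate_lora_shapes(name, model_ne, a_ne, b_ne):
    """Mirror the shape checks from the hunk above on plain tuples.

    Hypothetical helper: each argument after `name` is a (ne[0], ne[1]) pair.
    """
    hint = " (hint: maybe wrong base model?)"
    if name.endswith("token_embd.weight"):
        # A and B are flipped for the embedding; B is expected non-transposed.
        if model_ne[0] != b_ne[1] or model_ne[1] != a_ne[1]:
            raise ValueError(f"tensor '{name}' has incorrect shape{hint}")
    else:
        if model_ne[0] != a_ne[0] or model_ne[1] != b_ne[1]:
            raise ValueError(f"tensor '{name}' has incorrect shape{hint}")
        if a_ne[1] != b_ne[0]:
            raise ValueError("lora_a tensor is not transposed")


# Made-up dimensions: rank 4, hidden size 8, output size 12, vocabulary 20.
validate_lora_shapes("blk.0.attn_q.weight", (8, 12), (8, 4), (4, 12))
validate_lora_shapes("token_embd.weight", (8, 20), (4, 20), (4, 8))
print("shapes accepted")
```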
+7
@@ -45,6 +45,13 @@ struct llama_lora_weight {
struct ggml_tensor * a = nullptr;
struct ggml_tensor * b = nullptr;
// get actual scale based on rank and alpha
float get_scale(float alpha, float adapter_scale) {
const float rank = (float) b->ne[0];
const float scale = alpha ? adapter_scale * alpha / rank : adapter_scale;
return scale;
}
llama_lora_weight() = default;
llama_lora_weight(struct ggml_tensor * a, struct ggml_tensor * b) : a(a), b(b) {}
};
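The `get_scale` helper divides `alpha` by the adapter rank when `alpha` is non-zero and otherwise falls back to the user-supplied adapter scale. A small Python rendering of the same arithmetic is shown here as a sketch. The rank is passed explicitly instead of being read from `b->ne[0]`, and the numbers are arbitrary examples.

```python
def get_scale(alpha, adapter_scale, rank):
    """Effective LoRA scale; same arithmetic as the C++ helper in the diff.

    `rank` is passed in directly here, whereas the C++ code reads it from
    the first dimension of the B tensor.
    """
    if alpha:
        return adapter_scale * alpha / rank
    return adapter_scale


print(get_scale(alpha=32.0, adapter_scale=1.0, rank=16))
print(get_scale(alpha=0.0, adapter_scale=0.5, rank=16))
```

With `alpha` unset (zero), the adapter scale is used unchanged, so adapters that carry no alpha metadata are not silently rescaled by their rank.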
+52
@@ -27,6 +27,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_QWEN2VL, "qwen2vl" },
{ LLM_ARCH_PHI2, "phi2" },
{ LLM_ARCH_PHI3, "phi3" },
{ LLM_ARCH_PHIMOE, "phimoe" },
{ LLM_ARCH_PLAMO, "plamo" },
{ LLM_ARCH_CODESHELL, "codeshell" },
{ LLM_ARCH_ORION, "orion" },
@@ -56,6 +57,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_NEMOTRON, "nemotron" },
{ LLM_ARCH_EXAONE, "exaone" },
{ LLM_ARCH_RWKV6, "rwkv6" },
{ LLM_ARCH_RWKV6QWEN2, "rwkv6qwen2" },
{ LLM_ARCH_GRANITE, "granite" },
{ LLM_ARCH_GRANITE_MOE, "granitemoe" },
{ LLM_ARCH_CHAMELEON, "chameleon" },
@@ -105,6 +107,7 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
{ LLM_KV_TIME_DECAY_EXTRA_DIM, "%s.time_decay_extra_dim" },
{ LLM_KV_RESIDUAL_SCALE, "%s.residual_scale" },
{ LLM_KV_EMBEDDING_SCALE, "%s.embedding_scale" },
{ LLM_KV_TOKEN_SHIFT_COUNT, "%s.token_shift_count" },
{ LLM_KV_ATTENTION_HEAD_COUNT, "%s.attention.head_count" },
{ LLM_KV_ATTENTION_HEAD_COUNT_KV, "%s.attention.head_count_kv" },
@@ -584,6 +587,27 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
{ LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
},
},
{
LLM_ARCH_PHIMOE,
{
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ROPE_FACTORS_LONG, "rope_factors_long" },
{ LLM_TENSOR_ROPE_FACTORS_SHORT, "rope_factors_short" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
{ LLM_TENSOR_ATTN_QKV, "blk.%d.attn_qkv" },
{ LLM_TENSOR_ATTN_Q, "blk.%d.attn_q" },
{ LLM_TENSOR_ATTN_K, "blk.%d.attn_k" },
{ LLM_TENSOR_ATTN_V, "blk.%d.attn_v" },
{ LLM_TENSOR_ATTN_OUT, "blk.%d.attn_output" },
{ LLM_TENSOR_FFN_NORM, "blk.%d.ffn_norm" },
{ LLM_TENSOR_FFN_GATE_INP, "blk.%d.ffn_gate_inp" },
{ LLM_TENSOR_FFN_GATE_EXPS, "blk.%d.ffn_gate_exps" },
{ LLM_TENSOR_FFN_DOWN_EXPS, "blk.%d.ffn_down_exps" },
{ LLM_TENSOR_FFN_UP_EXPS, "blk.%d.ffn_up_exps" },
},
},
{
LLM_ARCH_PLAMO,
{
@@ -1144,6 +1168,7 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
{ LLM_TENSOR_TIME_MIX_LERP_V, "blk.%d.time_mix_lerp_v" },
{ LLM_TENSOR_TIME_MIX_LERP_R, "blk.%d.time_mix_lerp_r" },
{ LLM_TENSOR_TIME_MIX_LERP_G, "blk.%d.time_mix_lerp_g" },
{ LLM_TENSOR_TIME_MIX_LERP_FUSED, "blk.%d.time_mix_lerp_fused" },
{ LLM_TENSOR_TIME_MIX_FIRST, "blk.%d.time_mix_first" },
{ LLM_TENSOR_TIME_MIX_DECAY, "blk.%d.time_mix_decay" },
{ LLM_TENSOR_TIME_MIX_DECAY_W1, "blk.%d.time_mix_decay_w1" },
@@ -1161,6 +1186,32 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
{ LLM_TENSOR_CHANNEL_MIX_RECEPTANCE, "blk.%d.channel_mix_receptance" },
},
},
{
LLM_ARCH_RWKV6QWEN2,
{
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
{ LLM_TENSOR_TIME_MIX_W1, "blk.%d.time_mix_w1" },
{ LLM_TENSOR_TIME_MIX_W2, "blk.%d.time_mix_w2" },
{ LLM_TENSOR_TIME_MIX_LERP_X, "blk.%d.time_mix_lerp_x" },
{ LLM_TENSOR_TIME_MIX_LERP_FUSED, "blk.%d.time_mix_lerp_fused" },
{ LLM_TENSOR_TIME_MIX_FIRST, "blk.%d.time_mix_first" },
{ LLM_TENSOR_TIME_MIX_DECAY, "blk.%d.time_mix_decay" },
{ LLM_TENSOR_TIME_MIX_DECAY_W1, "blk.%d.time_mix_decay_w1" },
{ LLM_TENSOR_TIME_MIX_DECAY_W2, "blk.%d.time_mix_decay_w2" },
{ LLM_TENSOR_TIME_MIX_KEY, "blk.%d.time_mix_key" },
{ LLM_TENSOR_TIME_MIX_VALUE, "blk.%d.time_mix_value" },
{ LLM_TENSOR_TIME_MIX_RECEPTANCE, "blk.%d.time_mix_receptance" },
{ LLM_TENSOR_TIME_MIX_GATE, "blk.%d.time_mix_gate" },
{ LLM_TENSOR_TIME_MIX_OUTPUT, "blk.%d.time_mix_output" },
{ LLM_TENSOR_FFN_NORM, "blk.%d.ffn_norm" },
{ LLM_TENSOR_FFN_GATE, "blk.%d.ffn_gate" },
{ LLM_TENSOR_FFN_DOWN, "blk.%d.ffn_down" },
{ LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
},
},
{
LLM_ARCH_GRANITE,
{
@@ -1343,6 +1394,7 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
{LLM_TENSOR_TIME_MIX_LERP_V, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
{LLM_TENSOR_TIME_MIX_LERP_R, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
{LLM_TENSOR_TIME_MIX_LERP_G, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
{LLM_TENSOR_TIME_MIX_LERP_FUSED, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
{LLM_TENSOR_TIME_MIX_DECAY, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
{LLM_TENSOR_TIME_MIX_FIRST, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_RWKV_WKV6}},
{LLM_TENSOR_ATTN_NORM, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
+4
@@ -31,6 +31,7 @@ enum llm_arch {
LLM_ARCH_QWEN2VL,
LLM_ARCH_PHI2,
LLM_ARCH_PHI3,
LLM_ARCH_PHIMOE,
LLM_ARCH_PLAMO,
LLM_ARCH_CODESHELL,
LLM_ARCH_ORION,
@@ -60,6 +61,7 @@ enum llm_arch {
LLM_ARCH_NEMOTRON,
LLM_ARCH_EXAONE,
LLM_ARCH_RWKV6,
LLM_ARCH_RWKV6QWEN2,
LLM_ARCH_GRANITE,
LLM_ARCH_GRANITE_MOE,
LLM_ARCH_CHAMELEON,
@@ -109,6 +111,7 @@ enum llm_kv {
LLM_KV_TIME_DECAY_EXTRA_DIM,
LLM_KV_RESIDUAL_SCALE,
LLM_KV_EMBEDDING_SCALE,
LLM_KV_TOKEN_SHIFT_COUNT,
LLM_KV_ATTENTION_HEAD_COUNT,
LLM_KV_ATTENTION_HEAD_COUNT_KV,
@@ -252,6 +255,7 @@ enum llm_tensor {
LLM_TENSOR_TIME_MIX_LERP_V,
LLM_TENSOR_TIME_MIX_LERP_R,
LLM_TENSOR_TIME_MIX_LERP_G,
LLM_TENSOR_TIME_MIX_LERP_FUSED,
LLM_TENSOR_TIME_MIX_FIRST,
LLM_TENSOR_TIME_MIX_DECAY,
LLM_TENSOR_TIME_MIX_DECAY_W1,
+12 -1
@@ -35,6 +35,7 @@ static const std::map<std::string, llm_chat_template> LLM_CHAT_TEMPLATES = {
{ "mistral-v3-tekken", LLM_CHAT_TEMPLATE_MISTRAL_V3_TEKKEN },
{ "mistral-v7", LLM_CHAT_TEMPLATE_MISTRAL_V7 },
{ "phi3", LLM_CHAT_TEMPLATE_PHI_3 },
{ "phi4", LLM_CHAT_TEMPLATE_PHI_4 },
{ "falcon3", LLM_CHAT_TEMPLATE_FALCON_3 },
{ "zephyr", LLM_CHAT_TEMPLATE_ZEPHYR },
{ "monarch", LLM_CHAT_TEMPLATE_MONARCH },
@@ -73,7 +74,9 @@ llm_chat_template llm_chat_detect_template(const std::string & tmpl) {
return tmpl.find(haystack) != std::string::npos;
};
if (tmpl_contains("<|im_start|>")) {
return LLM_CHAT_TEMPLATE_CHATML;
return tmpl_contains("<|im_sep|>")
? LLM_CHAT_TEMPLATE_PHI_4
: LLM_CHAT_TEMPLATE_CHATML;
} else if (tmpl.find("mistral") == 0 || tmpl_contains("[INST]")) {
if (tmpl_contains("[SYSTEM_PROMPT]")) {
return LLM_CHAT_TEMPLATE_MISTRAL_V7;
@@ -269,6 +272,14 @@ int32_t llm_chat_apply_template(
if (add_ass) {
ss << "<|assistant|>\n";
}
} else if (tmpl == LLM_CHAT_TEMPLATE_PHI_4) {
// chatml template
for (auto message : chat) {
ss << "<|im_start|>" << message->role << "<|im_sep|>" << message->content << "<|im_end|>";
}
if (add_ass) {
ss << "<|im_start|>assistant<|im_sep|>";
}
} else if (tmpl == LLM_CHAT_TEMPLATE_FALCON_3) {
// Falcon 3
for (auto message : chat) {
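Note on the Phi-4 template: the hunks above detect Phi-4 by the presence of both `<|im_start|>` and `<|im_sep|>`, and format each message as role, separator, content, end marker with no newlines. The sketch below reproduces that behaviour outside llama.cpp. The `chat_msg_sketch` struct and both function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical message type; the real code uses llama_chat_message pointers.
struct chat_msg_sketch {
    std::string role;
    std::string content;
};

// Mirrors the detection rule: ChatML markers plus the Phi-4 separator.
bool looks_like_phi4(const std::string & tmpl) {
    return tmpl.find("<|im_start|>") != std::string::npos &&
           tmpl.find("<|im_sep|>")   != std::string::npos;
}

// Mirrors the formatting branch: no newline between role and content.
std::string format_phi4(const std::vector<chat_msg_sketch> & chat, bool add_ass) {
    std::string out;
    for (const auto & m : chat) {
        out += "<|im_start|>" + m.role + "<|im_sep|>" + m.content + "<|im_end|>";
    }
    if (add_ass) {
        // open an assistant turn for the model to complete
        out += "<|im_start|>assistant<|im_sep|>";
    }
    return out;
}
```

A template that contains only `<|im_start|>` still falls through to plain ChatML, which is why the separator check is nested inside the existing branch.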
+1
@@ -15,6 +15,7 @@ enum llm_chat_template {
LLM_CHAT_TEMPLATE_MISTRAL_V3_TEKKEN,
LLM_CHAT_TEMPLATE_MISTRAL_V7,
LLM_CHAT_TEMPLATE_PHI_3,
LLM_CHAT_TEMPLATE_PHI_4,
LLM_CHAT_TEMPLATE_FALCON_3,
LLM_CHAT_TEMPLATE_ZEPHYR,
LLM_CHAT_TEMPLATE_MONARCH,
+1 -1
@@ -52,7 +52,7 @@ uint32_t llama_hparams::n_embd_v_gqa(uint32_t il) const {
uint32_t llama_hparams::n_embd_k_s() const {
if (wkv_head_size != 0) {
// for RWKV models
return 2 * n_embd;
return token_shift_count * n_embd;
}
// TODO: maybe support other convolution strides than 1
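Note on `n_embd_k_s`: for RWKV models the per-layer token-shift state is now sized by `token_shift_count` instead of a hard-coded 2. The sketch below shows the arithmetic only. The struct `rwkv_hparams_sketch` is a hypothetical cut-down stand-in for `llama_hparams`; its default of 2 matches the default added in this diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of llama_hparams, just enough for the size formula.
struct rwkv_hparams_sketch {
    uint32_t n_embd            = 0;
    uint32_t wkv_head_size     = 0;
    uint32_t token_shift_count = 2; // default used when the GGUF key is absent
};

// Number of state elements kept per layer for token shifting.
uint32_t rwkv_n_embd_k_s(const rwkv_hparams_sketch & hp) {
    assert(hp.wkv_head_size != 0); // only meaningful for RWKV-style models
    return hp.token_shift_count * hp.n_embd;
}
```

A model that leaves the key unset keeps the previous size of `2 * n_embd`; a model that stores a count of 1 halves it.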
+1
@@ -76,6 +76,7 @@ struct llama_hparams {
uint32_t time_mix_extra_dim = 0;
uint32_t time_decay_extra_dim = 0;
uint32_t wkv_head_size = 0;
uint32_t token_shift_count = 2;
float rope_attn_factor = 1.0f;
float rope_freq_base_train;
+2 -1
@@ -1,5 +1,6 @@
#include "llama-impl.h"
#include "gguf.h"
#include "llama.h"
#include <cinttypes>
@@ -138,7 +139,7 @@ std::string gguf_kv_to_str(const struct gguf_context * ctx_gguf, int i) {
{
const enum gguf_type arr_type = gguf_get_arr_type(ctx_gguf, i);
int arr_n = gguf_get_arr_n(ctx_gguf, i);
const void * data = gguf_get_arr_data(ctx_gguf, i);
const void * data = arr_type == GGUF_TYPE_STRING ? nullptr : gguf_get_arr_data(ctx_gguf, i);
std::stringstream ss;
ss << "[";
for (int j = 0; j < arr_n; j++) {
+5 -4
@@ -18,7 +18,7 @@ const char * llama_file_version_name(llama_fver version) {
}
namespace GGUFMeta {
template <typename T, gguf_type gt_, T (*gfun)(const gguf_context *, const int)>
template <typename T, gguf_type gt_, T (*gfun)(const gguf_context *, const int64_t)>
struct GKV_Base_Type {
static constexpr gguf_type gt = gt_;
@@ -60,10 +60,11 @@ namespace GGUFMeta {
public:
static constexpr gguf_type gt = GGUF_TYPE_ARRAY;
static ArrayInfo getter(const gguf_context *ctx, const int k) {
const enum gguf_type arr_type = gguf_get_arr_type(ctx, k);
return ArrayInfo {
gguf_get_arr_type(ctx, k),
arr_type,
size_t(gguf_get_arr_n(ctx, k)),
gguf_get_arr_data(ctx, k),
arr_type == GGUF_TYPE_STRING ? nullptr : gguf_get_arr_data(ctx, k),
};
}
};
@@ -553,7 +554,7 @@ llama_model_loader::llama_model_loader(const std::string & fname, bool use_mmap,
const enum gguf_type type = gguf_get_kv_type(meta.get(), i);
const std::string type_name =
type == GGUF_TYPE_ARRAY
? format("%s[%s,%d]", gguf_type_name(type), gguf_type_name(gguf_get_arr_type(meta.get(), i)), gguf_get_arr_n(meta.get(), i))
? format("%s[%s,%zu]", gguf_type_name(type), gguf_type_name(gguf_get_arr_type(meta.get(), i)), gguf_get_arr_n(meta.get(), i))
: gguf_type_name(type);
std::string value = gguf_kv_to_str(meta.get(), i);
+18 -1
@@ -76,6 +76,7 @@ const char * llm_type_name(llm_type type) {
case MODEL_8x7B: return "8x7B";
case MODEL_8x22B: return "8x22B";
case MODEL_16x12B: return "16x12B";
case MODEL_16x3_8B: return "16x3.8B";
case MODEL_10B_128x3_66B: return "10B+128x3.66B";
case MODEL_57B_A14B: return "57B.A14B";
case MODEL_27B: return "27B";
@@ -661,6 +662,15 @@ void llm_load_hparams(llama_model_loader & ml, llama_model & model) {
throw std::runtime_error("invalid value for sliding_window");
}
} break;
case LLM_ARCH_PHIMOE:
{
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
switch (hparams.n_layer) {
case 32: model.type = e_model::MODEL_16x3_8B; break;
default: model.type = e_model::MODEL_UNKNOWN;
}
} break;
case LLM_ARCH_PLAMO:
{
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
@@ -1044,12 +1054,15 @@ void llm_load_hparams(llama_model_loader & ml, llama_model & model) {
}
} break;
case LLM_ARCH_RWKV6:
case LLM_ARCH_RWKV6QWEN2:
{
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_EPS, hparams.f_norm_eps);
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_EPS, hparams.f_norm_eps, false);
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps, false);
ml.get_key(LLM_KV_WKV_HEAD_SIZE, hparams.wkv_head_size);
ml.get_key(LLM_KV_TIME_MIX_EXTRA_DIM, hparams.time_mix_extra_dim);
ml.get_key(LLM_KV_TIME_DECAY_EXTRA_DIM, hparams.time_decay_extra_dim);
ml.get_key(LLM_KV_RESCALE_EVERY_N_LAYERS, hparams.rescale_every_n_layers, false);
ml.get_key(LLM_KV_TOKEN_SHIFT_COUNT, hparams.token_shift_count, false);
switch (hparams.n_layer) {
case 24: model.type = e_model::MODEL_1_6B; break;
@@ -1060,6 +1073,7 @@ void llm_load_hparams(llama_model_loader & ml, llama_model & model) {
default: model.type = e_model::MODEL_UNKNOWN;
} break;
case 61: model.type = e_model::MODEL_14B; break;
case 64: model.type = e_model::MODEL_32B; break;
default: model.type = e_model::MODEL_UNKNOWN;
}
} break;
@@ -2054,6 +2068,7 @@ enum llama_rope_type llama_rope_type(const struct llama_model * model) {
case LLM_ARCH_T5ENCODER:
case LLM_ARCH_JAIS:
case LLM_ARCH_RWKV6:
case LLM_ARCH_RWKV6QWEN2:
case LLM_ARCH_WAVTOKENIZER_DEC:
return LLAMA_ROPE_TYPE_NONE;
@@ -2094,6 +2109,7 @@ enum llama_rope_type llama_rope_type(const struct llama_model * model) {
case LLM_ARCH_OLMOE:
case LLM_ARCH_PHI2:
case LLM_ARCH_PHI3:
case LLM_ARCH_PHIMOE:
case LLM_ARCH_GEMMA:
case LLM_ARCH_GEMMA2:
case LLM_ARCH_STARCODER2:
@@ -2197,6 +2213,7 @@ bool llama_model_is_recurrent(const struct llama_model * model) {
switch (model->arch) {
case LLM_ARCH_MAMBA: return true;
case LLM_ARCH_RWKV6: return true;
case LLM_ARCH_RWKV6QWEN2: return true;
default: return false;
}
}
+13 -8
@@ -73,6 +73,7 @@ enum llm_type {
MODEL_8x7B,
MODEL_8x22B,
MODEL_16x12B,
MODEL_16x3_8B,
MODEL_10B_128x3_66B,
MODEL_57B_A14B,
MODEL_27B,
@@ -240,15 +241,19 @@ struct llama_layer {
struct ggml_tensor * time_mix_lerp_v = nullptr;
struct ggml_tensor * time_mix_lerp_r = nullptr;
struct ggml_tensor * time_mix_lerp_g = nullptr;
struct ggml_tensor * time_mix_lerp_fused = nullptr;
struct ggml_tensor * time_mix_first = nullptr;
struct ggml_tensor * time_mix_decay = nullptr;
struct ggml_tensor * time_mix_decay_w1 = nullptr;
struct ggml_tensor * time_mix_decay_w2 = nullptr;
struct ggml_tensor * time_mix_key = nullptr;
struct ggml_tensor * time_mix_value = nullptr;
struct ggml_tensor * time_mix_receptance = nullptr;
struct ggml_tensor * time_mix_gate = nullptr;
struct ggml_tensor * time_mix_first = nullptr;
struct ggml_tensor * time_mix_decay = nullptr;
struct ggml_tensor * time_mix_decay_w1 = nullptr;
struct ggml_tensor * time_mix_decay_w2 = nullptr;
struct ggml_tensor * time_mix_key = nullptr;
struct ggml_tensor * time_mix_key_b = nullptr;
struct ggml_tensor * time_mix_value = nullptr;
struct ggml_tensor * time_mix_value_b = nullptr;
struct ggml_tensor * time_mix_receptance = nullptr;
struct ggml_tensor * time_mix_receptance_b = nullptr;
struct ggml_tensor * time_mix_gate = nullptr;
struct ggml_tensor * time_mix_ln = nullptr;
struct ggml_tensor * time_mix_ln_b = nullptr;
+18 -16
@@ -7,14 +7,12 @@
#include <algorithm>
#include <cmath>
#include <cstring>
#include <cinttypes>
#include <fstream>
#include <mutex>
#include <thread>
#include <unordered_map>
// TODO: replace with ggml API call
#define QK_K 256
static void zeros(std::ofstream & file, size_t n) {
char zero = 0;
for (size_t i = 0; i < n; ++i) {
@@ -154,8 +152,10 @@ static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_t
if (qs.params->output_tensor_type < GGML_TYPE_COUNT) {
new_type = qs.params->output_tensor_type;
} else {
int nx = tensor->ne[0];
if (arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
const int64_t nx = tensor->ne[0];
const int64_t qk_k = ggml_blck_size(new_type);
if (arch == LLM_ARCH_FALCON || nx % qk_k != 0) {
new_type = GGML_TYPE_Q8_0;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
@@ -367,20 +367,19 @@ static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_t
// if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_S) new_type = GGML_TYPE_Q4_K;
//}
bool convert_incompatible_tensor = false;
if (new_type == GGML_TYPE_Q2_K || new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K ||
new_type == GGML_TYPE_Q5_K || new_type == GGML_TYPE_Q6_K || new_type == GGML_TYPE_IQ4_XS ||
new_type == GGML_TYPE_IQ2_XS || new_type == GGML_TYPE_IQ2_XXS || new_type == GGML_TYPE_IQ2_S ||
new_type == GGML_TYPE_IQ3_XXS || new_type == GGML_TYPE_IQ1_S || new_type == GGML_TYPE_IQ3_S ||
new_type == GGML_TYPE_IQ1_M) {
int nx = tensor->ne[0];
int ny = tensor->ne[1];
if (nx % QK_K != 0) {
LLAMA_LOG_WARN("\n\n%s : tensor cols %d x %d are not divisible by %d, required for %s", __func__, nx, ny, QK_K, ggml_type_name(new_type));
{
const int64_t nx = tensor->ne[0];
const int64_t ny = tensor->ne[1];
const int64_t qk_k = ggml_blck_size(new_type);
if (nx % qk_k != 0) {
LLAMA_LOG_WARN("\n\n%s : tensor cols %" PRId64 " x %" PRId64 " are not divisible by %" PRId64 ", required for %s", __func__, nx, ny, qk_k, ggml_type_name(new_type));
convert_incompatible_tensor = true;
} else {
++qs.n_k_quantized;
}
}
if (convert_incompatible_tensor) {
switch (new_type) {
case GGML_TYPE_TQ1_0:
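Note on the block-size check: the hunk above replaces the fixed `QK_K` (256) with the block size of the target type, so the divisibility test applies to every quantized type rather than only to a listed set. A minimal sketch of the test itself follows. The function `row_fits_blocks` is hypothetical and takes the block size as a plain argument instead of calling `ggml_blck_size`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when a row of nx elements can be split into
// whole quantization blocks of block_size elements.
bool row_fits_blocks(int64_t nx, int64_t block_size) {
    assert(block_size > 0);
    return nx % block_size == 0;
}
```

For example, a row of 4128 elements fails the check for a 256-element block but passes for a 32-element block, which is the case where the quantizer falls back to a different type.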
@@ -621,7 +620,8 @@ static void llama_model_quantize_impl(const std::string & fname_inp, const std::
qs.n_ffn_down = qs.n_ffn_gate = qs.n_ffn_up = (int)model.hparams.n_layer;
// sanity checks
// sanity checks for models that have attention layers
if (qs.n_attention_wv != 0)
{
const auto & n_head_kv_iter = model.hparams.n_head_kv_arr.begin();
// attention layers have a non-zero number of kv heads
@@ -759,6 +759,7 @@ static void llama_model_quantize_impl(const std::string & fname_inp, const std::
quantize &= name.find("time_mix_w2.weight") == std::string::npos;
quantize &= name.find("time_mix_decay_w1.weight") == std::string::npos;
quantize &= name.find("time_mix_decay_w2.weight") == std::string::npos;
quantize &= name.find("time_mix_lerp_fused.weight") == std::string::npos;
// do not quantize relative position bias (T5)
quantize &= name.find("attn_rel_b.weight") == std::string::npos;
@@ -875,7 +876,8 @@ static void llama_model_quantize_impl(const std::string & fname_inp, const std::
// update the gguf meta data as we go
gguf_set_tensor_type(ctx_outs[cur_split].get(), name.c_str(), new_type);
gguf_set_tensor_data(ctx_outs[cur_split].get(), name.c_str(), new_data, new_size);
GGML_ASSERT(gguf_get_tensor_size(ctx_outs[cur_split].get(), gguf_find_tensor(ctx_outs[cur_split].get(), name.c_str())) == new_size);
gguf_set_tensor_data(ctx_outs[cur_split].get(), name.c_str(), new_data);
// write tensor data + padding
fout.write((const char *) new_data, new_size);
+351 -95
@@ -134,11 +134,11 @@ static bool weight_buft_supported(const llama_hparams & hparams, ggml_tensor * w
const int64_t H = 123;
const int64_t n_tokens = 123;
const int64_t n_seqs = 123;
ggml_tensor * k = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, S, 1, H, n_tokens);
ggml_tensor * v = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 1, S, H, n_tokens);
ggml_tensor * r = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 1, S, H, n_tokens);
ggml_tensor * k = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
ggml_tensor * v = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
ggml_tensor * r = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
ggml_tensor * tf = w;
ggml_tensor * td = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 1, S, H, n_tokens);
ggml_tensor * td = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
ggml_tensor * state = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, S, n_seqs, S, H);
op_tensor = ggml_rwkv_wkv6(ctx, k, v, r, tf, td, state);
} break;
@@ -1212,6 +1212,50 @@ static bool llm_load_tensors(
layer.ffn_down = create_tensor(tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd }, 0);
layer.ffn_up = create_tensor(tn(LLM_TENSOR_FFN_UP, "weight", i), { n_embd, 2 * n_ff }, 0);
layer.rope_long = create_tensor(tn(LLM_TENSOR_ROPE_FACTORS_LONG, "weight", i), { n_embd_head/2 }, llama_model_loader::TENSOR_NOT_REQUIRED | (i != 0 ? llama_model_loader::TENSOR_DUPLICATED : 0));
layer.rope_short = create_tensor(tn(LLM_TENSOR_ROPE_FACTORS_SHORT, "weight", i), { n_embd_head/2 }, llama_model_loader::TENSOR_NOT_REQUIRED | (i != 0 ? llama_model_loader::TENSOR_DUPLICATED : 0));
}
} break;
case LLM_ARCH_PHIMOE:
{
const int64_t n_embd_head = n_embd / n_head;
model.tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), { n_embd, n_vocab }, 0);
// output
model.output_norm = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "weight"), { n_embd }, 0);
model.output_norm_b = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "bias"), {n_embd}, 0);
model.output = create_tensor(tn(LLM_TENSOR_OUTPUT, "weight"), { n_embd, n_vocab }, 0);
model.output_b = create_tensor(tn(LLM_TENSOR_OUTPUT, "bias"), { n_vocab }, 0);
for (int i = 0; i < n_layer; ++i) {
auto & layer = model.layers[i];
layer.attn_norm = create_tensor(tn(LLM_TENSOR_ATTN_NORM, "weight", i), { n_embd }, 0);
layer.attn_norm_b = create_tensor(tn(LLM_TENSOR_ATTN_NORM, "bias", i), { n_embd }, 0);
layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", i), { n_embd, n_embd + 2 * n_embd_gqa }, llama_model_loader::TENSOR_NOT_REQUIRED);
if (layer.wqkv == nullptr) {
layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "weight", i), {n_embd, n_embd}, 0);
layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", i), {n_embd}, 0);
layer.wk = create_tensor(tn(LLM_TENSOR_ATTN_K, "weight", i), {n_embd, n_embd_gqa}, 0);
layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", i), {n_embd_gqa}, 0);
layer.wv = create_tensor(tn(LLM_TENSOR_ATTN_V, "weight", i), {n_embd, n_embd_gqa}, 0);
layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", i), {n_embd_gqa}, 0);
}
layer.wo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "weight", i), { n_embd, n_embd }, 0);
layer.bo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), { n_embd }, 0);
layer.ffn_norm = create_tensor(tn(LLM_TENSOR_FFN_NORM, "weight", i), { n_embd }, 0);
layer.ffn_norm_b = create_tensor(tn(LLM_TENSOR_FFN_NORM, "bias", i), { n_embd }, 0);
layer.ffn_gate_inp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert}, 0);
layer.ffn_gate_exps = create_tensor(tn(LLM_TENSOR_FFN_GATE_EXPS, "weight", i), {n_embd, n_ff, n_expert}, 0);
layer.ffn_down_exps = create_tensor(tn(LLM_TENSOR_FFN_DOWN_EXPS, "weight", i), {n_ff, n_embd, n_expert}, 0);
layer.ffn_up_exps = create_tensor(tn(LLM_TENSOR_FFN_UP_EXPS, "weight", i), {n_embd, n_ff, n_expert}, 0);
layer.rope_long = create_tensor(tn(LLM_TENSOR_ROPE_FACTORS_LONG, "weight", i), { n_embd_head/2 }, llama_model_loader::TENSOR_NOT_REQUIRED | (i != 0 ? llama_model_loader::TENSOR_DUPLICATED : 0));
layer.rope_short = create_tensor(tn(LLM_TENSOR_ROPE_FACTORS_SHORT, "weight", i), { n_embd_head/2 }, llama_model_loader::TENSOR_NOT_REQUIRED | (i != 0 ? llama_model_loader::TENSOR_DUPLICATED : 0));
}
@@ -2142,11 +2186,13 @@ static bool llm_load_tensors(
layer.time_mix_w2 = create_tensor(tn(LLM_TENSOR_TIME_MIX_W2, "weight", i), {time_mix_extra_dim, n_embd, 5}, 0);
layer.time_mix_lerp_x = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_X, "weight", i), {n_embd, 1, 1}, 0);
layer.time_mix_lerp_w = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_W, "weight", i), {n_embd, 1, 1}, 0);
layer.time_mix_lerp_k = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_K, "weight", i), {n_embd, 1, 1}, 0);
layer.time_mix_lerp_v = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_V, "weight", i), {n_embd, 1, 1}, 0);
layer.time_mix_lerp_r = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_R, "weight", i), {n_embd, 1, 1}, 0);
layer.time_mix_lerp_g = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_G, "weight", i), {n_embd, 1, 1}, 0);
layer.time_mix_lerp_w = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_W, "weight", i), {n_embd, 1, 1}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.time_mix_lerp_k = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_K, "weight", i), {n_embd, 1, 1}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.time_mix_lerp_v = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_V, "weight", i), {n_embd, 1, 1}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.time_mix_lerp_r = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_R, "weight", i), {n_embd, 1, 1}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.time_mix_lerp_g = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_G, "weight", i), {n_embd, 1, 1}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.time_mix_lerp_fused = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_FUSED, "weight", i), {n_embd, 1, 1, 5}, llama_model_loader::TENSOR_NOT_REQUIRED);
GGML_ASSERT(!(layer.time_mix_lerp_fused == NULL && layer.time_mix_lerp_w == NULL));
layer.time_mix_first = create_tensor(tn(LLM_TENSOR_TIME_MIX_FIRST, "weight", i), {head_size, n_embd / head_size}, 0);
layer.time_mix_decay = create_tensor(tn(LLM_TENSOR_TIME_MIX_DECAY, "weight", i), {n_embd}, 0);
@@ -2170,6 +2216,59 @@ static bool llm_load_tensors(
}
} break;
case LLM_ARCH_RWKV6QWEN2:
{
model.tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);
model.output_norm = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd}, 0);
model.output_norm_b = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "bias"), {n_embd}, llama_model_loader::TENSOR_NOT_REQUIRED);
model.output = create_tensor(tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab}, 0);
const int time_mix_extra_dim = hparams.time_mix_extra_dim;
const int time_decay_extra_dim = hparams.time_decay_extra_dim;
const int head_size = hparams.wkv_head_size;
const int attn_hidden_size = n_embd;
const int n_head_kv = hparams.n_head_kv();
int attn_key_value_size;
if (n_head_kv == 0 || attn_hidden_size / head_size == n_head_kv) {
attn_key_value_size = attn_hidden_size;
} else {
attn_key_value_size = n_head_kv * head_size;
}
for (int i = 0; i < n_layer; ++i) {
auto & layer = model.layers[i];
layer.attn_norm = create_tensor(tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd}, 0);
layer.time_mix_w1 = create_tensor(tn(LLM_TENSOR_TIME_MIX_W1, "weight", i), {n_embd, time_mix_extra_dim * 5}, 0);
layer.time_mix_w2 = create_tensor(tn(LLM_TENSOR_TIME_MIX_W2, "weight", i), {time_mix_extra_dim, n_embd, 5}, 0);
layer.time_mix_lerp_x = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_X, "weight", i), {n_embd, 1, 1}, 0);
layer.time_mix_lerp_fused = create_tensor(tn(LLM_TENSOR_TIME_MIX_LERP_FUSED, "weight", i), {n_embd, 1, 1, 5}, 0);
layer.time_mix_first = create_tensor(tn(LLM_TENSOR_TIME_MIX_FIRST, "weight", i), {head_size, n_embd / head_size}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.time_mix_decay = create_tensor(tn(LLM_TENSOR_TIME_MIX_DECAY, "weight", i), {n_embd}, 0);
layer.time_mix_decay_w1 = create_tensor(tn(LLM_TENSOR_TIME_MIX_DECAY_W1, "weight", i), {n_embd, time_decay_extra_dim}, 0);
layer.time_mix_decay_w2 = create_tensor(tn(LLM_TENSOR_TIME_MIX_DECAY_W2, "weight", i), {time_decay_extra_dim, attn_hidden_size}, 0);
layer.time_mix_key = create_tensor(tn(LLM_TENSOR_TIME_MIX_KEY, "weight", i), {n_embd, attn_key_value_size}, 0);
layer.time_mix_value = create_tensor(tn(LLM_TENSOR_TIME_MIX_VALUE, "weight", i), {n_embd, attn_key_value_size}, 0);
layer.time_mix_receptance = create_tensor(tn(LLM_TENSOR_TIME_MIX_RECEPTANCE, "weight", i), {attn_hidden_size, n_embd}, 0);
layer.time_mix_gate = create_tensor(tn(LLM_TENSOR_TIME_MIX_GATE, "weight", i), {attn_hidden_size, n_embd}, 0);
// optional bias tensors
layer.time_mix_key_b = create_tensor(tn(LLM_TENSOR_TIME_MIX_KEY, "bias", i), {attn_key_value_size}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.time_mix_value_b = create_tensor(tn(LLM_TENSOR_TIME_MIX_VALUE, "bias", i), {attn_key_value_size}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.time_mix_receptance_b = create_tensor(tn(LLM_TENSOR_TIME_MIX_RECEPTANCE, "bias", i), {attn_hidden_size}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.time_mix_output = create_tensor(tn(LLM_TENSOR_TIME_MIX_OUTPUT, "weight", i), {n_embd, attn_hidden_size}, 0);
layer.ffn_norm = create_tensor(tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, 0);
layer.ffn_gate = create_tensor(tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff}, 0);
layer.ffn_down = create_tensor(tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd}, 0);
layer.ffn_up = create_tensor(tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff}, 0);
}
} break;
case LLM_ARCH_CHAMELEON:
{
model.tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);
@@ -2545,6 +2644,21 @@ static struct ggml_tensor * llm_build_inp_embd(
ggml_set_input(lctx.inp_tokens);
inpL = ggml_get_rows(ctx, tok_embd, lctx.inp_tokens);
// apply lora for embedding tokens if needed
for (auto & it : lctx.lora_adapters) {
struct llama_lora_weight * lora = it.first->get_weight(tok_embd);
if (lora == nullptr) {
continue;
}
const float adapter_scale = it.second;
const float scale = lora->get_scale(it.first->alpha, adapter_scale);
struct ggml_tensor * inpL_delta = ggml_scale(ctx, ggml_mul_mat(
ctx, lora->b, // non-transposed lora_b
ggml_get_rows(ctx, lora->a, lctx.inp_tokens)
), scale);
inpL = ggml_add(ctx, inpL, inpL_delta);
}
} else {
lctx.inp_embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, ubatch.n_tokens);
inpL = lctx.inp_embd;
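Note on LoRA for token embeddings: the loop above adds `scale * B * A[token]` to each looked-up embedding row, where the row of `A` is selected by token id and `B` is used non-transposed. The sketch below does the same for a single token with plain vectors. The names `matrix_sketch` and `lora_embd_row`, and the row-major layout (`a` is n_vocab x rank, `b` is n_embd x rank), are assumptions made for illustration and do not describe ggml's memory layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using matrix_sketch = std::vector<std::vector<float>>; // row-major, hypothetical

// base_row: embedding of `token` from the base model (n_embd values)
// a:        n_vocab x rank, one low-rank row per token
// b:        n_embd  x rank
std::vector<float> lora_embd_row(const std::vector<float> & base_row,
                                 const matrix_sketch & a,
                                 const matrix_sketch & b,
                                 size_t token,
                                 float scale) {
    const std::vector<float> & a_row = a.at(token); // the "get_rows" step
    assert(b.size() == base_row.size());
    std::vector<float> out = base_row;
    for (size_t e = 0; e < out.size(); ++e) {
        assert(b[e].size() == a_row.size());
        float acc = 0.0f;
        for (size_t r = 0; r < a_row.size(); ++r) {
            acc += b[e][r] * a_row[r]; // the "mul_mat" step
        }
        out[e] += scale * acc;         // scaled delta added to the base row
    }
    return out;
}
```

Only the selected row of `A` is touched, so the cost per token is proportional to `n_embd * rank` rather than to the vocabulary size.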
@@ -2617,9 +2731,8 @@ static struct ggml_tensor * llm_build_lora_mm(
if (lora == nullptr) {
continue;
}
const float alpha = it.first->alpha;
const float rank = (float) lora->b->ne[0];
const float scale = alpha ? it.second * alpha / rank : it.second;
const float adapter_scale = it.second;
const float scale = lora->get_scale(it.first->alpha, adapter_scale);
struct ggml_tensor * ab_cur = ggml_mul_mat(
ctx0, lora->b,
ggml_mul_mat(ctx0, lora->a, cur)
@@ -3279,16 +3392,20 @@ static struct ggml_tensor * llm_build_rwkv6_time_mix(
const struct llama_layer * layer,
struct ggml_tensor * cur,
struct ggml_tensor * x_prev,
struct ggml_tensor ** wkv_state) {
struct ggml_tensor ** wkv_state,
size_t wkv_head_size,
size_t head_count_kv) {
size_t n_embd = cur->ne[0];
size_t n_seq_tokens = cur->ne[1];
size_t n_seqs = cur->ne[2];
size_t head_size = layer->time_mix_first->ne[0];
size_t head_count = layer->time_mix_first->ne[1];
size_t head_size = wkv_head_size;
size_t head_count = n_embd / head_size;
size_t n_tokens = n_seqs * n_seq_tokens;
bool is_qrwkv = layer->time_mix_first == nullptr;
struct ggml_tensor * sx = ggml_sub(ctx, x_prev, cur);
sx = ggml_reshape_2d(ctx, sx, n_embd, n_tokens);
@@ -3317,69 +3434,64 @@ static struct ggml_tensor * llm_build_rwkv6_time_mix(
xxx
);
struct ggml_tensor *mw = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], 0);
struct ggml_tensor *mk = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * sizeof(float));
struct ggml_tensor *mv = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * 2 * sizeof(float));
struct ggml_tensor *mr = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * 3 * sizeof(float));
struct ggml_tensor *mg = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * 4 * sizeof(float));
struct ggml_tensor *xw, *xk, *xv, *xr, *xg;
if (layer->time_mix_lerp_fused) {
// fusing these weights makes some performance improvement
sx = ggml_reshape_3d(ctx, sx, n_embd, 1, n_tokens);
cur = ggml_reshape_3d(ctx, cur, n_embd, 1, n_tokens);
xxx = ggml_add(ctx, ggml_mul(ctx, ggml_add(ctx, xxx, layer->time_mix_lerp_fused), sx), cur);
xw = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], 0);
xk = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * sizeof(float));
xv = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * 2 * sizeof(float));
xr = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * 3 * sizeof(float));
xg = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * 4 * sizeof(float));
} else {
// for backward compatibility
xw = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], 0);
xk = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * sizeof(float));
xv = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * 2 * sizeof(float));
xr = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * 3 * sizeof(float));
xg = ggml_view_2d(ctx, xxx, n_embd, n_tokens, xxx->nb[1], n_embd * n_tokens * 4 * sizeof(float));
struct ggml_tensor * xw = ggml_add(
ctx,
ggml_mul(
ctx,
ggml_add(ctx, mw, layer->time_mix_lerp_w),
sx
),
cur
);
xw = ggml_add(ctx, ggml_mul(ctx, ggml_add(ctx, xw, layer->time_mix_lerp_w), sx), cur);
xk = ggml_add(ctx, ggml_mul(ctx, ggml_add(ctx, xk, layer->time_mix_lerp_k), sx), cur);
xv = ggml_add(ctx, ggml_mul(ctx, ggml_add(ctx, xv, layer->time_mix_lerp_v), sx), cur);
xr = ggml_add(ctx, ggml_mul(ctx, ggml_add(ctx, xr, layer->time_mix_lerp_r), sx), cur);
xg = ggml_add(ctx, ggml_mul(ctx, ggml_add(ctx, xg, layer->time_mix_lerp_g), sx), cur);
}
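Note on the lerp step: both branches above compute, for each of the five channels (w, k, v, r, g), `x_c = cur + sx * (m_c + lerp_c)` with `sx = x_prev - cur`. The fused path does this in one broadcasted add/mul/add over a stacked weight instead of five separate chains. The sketch below spells out the arithmetic for one token. The function name and the vector-of-vectors layout are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// xxx[c][i]  : data-dependent mix term for channel c
// lerp[c][i] : learned lerp weight for channel c (stacked, as in the fused tensor)
// sx[i]      : x_prev[i] - cur[i]
// cur[i]     : current token's activations
std::vector<std::vector<float>> fused_lerp_sketch(
        const std::vector<std::vector<float>> & xxx,
        const std::vector<std::vector<float>> & lerp,
        const std::vector<float> & sx,
        const std::vector<float> & cur) {
    assert(xxx.size() == lerp.size());
    assert(sx.size() == cur.size());
    std::vector<std::vector<float>> out(xxx.size(), std::vector<float>(cur.size()));
    for (size_t c = 0; c < xxx.size(); ++c) {
        assert(xxx[c].size() == cur.size() && lerp[c].size() == cur.size());
        for (size_t i = 0; i < cur.size(); ++i) {
            out[c][i] = (xxx[c][i] + lerp[c][i]) * sx[i] + cur[i];
        }
    }
    return out;
}
```

Where `sx` is zero (the previous and current token agree), every channel reduces to `cur`, which is a quick sanity check for either branch.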
struct ggml_tensor * xk = ggml_add(
ctx,
ggml_mul(
ctx,
ggml_add(ctx, mk, layer->time_mix_lerp_k),
sx
),
cur
);
struct ggml_tensor * r = llm_build_lora_mm(lctx, ctx, layer->time_mix_receptance, xr);
struct ggml_tensor * k = llm_build_lora_mm(lctx, ctx, layer->time_mix_key, xk);
struct ggml_tensor * v = llm_build_lora_mm(lctx, ctx, layer->time_mix_value, xv);
if (layer->time_mix_receptance_b) {
r = ggml_add(ctx, r, layer->time_mix_receptance_b);
}
if (layer->time_mix_key_b) {
k = ggml_add(ctx, k, layer->time_mix_key_b);
}
if (layer->time_mix_value_b) {
v = ggml_add(ctx, v, layer->time_mix_value_b);
}
struct ggml_tensor * xv = ggml_add(
ctx,
ggml_mul(
ctx,
ggml_add(ctx, mv, layer->time_mix_lerp_v),
sx
),
cur
);
struct ggml_tensor * g = llm_build_lora_mm(lctx, ctx, layer->time_mix_gate, xg);
if (is_qrwkv) {
g = ggml_sigmoid(ctx, g);
} else {
g = ggml_silu(ctx, g);
}
struct ggml_tensor * xr = ggml_add(
ctx,
ggml_mul(
ctx,
ggml_add(ctx, mr, layer->time_mix_lerp_r),
sx
),
cur
);
if (head_count_kv != head_count) {
GGML_ASSERT(head_count % head_count_kv == 0);
k = ggml_reshape_4d(ctx, k, head_size, 1, head_count_kv, n_tokens);
v = ggml_reshape_4d(ctx, v, head_size, 1, head_count_kv, n_tokens);
struct ggml_tensor * tmp = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, head_size, head_count / head_count_kv, head_count_kv, n_tokens);
k = ggml_repeat(ctx, k, tmp);
v = ggml_repeat(ctx, v, tmp);
}
struct ggml_tensor * xg = ggml_add(
ctx,
ggml_mul(
ctx,
ggml_add(ctx, mg, layer->time_mix_lerp_g),
sx
),
cur
);
struct ggml_tensor * r = ggml_reshape_4d(ctx, llm_build_lora_mm(lctx, ctx, layer->time_mix_receptance, xr), head_size, 1, head_count, n_tokens);
struct ggml_tensor * k = ggml_reshape_4d(ctx, llm_build_lora_mm(lctx, ctx, layer->time_mix_key, xk), 1, head_size, head_count, n_tokens);
struct ggml_tensor * v = ggml_reshape_4d(ctx, llm_build_lora_mm(lctx, ctx, layer->time_mix_value, xv), head_size, 1, head_count, n_tokens);
struct ggml_tensor * g = ggml_silu(
ctx,
llm_build_lora_mm(lctx, ctx, layer->time_mix_gate, xg)
);
k = ggml_reshape_3d(ctx, k, head_size, head_count, n_tokens);
v = ggml_reshape_3d(ctx, v, head_size, head_count, n_tokens);
r = ggml_reshape_3d(ctx, r, head_size, head_count, n_tokens);
struct ggml_tensor * w = ggml_mul_mat(
ctx,
@@ -3390,25 +3502,35 @@ static struct ggml_tensor * llm_build_rwkv6_time_mix(
)
);
w = ggml_add(ctx, w, ggml_reshape_1d(ctx, layer->time_mix_decay, n_embd));
w = ggml_add(ctx, w, layer->time_mix_decay);
w = ggml_exp(ctx, ggml_neg(ctx, ggml_exp(ctx, w)));
w = ggml_reshape_4d(ctx, w, 1, head_size, head_count, n_tokens);
w = ggml_reshape_3d(ctx, w, head_size, head_count, n_tokens);
k = ggml_transpose(ctx, k);
v = ggml_transpose(ctx, v);
r = ggml_transpose(ctx, r);
if (is_qrwkv) {
// k = k * (1 - w)
k = ggml_sub(ctx, k, ggml_mul(ctx, k, w));
}
struct ggml_tensor * wkv_output = ggml_rwkv_wkv6(ctx, k, v, r, layer->time_mix_first, w, *wkv_state);
struct ggml_tensor * wkv_output;
if (!layer->time_mix_first) {
wkv_output = ggml_gated_linear_attn(ctx, k, v, r, w, *wkv_state, pow(head_size, -0.5f));
} else {
wkv_output = ggml_rwkv_wkv6(ctx, k, v, r, layer->time_mix_first, w, *wkv_state);
}
cur = ggml_view_1d(ctx, wkv_output, n_embd * n_tokens, 0);
*wkv_state = ggml_view_1d(ctx, wkv_output, n_embd * head_size * n_seqs, n_embd * n_tokens * sizeof(float));
// group norm with head_count groups
cur = ggml_reshape_3d(ctx, cur, n_embd / head_count, head_count, n_tokens);
cur = ggml_norm(ctx, cur, 64e-5f);
if (!is_qrwkv) {
// group norm with head_count groups
cur = ggml_reshape_3d(ctx, cur, n_embd / head_count, head_count, n_tokens);
cur = ggml_norm(ctx, cur, 64e-5f);
// Convert back to regular vectors.
cur = ggml_reshape_2d(ctx, cur, n_embd, n_tokens);
cur = ggml_add(ctx, ggml_mul(ctx, cur, layer->time_mix_ln), layer->time_mix_ln_b);
// Convert back to regular vectors.
cur = ggml_reshape_2d(ctx, cur, n_embd, n_tokens);
cur = ggml_add(ctx, ggml_mul(ctx, cur, layer->time_mix_ln), layer->time_mix_ln_b);
} else {
cur = ggml_reshape_2d(ctx, cur, n_embd, n_tokens);
}
cur = ggml_mul(ctx, cur, g);
cur = llm_build_lora_mm(lctx, ctx, layer->time_mix_output, cur);
@@ -3967,6 +4089,7 @@ struct llm_build_context {
// feed-forward network
if (model.layers[il].ffn_gate_inp == nullptr) {
cur = llm_build_norm(ctx0, ffn_inp, hparams,
model.layers[il].ffn_norm, NULL,
LLM_NORM_RMS, cb, il);
@@ -6251,7 +6374,7 @@ struct llm_build_context {
struct ggml_tensor* attn_norm_output = llm_build_norm(ctx0, inpL, hparams,
model.layers[il].attn_norm,
NULL,
model.layers[il].attn_norm_b,
LLM_NORM_RMS, cb, il);
cb(attn_norm_output, "attn_norm", il);
@@ -6266,8 +6389,7 @@ struct llm_build_context {
Qcur = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_embd, n_tokens, cur->nb[1], 0 * sizeof(float) * (n_embd)));
Kcur = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_embd_gqa, n_tokens, cur->nb[1], 1 * sizeof(float) * (n_embd)));
Vcur = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_embd_gqa, n_tokens, cur->nb[1], 1 * sizeof(float) * (n_embd + n_embd_gqa)));
}
else {
} else {
Qcur = ggml_add(ctx0, llm_build_lora_mm(lctx, ctx0, model.layers[il].wq, attn_norm_output), model.layers[il].bq);
Kcur = ggml_add(ctx0, llm_build_lora_mm(lctx, ctx0, model.layers[il].wk, attn_norm_output), model.layers[il].bk);
Vcur = ggml_add(ctx0, llm_build_lora_mm(lctx, ctx0, model.layers[il].wv, attn_norm_output), model.layers[il].bv);
@@ -6311,14 +6433,12 @@ struct llm_build_context {
residual = cur;
cur = llm_build_norm(ctx0, cur, hparams,
model.layers[il].ffn_norm, NULL,
model.layers[il].ffn_norm, model.layers[il].ffn_norm_b,
LLM_NORM_RMS, cb, il);
cb(cur, "ffn_norm", il);
// FF
// special-case: the up and gate tensors are merged into a single tensor
// TODO: support into llm_build_ffn
{
// feed-forward network
if (model.layers[il].ffn_gate_inp == nullptr) {
cur = llm_build_ffn(ctx0, lctx, cur,
model.layers[il].ffn_up, NULL, NULL,
NULL, NULL, NULL,
@@ -6326,6 +6446,20 @@ struct llm_build_context {
NULL,
LLM_FFN_SWIGLU, LLM_FFN_SEQ, cb, il);
cb(cur, "ffn_out", il);
} else {
// MoE branch
cur = llm_build_moe_ffn(ctx0, lctx, cur,
model.layers[il].ffn_gate_inp,
model.layers[il].ffn_up_exps,
model.layers[il].ffn_gate_exps,
model.layers[il].ffn_down_exps,
nullptr,
n_expert, n_expert_used,
LLM_FFN_SILU, true,
false, 0.0,
LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX,
cb, il);
cb(cur, "ffn_moe_out", il);
}
cur = ggml_add(ctx0, residual, cur);
@@ -6338,11 +6472,16 @@ struct llm_build_context {
cur = llm_build_norm(ctx0, inpL, hparams,
model.output_norm,
NULL,
model.output_norm_b,
LLM_NORM_RMS, cb, -1);
cb(cur, "result_norm", -1);
cur = llm_build_lora_mm(lctx, ctx0, model.output, cur);
if (model.output_b != nullptr) {
cb(cur, "result_output_no_bias", -1);
cur = ggml_add(ctx0, cur, model.output_b);
}
cb(cur, "result_output", -1);
ggml_build_forward_expand(gf, cur);
@@ -9973,7 +10112,7 @@ struct llm_build_context {
1
);
cur = ggml_add(ctx0, cur, llm_build_rwkv6_time_mix(lctx, ctx0, layer, x_norm_att, x_prev, &wkv_states));
cur = ggml_add(ctx0, cur, llm_build_rwkv6_time_mix(lctx, ctx0, layer, x_norm_att, x_prev, &wkv_states, hparams.wkv_head_size, n_embd / hparams.wkv_head_size));
ggml_build_forward_expand(gf, cur);
ggml_build_forward_expand(
gf,
@@ -10040,6 +10179,118 @@ struct llm_build_context {
return gf;
}
// ref: https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1/blob/main/modeling_rwkv6qwen2.py
ggml_cgraph * build_rwkv6qwen2() {
ggml_cgraph *gf = ggml_new_graph_custom(ctx0, llama_model_max_nodes(model), false);
GGML_ASSERT(n_embd == hparams.n_embd_k_s());
const int64_t n_seqs = ubatch.n_seqs;
const int64_t n_seq_tokens = ubatch.n_seq_tokens;
const int64_t n_tokens = ubatch.n_tokens;
GGML_ASSERT(n_seqs != 0);
GGML_ASSERT(ubatch.equal_seqs);
GGML_ASSERT(n_tokens == n_seq_tokens * n_seqs);
struct ggml_tensor * cur;
struct ggml_tensor * inpL;
struct ggml_tensor * state_copy = build_inp_s_copy();
struct ggml_tensor * state_mask = build_inp_s_mask();
inpL = llm_build_inp_embd(ctx0, lctx, hparams, ubatch, model.tok_embd, cb);
for (int il = 0; il < n_layer; ++il) {
const llama_layer * layer = &model.layers[il];
// (ab)using the KV cache to store the states
struct ggml_tensor * token_shift = llm_build_copy_mask_state(ctx0,
gf, kv_self.k_l[il], state_copy, state_mask,
hparams.n_embd_k_s(), kv_self.size, kv_head, n_kv, n_seqs);
struct ggml_tensor * wkv_states = llm_build_copy_mask_state(ctx0,
gf, kv_self.v_l[il], state_copy, state_mask,
hparams.n_embd_v_s(), kv_self.size, kv_head, n_kv, n_seqs);
cur = ggml_reshape_3d(ctx0, inpL, n_embd, n_seq_tokens, n_seqs);
token_shift = ggml_reshape_3d(ctx0, token_shift, n_embd, 1, n_seqs);
struct ggml_tensor * x_norm_att = llm_build_norm(ctx0, cur, hparams, layer->attn_norm, layer->attn_norm_b, LLM_NORM_RMS, cb, il);
struct ggml_tensor * x_prev = ggml_concat(
ctx0,
token_shift,
ggml_view_3d(ctx0, x_norm_att, n_embd, n_seq_tokens - 1, n_seqs, x_norm_att->nb[1], x_norm_att->nb[2], 0),
1
);
ggml_build_forward_expand(
gf,
ggml_cpy(
ctx0,
wkv_states,
ggml_view_1d(
ctx0,
kv_self.v_l[il],
hparams.n_embd_v_s() * n_seqs,
hparams.n_embd_v_s() * kv_head * ggml_element_size(kv_self.v_l[il])
)
)
);
struct ggml_tensor * ffn_inp = ggml_add(ctx0, cur, llm_build_rwkv6_time_mix(lctx, ctx0, layer, x_norm_att, x_prev, &wkv_states, hparams.wkv_head_size, hparams.n_head_kv()));
ggml_build_forward_expand(gf, ffn_inp);
ggml_build_forward_expand(
gf,
ggml_cpy(
ctx0,
wkv_states,
ggml_view_1d(
ctx0,
kv_self.v_l[il],
hparams.n_embd_v_s() * n_seqs,
hparams.n_embd_v_s() * kv_head * ggml_element_size(kv_self.v_l[il])
)
)
);
cb(ffn_inp, "ffn_inp", il);
// feed-forward network
cur = llm_build_norm(ctx0, ffn_inp, hparams,
model.layers[il].ffn_norm, NULL,
LLM_NORM_RMS, cb, il);
cb(cur, "ffn_norm", il);
cur = llm_build_ffn(ctx0, lctx, cur,
model.layers[il].ffn_up, NULL, NULL,
model.layers[il].ffn_gate, NULL, NULL,
model.layers[il].ffn_down, NULL, NULL,
NULL,
LLM_FFN_SILU, LLM_FFN_PAR, cb, il);
cb(cur, "ffn_out", il);
cur = ggml_add(ctx0, cur, ffn_inp);
cur = lctx.cvec.apply_to(ctx0, cur, il);
cb(cur, "l_out", il);
// input for next layer
inpL = cur;
}
cur = inpL;
struct ggml_tensor * inp_out_ids = build_inp_out_ids();
cur = ggml_reshape_2d(ctx0, cur, n_embd, n_tokens);
cur = ggml_get_rows(ctx0, cur, inp_out_ids);
cur = llm_build_norm(ctx0, cur, hparams, model.output_norm, model.output_norm_b, LLM_NORM_RMS, cb, -1);
cb(cur, "result_norm", -1);
cur = llm_build_lora_mm(lctx, ctx0, model.output, cur);
cb(cur, "result_output", -1);
ggml_build_forward_expand(gf, cur);
return gf;
}
// ref: https://github.com/facebookresearch/chameleon
// based on the original build_llama() function, changes:
// * qk-norm
@@ -10521,6 +10772,7 @@ static struct ggml_cgraph * llama_build_graph(
result = llm.build_phi2();
} break;
case LLM_ARCH_PHI3:
case LLM_ARCH_PHIMOE:
{
result = llm.build_phi3();
} break;
@@ -10648,6 +10900,10 @@ static struct ggml_cgraph * llama_build_graph(
{
result = llm.build_rwkv6();
} break;
case LLM_ARCH_RWKV6QWEN2:
{
result = llm.build_rwkv6qwen2();
} break;
case LLM_ARCH_CHAMELEON:
{
result = llm.build_chameleon();
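The QRWKV6 branch of `llm_build_rwkv6_time_mix` above maps the decay logits through `exp(-exp(w))` and then rescales the key as `k = k * (1 - w)`. Below is a minimal scalar sketch of that arithmetic, written without ggml. The names `decay_from_logit` and `qrwkv_scale_key` are hypothetical and are not part of llama.cpp; the sketch only illustrates the element-wise math, not the tensor layout.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: same arithmetic as ggml_exp(ggml_neg(ggml_exp(w))).
// The result lies between 0 and 1, so it can act as a per-channel decay.
static float decay_from_logit(float w_logit) {
    return std::exp(-std::exp(w_logit));
}

// Hypothetical helper: element-wise k - k * w, i.e. k * (1 - w),
// matching the ggml_sub(k, ggml_mul(k, w)) call in the is_qrwkv branch.
static std::vector<float> qrwkv_scale_key(const std::vector<float> & k,
                                          const std::vector<float> & w) {
    assert(k.size() == w.size());
    std::vector<float> out(k.size());
    for (std::size_t i = 0; i < k.size(); ++i) {
        out[i] = k[i] - k[i] * w[i];
    }
    return out;
}
```

With a decay close to 1 the key contribution shrinks toward zero, and with a decay close to 0 the key passes through almost unchanged.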
+38 -4
@@ -1659,17 +1659,46 @@ struct test_rwkv_wkv6 : public test_case {
ggml_tensor * build_graph(ggml_context * ctx) override {
const int64_t n_tokens = n_seq_tokens * n_seqs;
ggml_tensor * r = ggml_new_tensor(ctx, type, 4, std::vector<int64_t>{ 1, head_size, head_count, n_tokens }.data());
ggml_tensor * k = ggml_new_tensor(ctx, type, 4, std::vector<int64_t>{ head_size, 1, head_count, n_tokens }.data());
ggml_tensor * v = ggml_new_tensor(ctx, type, 4, std::vector<int64_t>{ 1, head_size, head_count, n_tokens }.data());
ggml_tensor * r = ggml_new_tensor(ctx, type, 3, std::vector<int64_t>{ head_size, head_count, n_tokens }.data());
ggml_tensor * k = ggml_new_tensor(ctx, type, 3, std::vector<int64_t>{ head_size, head_count, n_tokens }.data());
ggml_tensor * v = ggml_new_tensor(ctx, type, 3, std::vector<int64_t>{ head_size, head_count, n_tokens }.data());
ggml_tensor * tf = ggml_new_tensor(ctx, type, 2, std::vector<int64_t>{ head_size, head_count }.data());
ggml_tensor * td = ggml_new_tensor(ctx, type, 4, std::vector<int64_t>{ 1, head_size, head_count, n_tokens }.data());
ggml_tensor * td = ggml_new_tensor(ctx, type, 3, std::vector<int64_t>{ head_size, head_count, n_tokens }.data());
ggml_tensor * s = ggml_new_tensor(ctx, type, 2, std::vector<int64_t>{ head_size * head_size * head_count, n_seqs }.data());
ggml_tensor * out = ggml_rwkv_wkv6(ctx, k, v, r, tf, td, s);
return out;
}
};
// GGML_OP_GATED_LINEAR_ATTN
struct test_gla : public test_case {
const ggml_type type;
const int64_t head_count;
const int64_t head_size;
const int64_t n_seq_tokens;
const int64_t n_seqs;
std::string vars() override {
return VARS_TO_STR5(type, head_count, head_size, n_seq_tokens, n_seqs);
}
test_gla(ggml_type type = GGML_TYPE_F32,
int64_t head_count = 32, int64_t head_size = 64, int64_t n_seq_tokens = 32, int64_t n_seqs = 32)
: type(type), head_count(head_count), head_size(head_size), n_seq_tokens(n_seq_tokens), n_seqs(n_seqs) {}
ggml_tensor * build_graph(ggml_context * ctx) override {
const int64_t n_tokens = n_seq_tokens * n_seqs;
ggml_tensor * q = ggml_new_tensor(ctx, type, 3, std::vector<int64_t>{ head_size, head_count, n_tokens }.data());
ggml_tensor * k = ggml_new_tensor(ctx, type, 3, std::vector<int64_t>{ head_size, head_count, n_tokens }.data());
ggml_tensor * v = ggml_new_tensor(ctx, type, 3, std::vector<int64_t>{ head_size, head_count, n_tokens }.data());
ggml_tensor * g = ggml_new_tensor(ctx, type, 3, std::vector<int64_t>{ head_size, head_count, n_tokens }.data());
ggml_tensor * s = ggml_new_tensor(ctx, type, 2, std::vector<int64_t>{ head_size * head_size * head_count, n_seqs }.data());
ggml_tensor * out = ggml_gated_linear_attn(ctx, k, v, q, g, s, pow(head_size, -0.5));
return out;
}
};
// GGML_OP_MUL_MAT
struct test_mul_mat : public test_case {
const ggml_type type_a;
@@ -3626,6 +3655,11 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
test_cases.emplace_back(new test_rwkv_wkv6(GGML_TYPE_F32, 32, 64, 32, 4));
test_cases.emplace_back(new test_rwkv_wkv6(GGML_TYPE_F32, 32, 64, 128, 4));
test_cases.emplace_back(new test_gla(GGML_TYPE_F32, 32, 64, 1, 1));
test_cases.emplace_back(new test_gla(GGML_TYPE_F32, 32, 64, 32, 1));
test_cases.emplace_back(new test_gla(GGML_TYPE_F32, 32, 64, 32, 4));
test_cases.emplace_back(new test_gla(GGML_TYPE_F32, 32, 64, 128, 4));
for (int i = 1; i < 9; ++i) {
test_cases.emplace_back(new test_mul_mat(GGML_TYPE_F16, GGML_TYPE_F32, 16, i, 256, { 1, 1}, {1, 1}));
test_cases.emplace_back(new test_mul_mat(GGML_TYPE_Q4_0, GGML_TYPE_F32, 16, i, 256, { 1, 1}, {1, 1}));
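The `test_gla` case above builds `q`, `k`, `v` and `g` with shape `{head_size, head_count, n_tokens}`, a state of shape `{head_size * head_size * head_count, n_seqs}`, and passes `pow(head_size, -0.5)` as the scale. The following sketch just spells out those sizes for one parameter set. The struct `gla_sizes` and function `gla_test_sizes` are hypothetical names introduced for illustration, not part of test-backend-ops.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical summary of the tensor sizes used by the test_gla graph.
struct gla_sizes {
    int64_t n_tokens;     // n_seq_tokens * n_seqs
    int64_t qkvg_elems;   // elements in each of q, k, v, g
    int64_t state_elems;  // elements in the recurrent state s
    double  scale;        // head_size^(-1/2)
};

static gla_sizes gla_test_sizes(int64_t head_count, int64_t head_size,
                                int64_t n_seq_tokens, int64_t n_seqs) {
    assert(head_count > 0 && head_size > 0 && n_seq_tokens > 0 && n_seqs > 0);
    gla_sizes s;
    s.n_tokens    = n_seq_tokens * n_seqs;
    s.qkvg_elems  = head_size * head_count * s.n_tokens;
    s.state_elems = head_size * head_size * head_count * n_seqs;
    s.scale       = std::pow(static_cast<double>(head_size), -0.5);
    return s;
}
```

For the `test_gla(GGML_TYPE_F32, 32, 64, 32, 4)` case this gives 128 tokens and a scale of 0.125, since `head_size` is 64.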
+5 -1
@@ -78,7 +78,9 @@ int main(void) {
// ai-sage/GigaChat-20B-A3B-instruct
"{% if messages[0]['role'] == 'system' -%}\n {%- set loop_messages = messages[1:] -%}\n {%- set system_message = bos_token + messages[0]['content'] + additional_special_tokens[1] -%}\n{%- else -%}\n {%- set loop_messages = messages -%}\n {%- set system_message = bos_token + '' -%}\n{%- endif -%}\n{%- for message in loop_messages %}\n {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}\n {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}\n {% endif %}\n \n {%- if loop.index0 == 0 -%}\n {{ system_message -}}\n {%- endif -%}\n {%- if message['role'] == 'user' -%}\n {{ message['role'] + additional_special_tokens[0] + message['content'] + additional_special_tokens[1] -}}\n {{ 'available functions' + additional_special_tokens[0] + additional_special_tokens[2] + additional_special_tokens[3] + additional_special_tokens[1] -}}\n {%- endif -%}\n {%- if message['role'] == 'assistant' -%}\n {{ message['role'] + additional_special_tokens[0] + message['content'] + additional_special_tokens[1] -}}\n {%- endif -%}\n {%- if loop.last and add_generation_prompt -%}\n {{ 'assistant' + additional_special_tokens[0] -}}\n {%- endif -%}\n{%- endfor %}",
// Infinigence/Megrez-3B-Instruct
u8"{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|role_start|>system<|role_end|>你是Megrez-3B-Instruct,将针对用户的问题给出详细的、积极的回答。<|turn_end|>' }}{% endif %}{{ '<|role_start|>' + message['role'] + '<|role_end|>' + message['content'] + '<|turn_end|>' }}{% endfor %}{% if add_generation_prompt %}{{ '<|role_start|>assistant<|role_end|>' }}{% endif %}"
u8"{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|role_start|>system<|role_end|>你是Megrez-3B-Instruct,将针对用户的问题给出详细的、积极的回答。<|turn_end|>' }}{% endif %}{{ '<|role_start|>' + message['role'] + '<|role_end|>' + message['content'] + '<|turn_end|>' }}{% endfor %}{% if add_generation_prompt %}{{ '<|role_start|>assistant<|role_end|>' }}{% endif %}",
// phi-4
"{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|><|im_start|>assistant<|im_sep|>'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}",
};
std::vector<std::string> expected_output = {
// teknium/OpenHermes-2.5-Mistral-7B
@@ -137,6 +139,8 @@ int main(void) {
"<s>You are a helpful assistant<|message_sep|>user<|role_sep|>Hello<|message_sep|>available functions<|role_sep|>[]<|message_sep|>assistant<|role_sep|>Hi there<|message_sep|>user<|role_sep|>Who are you<|message_sep|>available functions<|role_sep|>[]<|message_sep|>assistant<|role_sep|> I am an assistant <|message_sep|>user<|role_sep|>Another question<|message_sep|>available functions<|role_sep|>[]<|message_sep|>assistant<|role_sep|>",
// Infinigence/Megrez-3B-Instruct
"<|role_start|>system<|role_end|>You are a helpful assistant<|turn_end|><|role_start|>user<|role_end|>Hello<|turn_end|><|role_start|>assistant<|role_end|>Hi there<|turn_end|><|role_start|>user<|role_end|>Who are you<|turn_end|><|role_start|>assistant<|role_end|> I am an assistant <|turn_end|><|role_start|>user<|role_end|>Another question<|turn_end|><|role_start|>assistant<|role_end|>",
// phi-4
"<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|><|im_start|>user<|im_sep|>Hello<|im_end|><|im_start|>assistant<|im_sep|>Hi there<|im_end|><|im_start|>user<|im_sep|>Who are you<|im_end|><|im_start|>assistant<|im_sep|> I am an assistant <|im_end|><|im_start|>user<|im_sep|>Another question<|im_end|><|im_start|>assistant<|im_sep|>",
};
std::vector<char> formatted_chat(1024);
int32_t res;
+201 -170
@@ -15,66 +15,71 @@ constexpr int offset_has_tensors = 2000;
constexpr int offset_has_data = 3000;
enum handcrafted_file_type {
HANDCRAFTED_HEADER_BAD_MAGIC = 10,
HANDCRAFTED_HEADER_BAD_VERSION_1 = 20,
HANDCRAFTED_HEADER_BAD_VERSION_FUTURE = 30,
HANDCRAFTED_HEADER_BAD_N_TENSORS = 40,
HANDCRAFTED_HEADER_BAD_N_KV = 50,
HANDCRAFTED_HEADER_EMPTY = 800,
HANDCRAFTED_HEADER_BAD_MAGIC = 10,
HANDCRAFTED_HEADER_BAD_VERSION_1 = 20,
HANDCRAFTED_HEADER_BAD_VERSION_FUTURE = 30,
HANDCRAFTED_HEADER_BAD_N_TENSORS = 40,
HANDCRAFTED_HEADER_BAD_N_KV = 50,
HANDCRAFTED_HEADER_EMPTY = 800,
HANDCRAFTED_KV_BAD_KEY_SIZE = 10 + offset_has_kv,
HANDCRAFTED_KV_BAD_TYPE = 20 + offset_has_kv,
HANDCRAFTED_KV_BAD_VALUE_SIZE = 30 + offset_has_kv,
HANDCRAFTED_KV_DUPLICATE_KEY = 40 + offset_has_kv,
HANDCRAFTED_KV_SUCCESS = 800 + offset_has_kv,
HANDCRAFTED_KV_BAD_KEY_SIZE = 10 + offset_has_kv,
HANDCRAFTED_KV_BAD_TYPE = 20 + offset_has_kv,
// HANDCRAFTED_KV_BAD_VALUE_SIZE = 30 + offset_has_kv, // removed because it can result in allocations > 1 TB (default sanitizer limit)
HANDCRAFTED_KV_DUPLICATE_KEY = 40 + offset_has_kv,
HANDCRAFTED_KV_BAD_ALIGN = 50 + offset_has_kv,
HANDCRAFTED_KV_SUCCESS = 800 + offset_has_kv,
HANDCRAFTED_TENSORS_BAD_NAME_SIZE = 10 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_N_DIMS = 20 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_SHAPE = 30 + offset_has_tensors,
HANDCRAFTED_TENSORS_NE_TOO_BIG = 40 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_TYPE = 50 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_OFFSET = 60 + offset_has_tensors,
HANDCRAFTED_TENSORS_DUPLICATE_NAME = 70 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_ALIGNMENT = 80 + offset_has_tensors,
HANDCRAFTED_TENSORS_SUCCESS = 800 + offset_has_tensors,
HANDCRAFTED_TENSORS_CUSTOM_ALIGN = 810 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_NAME_SIZE = 10 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_N_DIMS = 20 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_SHAPE = 30 + offset_has_tensors,
HANDCRAFTED_TENSORS_NE_TOO_BIG = 40 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_TYPE = 50 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_OFFSET = 60 + offset_has_tensors,
HANDCRAFTED_TENSORS_DUPLICATE_NAME = 70 + offset_has_tensors,
HANDCRAFTED_TENSORS_BAD_ALIGN = 75 + offset_has_tensors,
HANDCRAFTED_TENSORS_INCONSISTENT_ALIGN = 80 + offset_has_tensors,
HANDCRAFTED_TENSORS_SUCCESS = 800 + offset_has_tensors,
HANDCRAFTED_TENSORS_CUSTOM_ALIGN = 810 + offset_has_tensors,
HANDCRAFTED_DATA_NOT_ENOUGH_DATA = 10 + offset_has_data,
HANDCRAFTED_DATA_BAD_ALIGNMENT = 20 + offset_has_data,
HANDCRAFTED_DATA_SUCCESS = 800 + offset_has_data,
HANDCRAFTED_DATA_CUSTOM_ALIGN = 810 + offset_has_data,
HANDCRAFTED_DATA_NOT_ENOUGH_DATA = 10 + offset_has_data,
HANDCRAFTED_DATA_BAD_ALIGN = 15 + offset_has_data,
HANDCRAFTED_DATA_INCONSISTENT_ALIGN = 20 + offset_has_data,
HANDCRAFTED_DATA_SUCCESS = 800 + offset_has_data,
HANDCRAFTED_DATA_CUSTOM_ALIGN = 810 + offset_has_data,
};
std::string handcrafted_file_type_name(const enum handcrafted_file_type hft) {
switch (hft) {
case HANDCRAFTED_HEADER_BAD_MAGIC: return "HEADER_BAD_MAGIC";
case HANDCRAFTED_HEADER_BAD_VERSION_1: return "HEADER_BAD_VERSION_1";
case HANDCRAFTED_HEADER_BAD_VERSION_FUTURE: return "HEADER_BAD_VERSION_FUTURE";
case HANDCRAFTED_HEADER_BAD_N_KV: return "HEADER_BAD_N_KV";
case HANDCRAFTED_HEADER_BAD_N_TENSORS: return "HEADER_BAD_N_TENSORS";
case HANDCRAFTED_HEADER_EMPTY: return "HEADER_EMPTY";
case HANDCRAFTED_HEADER_BAD_MAGIC: return "HEADER_BAD_MAGIC";
case HANDCRAFTED_HEADER_BAD_VERSION_1: return "HEADER_BAD_VERSION_1";
case HANDCRAFTED_HEADER_BAD_VERSION_FUTURE: return "HEADER_BAD_VERSION_FUTURE";
case HANDCRAFTED_HEADER_BAD_N_KV: return "HEADER_BAD_N_KV";
case HANDCRAFTED_HEADER_BAD_N_TENSORS: return "HEADER_BAD_N_TENSORS";
case HANDCRAFTED_HEADER_EMPTY: return "HEADER_EMPTY";
case HANDCRAFTED_KV_BAD_KEY_SIZE: return "KV_BAD_KEY_SIZE";
case HANDCRAFTED_KV_BAD_TYPE: return "KV_BAD_TYPE";
case HANDCRAFTED_KV_BAD_VALUE_SIZE: return "KV_BAD_VALUE_SIZE";
case HANDCRAFTED_KV_DUPLICATE_KEY: return "KV_DUPLICATE_KEY";
case HANDCRAFTED_KV_SUCCESS: return "KV_RANDOM_KV";
case HANDCRAFTED_KV_BAD_KEY_SIZE: return "KV_BAD_KEY_SIZE";
case HANDCRAFTED_KV_BAD_TYPE: return "KV_BAD_TYPE";
case HANDCRAFTED_KV_DUPLICATE_KEY: return "KV_DUPLICATE_KEY";
case HANDCRAFTED_KV_BAD_ALIGN: return "KV_BAD_ALIGN";
case HANDCRAFTED_KV_SUCCESS: return "KV_RANDOM_KV";
case HANDCRAFTED_TENSORS_BAD_NAME_SIZE: return "TENSORS_BAD_NAME_SIZE";
case HANDCRAFTED_TENSORS_BAD_N_DIMS: return "TENSORS_BAD_N_DIMS";
case HANDCRAFTED_TENSORS_BAD_SHAPE: return "TENSORS_BAD_SHAPE";
case HANDCRAFTED_TENSORS_NE_TOO_BIG: return "TENSORS_NE_TOO_BIG";
case HANDCRAFTED_TENSORS_BAD_TYPE: return "TENSORS_BAD_TYPE";
case HANDCRAFTED_TENSORS_BAD_OFFSET: return "TENSORS_BAD_OFFSET";
case HANDCRAFTED_TENSORS_DUPLICATE_NAME: return "TENSORS_DUPLICATE_NAME";
case HANDCRAFTED_TENSORS_BAD_ALIGNMENT: return "TENSORS_BAD_ALIGNMENT";
case HANDCRAFTED_TENSORS_SUCCESS: return "TENSORS_SUCCESS";
case HANDCRAFTED_TENSORS_CUSTOM_ALIGN: return "TENSORS_CUSTOM_ALIGN";
case HANDCRAFTED_TENSORS_BAD_NAME_SIZE: return "TENSORS_BAD_NAME_SIZE";
case HANDCRAFTED_TENSORS_BAD_N_DIMS: return "TENSORS_BAD_N_DIMS";
case HANDCRAFTED_TENSORS_BAD_SHAPE: return "TENSORS_BAD_SHAPE";
case HANDCRAFTED_TENSORS_NE_TOO_BIG: return "TENSORS_NE_TOO_BIG";
case HANDCRAFTED_TENSORS_BAD_TYPE: return "TENSORS_BAD_TYPE";
case HANDCRAFTED_TENSORS_BAD_OFFSET: return "TENSORS_BAD_OFFSET";
case HANDCRAFTED_TENSORS_DUPLICATE_NAME: return "TENSORS_DUPLICATE_NAME";
case HANDCRAFTED_TENSORS_BAD_ALIGN: return "TENSORS_BAD_ALIGN";
case HANDCRAFTED_TENSORS_INCONSISTENT_ALIGN: return "TENSORS_INCONSISTENT_ALIGN";
case HANDCRAFTED_TENSORS_SUCCESS: return "TENSORS_SUCCESS";
case HANDCRAFTED_TENSORS_CUSTOM_ALIGN: return "TENSORS_CUSTOM_ALIGN";
case HANDCRAFTED_DATA_NOT_ENOUGH_DATA: return "DATA_NOT_ENOUGH_DATA";
case HANDCRAFTED_DATA_BAD_ALIGNMENT: return "DATA_BAD_ALIGNMENT";
case HANDCRAFTED_DATA_SUCCESS: return "DATA_SUCCESS";
case HANDCRAFTED_DATA_CUSTOM_ALIGN: return "DATA_CUSTOM_ALIGN";
case HANDCRAFTED_DATA_NOT_ENOUGH_DATA: return "DATA_NOT_ENOUGH_DATA";
case HANDCRAFTED_DATA_BAD_ALIGN: return "DATA_BAD_ALIGN";
case HANDCRAFTED_DATA_INCONSISTENT_ALIGN: return "DATA_INCONSISTENT_ALIGN";
case HANDCRAFTED_DATA_SUCCESS: return "DATA_SUCCESS";
case HANDCRAFTED_DATA_CUSTOM_ALIGN: return "DATA_CUSTOM_ALIGN";
}
GGML_ABORT("fatal error");
}
@@ -140,31 +145,41 @@ std::vector<std::pair<enum gguf_type, enum gguf_type>> get_kv_types(std::mt19937
return kv_types;
}
static void helper_write(const void * data, const size_t nbytes, FILE * file) {
template <typename T>
static void helper_write(FILE * file, const T & val) {
GGML_ASSERT(fwrite(&val, 1, sizeof(val), file) == sizeof(val));
}
static void helper_write(FILE * file, const void * data, const size_t nbytes) {
GGML_ASSERT(fwrite(data, 1, nbytes, file) == nbytes);
}
static FILE * get_handcrafted_file(const unsigned int seed, const enum handcrafted_file_type hft, const int extra_bytes = 0) {
FILE * file = tmpfile();
if (!file) {
return file;
}
std::mt19937 rng(seed);
uint32_t alignment = GGUF_DEFAULT_ALIGNMENT;
if (hft == HANDCRAFTED_HEADER_BAD_MAGIC) {
const char bad_magic[4] = {'F', 'U', 'G', 'G'};
helper_write(bad_magic, sizeof(bad_magic), file);
helper_write(file, bad_magic, sizeof(bad_magic));
} else {
helper_write(GGUF_MAGIC, 4, file);
helper_write(file, GGUF_MAGIC, 4);
}
if (hft == HANDCRAFTED_HEADER_BAD_VERSION_1) {
const uint32_t version = 1;
helper_write(&version, sizeof(version), file);
helper_write(file, version);
} else if (hft == HANDCRAFTED_HEADER_BAD_VERSION_FUTURE) {
const uint32_t version = GGUF_VERSION + 1;
helper_write(&version, sizeof(version), file);
helper_write(file, version);
} else {
const uint32_t version = GGUF_VERSION;
helper_write(&version, sizeof(version), file);
helper_write(file, version);
}
std::vector<tensor_config_t> tensor_configs;
@@ -174,10 +189,10 @@ static FILE * get_handcrafted_file(const unsigned int seed, const enum handcraft
if (hft == HANDCRAFTED_HEADER_BAD_N_TENSORS) {
const uint64_t n_tensors = -1;
helper_write(&n_tensors, sizeof(n_tensors), file);
helper_write(file, n_tensors);
} else {
const uint64_t n_tensors = tensor_configs.size();
helper_write(&n_tensors, sizeof(n_tensors), file);
helper_write(file, n_tensors);
}
std::vector<std::pair<enum gguf_type, enum gguf_type>> kv_types;
@@ -186,41 +201,49 @@ static FILE * get_handcrafted_file(const unsigned int seed, const enum handcraft
}
{
uint64_t n_kv = kv_types.size();
if (hft == HANDCRAFTED_TENSORS_CUSTOM_ALIGN || hft == HANDCRAFTED_DATA_CUSTOM_ALIGN) {
if (hft == HANDCRAFTED_KV_BAD_ALIGN ||
hft == HANDCRAFTED_TENSORS_BAD_ALIGN || hft == HANDCRAFTED_TENSORS_CUSTOM_ALIGN ||
hft == HANDCRAFTED_DATA_BAD_ALIGN || hft == HANDCRAFTED_DATA_CUSTOM_ALIGN) {
n_kv += 1;
} else if (hft == HANDCRAFTED_HEADER_BAD_N_KV) {
n_kv = -1;
}
helper_write(&n_kv, sizeof(n_kv), file);
helper_write(file, n_kv);
}
if (hft < offset_has_kv) {
while (ftell(file) % alignment != 0) {
const char pad = 0;
helper_write(file, pad);
}
for (int i = 0; i < extra_bytes; ++i) {
const char tmp = 0;
helper_write(&tmp, sizeof(tmp), file);
helper_write(file, tmp);
}
rewind(file);
return file;
}
for (int i = 0; i < int(kv_types.size()); ++i) {
const enum gguf_type type = gguf_type(hft == HANDCRAFTED_KV_BAD_TYPE ? -1 : kv_types[i].first);
const enum gguf_type type_arr = gguf_type(hft == HANDCRAFTED_KV_BAD_TYPE ? -1 : kv_types[i].second);
const enum gguf_type type = gguf_type(hft == HANDCRAFTED_KV_BAD_TYPE ? GGUF_TYPE_COUNT : kv_types[i].first);
const enum gguf_type type_arr = gguf_type(hft == HANDCRAFTED_KV_BAD_TYPE ? GGUF_TYPE_COUNT : kv_types[i].second);
const std::string key = "my_key_" + std::to_string((hft == HANDCRAFTED_KV_DUPLICATE_KEY ? i/2 : i));
if (hft == HANDCRAFTED_KV_BAD_KEY_SIZE) {
const uint64_t n = -1;
helper_write(&n, sizeof(n), file);
helper_write(file, n);
} else {
const uint64_t n = key.length();
helper_write(&n, sizeof(n), file);
helper_write(file, n);
}
helper_write(key.data(), key.length(), file);
helper_write(file, key.data(), key.length());
{
const int32_t type32 = int32_t(type);
helper_write(&type32, sizeof(type32), file);
helper_write(file, type32);
}
uint32_t data[16];
@@ -233,69 +256,67 @@ static FILE * get_handcrafted_file(const unsigned int seed, const enum handcraft
if (type == GGUF_TYPE_STRING) {
const uint64_t n = rng() % sizeof(data);
helper_write(&n, sizeof(n), file);
helper_write(data, n, file);
helper_write(file, n);
helper_write(file, data, n);
continue;
}
if (type == GGUF_TYPE_ARRAY) {
{
const int32_t type32 = int32_t(type_arr);
helper_write(&type32, sizeof(type32), file);
helper_write(file, type32);
}
if (type_arr == GGUF_TYPE_STRING) {
const uint64_t nstr = rng() % (16 + 1);
helper_write(&nstr, sizeof(nstr), file);
helper_write(file, nstr);
for (uint64_t istr = 0; istr < nstr; ++istr) {
const uint64_t n = rng() % (sizeof(uint32_t) + 1);
helper_write(&n, sizeof(n), file);
helper_write(&data[istr], n, file);
helper_write(file, n);
helper_write(file, &data[istr], n);
}
continue;
}
const size_t type_size = gguf_type_size(type_arr);
const uint64_t n = (rng() % sizeof(data)) / type_size;
helper_write(&n, sizeof(n), file);
helper_write(&data, n*type_size, file);
helper_write(file, n);
helper_write(file, &data, n*type_size);
continue;
}
size_t type_size = hft == HANDCRAFTED_KV_BAD_TYPE ? 1 : gguf_type_size(type);
if (hft == HANDCRAFTED_KV_BAD_VALUE_SIZE) {
type_size += rng() % 3;
}
helper_write(data, type_size, file);
helper_write(file, data, hft == HANDCRAFTED_KV_BAD_TYPE ? 1 : gguf_type_size(type));
}
if (hft == HANDCRAFTED_TENSORS_CUSTOM_ALIGN || hft == HANDCRAFTED_DATA_CUSTOM_ALIGN) {
const std::string key = "general.alignment";
{
const uint64_t n = key.length();
helper_write(&n, sizeof(n), file);
}
helper_write(key.data(), key.length(), file);
if (hft == HANDCRAFTED_KV_BAD_ALIGN ||
hft == HANDCRAFTED_TENSORS_BAD_ALIGN || hft == HANDCRAFTED_TENSORS_CUSTOM_ALIGN ||
hft == HANDCRAFTED_DATA_BAD_ALIGN || hft == HANDCRAFTED_DATA_CUSTOM_ALIGN) {
const uint64_t n = strlen(GGUF_KEY_GENERAL_ALIGNMENT);
helper_write(file, n);
helper_write(file, GGUF_KEY_GENERAL_ALIGNMENT, n);
const int32_t type = gguf_type(GGUF_TYPE_UINT32);
helper_write(&type, sizeof(type), file);
helper_write(file, type);
const uint32_t alignment = GGUF_DEFAULT_ALIGNMENT + 1;
helper_write(&alignment, sizeof(alignment), file);
alignment = expect_context_not_null(hft) ? 1 : 13;
helper_write(file, alignment);
}
if (hft < offset_has_tensors) {
while (ftell(file) % alignment != 0) {
const char pad = 0;
helper_write(file, pad);
}
for (int i = 0; i < extra_bytes; ++i) {
const char tmp = 0;
helper_write(&tmp, sizeof(tmp), file);
helper_write(file, tmp);
}
rewind(file);
return file;
}
uint32_t alignment = GGUF_DEFAULT_ALIGNMENT;
if (hft == HANDCRAFTED_TENSORS_BAD_ALIGNMENT || hft == HANDCRAFTED_DATA_BAD_ALIGNMENT) {
alignment -= 1;
} else if (hft == HANDCRAFTED_TENSORS_CUSTOM_ALIGN || hft == HANDCRAFTED_DATA_CUSTOM_ALIGN) {
alignment += 1;
if (hft == HANDCRAFTED_TENSORS_INCONSISTENT_ALIGN || hft == HANDCRAFTED_DATA_INCONSISTENT_ALIGN) {
alignment = 1;
}
uint64_t offset = 0;
@@ -313,9 +334,9 @@ static FILE * get_handcrafted_file(const unsigned int seed, const enum handcraft
}
{
const uint64_t n = name.length();
helper_write(&n, sizeof(n), file);
helper_write(file, n);
}
helper_write(name.data(), name.length(), file);
helper_write(file, name.data(), name.length());
uint32_t n_dims = hft == HANDCRAFTED_TENSORS_NE_TOO_BIG ? 2 : 1;
for (int i = GGML_MAX_DIMS-1; i >= 1; --i) {
@@ -326,35 +347,35 @@ static FILE * get_handcrafted_file(const unsigned int seed, const enum handcraft
}
if (hft == HANDCRAFTED_TENSORS_BAD_N_DIMS) {
const uint32_t n_dims_bad = GGML_MAX_DIMS + 1;
helper_write(&n_dims_bad, sizeof(n_dims_bad), file);
helper_write(file, n_dims_bad);
} else {
helper_write(&n_dims, sizeof(n_dims), file);
helper_write(file, n_dims);
}
if (hft == HANDCRAFTED_TENSORS_BAD_SHAPE) {
for (uint32_t j = 0; j < n_dims; ++j) {
const int64_t bad_dim = -1;
helper_write(&bad_dim, sizeof(bad_dim), file);
helper_write(file, bad_dim);
}
} else if (hft == HANDCRAFTED_TENSORS_NE_TOO_BIG){
for (uint32_t j = 0; j < n_dims; ++j) {
const int64_t big_dim = 4*int64_t(INT32_MAX);
helper_write(&big_dim, sizeof(big_dim), file);
helper_write(file, big_dim);
}
} else {
helper_write(shape.data(), n_dims*sizeof(int64_t), file);
helper_write(file, shape.data(), n_dims*sizeof(int64_t));
}
{
const int32_t type32 = hft == HANDCRAFTED_TENSORS_BAD_TYPE ? -1 : int32_t(type);
helper_write(&type32, sizeof(type32), file);
const int32_t type32 = hft == HANDCRAFTED_TENSORS_BAD_TYPE ? GGML_TYPE_COUNT : int32_t(type);
helper_write(file, type32);
}
if (hft == HANDCRAFTED_TENSORS_BAD_OFFSET) {
const uint64_t bad_offset = -1;
helper_write(&bad_offset, sizeof(bad_offset), file);
helper_write(file, bad_offset);
} else {
helper_write(&offset, sizeof(offset), file);
helper_write(file, offset);
}
int64_t ne = shape[0];
@@ -364,12 +385,9 @@ static FILE * get_handcrafted_file(const unsigned int seed, const enum handcraft
offset += GGML_PAD(ggml_row_size(type, ne), alignment);
}
const uint32_t alignment_overshoot = ftell(file) % alignment;
if (alignment_overshoot != 0) {
for (size_t i = alignment_overshoot; i < alignment; ++i) {
const char pad = 0;
helper_write(&pad, sizeof(pad), file);
}
while (ftell(file) % alignment != 0) {
const char pad = 0;
helper_write(file, pad);
}
if (hft >= offset_has_data) {
@@ -380,13 +398,13 @@ static FILE * get_handcrafted_file(const unsigned int seed, const enum handcraft
}
for (uint64_t i = 0; i < nbytes; ++i) {
const uint8_t random_byte = i % 256;
helper_write(&random_byte, sizeof(random_byte), file);
helper_write(file, random_byte);
}
}
for (int i = 0; i < extra_bytes; ++i) {
const char tmp = 0;
helper_write(&tmp, sizeof(tmp), file);
helper_write(file, tmp);
}
rewind(file);
return file;
@@ -505,6 +523,16 @@ static bool handcrafted_check_kv(const gguf_context * gguf_ctx, const unsigned i
}
const char * data_gguf = reinterpret_cast<const char *>(gguf_get_arr_data(gguf_ctx, id));
if (type_arr == GGUF_TYPE_BOOL) {
for (size_t arr_i = 0; arr_i < arr_n; ++arr_i) {
if (bool(data8[arr_i]) != bool(data_gguf[arr_i])) {
ok = false;
}
}
continue;
}
if (!std::equal(data8, data8 + arr_n*type_size, data_gguf)) {
ok = false;
}
@@ -512,12 +540,20 @@ static bool handcrafted_check_kv(const gguf_context * gguf_ctx, const unsigned i
}
const char * data_gguf = reinterpret_cast<const char *>(gguf_get_val_data(gguf_ctx, id));
if (type == GGUF_TYPE_BOOL) {
if (bool(*data8) != bool(*data_gguf)) {
ok = false;
}
continue;
}
if (!std::equal(data8, data8 + gguf_type_size(type), data_gguf)) {
ok = false;
}
}
const uint32_t expected_alignment = alignment_defined ? GGUF_DEFAULT_ALIGNMENT + 1 : GGUF_DEFAULT_ALIGNMENT;
const uint32_t expected_alignment = alignment_defined ? 1 : GGUF_DEFAULT_ALIGNMENT;
if (gguf_get_alignment(gguf_ctx) != expected_alignment) {
ok = false;
}
@@ -539,7 +575,7 @@ static bool handcrafted_check_tensors(const gguf_context * gguf_ctx, const unsig
bool ok = true;
const int id_alignment = gguf_find_key(gguf_ctx, "general.alignment");
const int id_alignment = gguf_find_key(gguf_ctx, GGUF_KEY_GENERAL_ALIGNMENT);
const uint32_t alignment = id_alignment >= 0 ? gguf_get_val_u32(gguf_ctx, id_alignment) : GGUF_DEFAULT_ALIGNMENT;
uint64_t expected_offset = 0;
@@ -607,7 +643,7 @@ static bool handcrafted_check_tensor_data(const gguf_context * gguf_ctx, const u
std::vector<uint8_t> data(size);
GGML_ASSERT(fseek(file, gguf_get_data_offset(gguf_ctx) + offset, SEEK_SET) == 0);
GGML_ASSERT(fread(data.data(), 1, size, file) == size);
GGML_ASSERT(fread(data.data(), 1, data.size(), file) == data.size());
for (size_t j = 0; j < size; ++j) {
const uint8_t expected_byte = (j + offset) % 256;
@@ -627,15 +663,15 @@ static std::pair<int, int> test_handcrafted_file(const unsigned int seed) {
const std::vector<handcrafted_file_type> hfts = {
HANDCRAFTED_HEADER_BAD_MAGIC,
HANDCRAFTED_HEADER_BAD_VERSION_1,
// HANDCRAFTED_FILE_TYPE_BAD_VERSION_FUTURE, // FIXME
HANDCRAFTED_HEADER_BAD_VERSION_FUTURE,
HANDCRAFTED_HEADER_BAD_N_KV,
HANDCRAFTED_HEADER_BAD_N_TENSORS,
HANDCRAFTED_HEADER_EMPTY,
HANDCRAFTED_KV_BAD_KEY_SIZE,
HANDCRAFTED_KV_BAD_TYPE,
// HANDCRAFTED_KV_BAD_VALUE_SIZE, // FIXME sanitizer limit
// HANDCRAFTED_FILE_TYPE_DUPLICATE_KEY, // FIXME
HANDCRAFTED_KV_DUPLICATE_KEY,
HANDCRAFTED_KV_BAD_ALIGN,
HANDCRAFTED_KV_SUCCESS,
HANDCRAFTED_TENSORS_BAD_NAME_SIZE,
@@ -643,14 +679,16 @@ static std::pair<int, int> test_handcrafted_file(const unsigned int seed) {
HANDCRAFTED_TENSORS_BAD_SHAPE,
HANDCRAFTED_TENSORS_NE_TOO_BIG,
HANDCRAFTED_TENSORS_BAD_TYPE,
// HANDCRAFTED_TENSORS_BAD_OFFSET, // FIXME
HANDCRAFTED_TENSORS_BAD_OFFSET,
HANDCRAFTED_TENSORS_DUPLICATE_NAME,
// HANDCRAFTED_TENSORS_BAD_ALIGNMENT, // FIXME
HANDCRAFTED_TENSORS_BAD_ALIGN,
HANDCRAFTED_TENSORS_INCONSISTENT_ALIGN,
HANDCRAFTED_TENSORS_SUCCESS,
HANDCRAFTED_TENSORS_CUSTOM_ALIGN,
HANDCRAFTED_DATA_NOT_ENOUGH_DATA,
// HANDCRAFTED_DATA_BAD_ALIGNMENT, // FIXME
HANDCRAFTED_DATA_BAD_ALIGN,
HANDCRAFTED_DATA_INCONSISTENT_ALIGN,
HANDCRAFTED_DATA_SUCCESS,
HANDCRAFTED_DATA_CUSTOM_ALIGN,
};
@@ -674,6 +712,7 @@ static std::pair<int, int> test_handcrafted_file(const unsigned int seed) {
/*no_alloc =*/ false,
/*ctx =*/ hft >= offset_has_data ? &ctx : nullptr,
};
struct gguf_context * gguf_ctx = gguf_init_from_file_impl(file, gguf_params);
if (expect_context_not_null(hft)) {
@@ -689,7 +728,7 @@ static std::pair<int, int> test_handcrafted_file(const unsigned int seed) {
}
ntest++;
if (false && hft >= offset_has_data && !expect_context_not_null(hft)) { // FIXME
if (hft >= offset_has_data && !expect_context_not_null(hft)) {
printf("%s: - no_dangling_ggml_context_pointer: ", __func__);
if (ctx) {
printf("\033[1;31mFAIL\033[0m\n");
@@ -700,23 +739,6 @@ static std::pair<int, int> test_handcrafted_file(const unsigned int seed) {
ntest++;
}
if (false && expect_context_not_null(hft)) { // FIXME
FILE * file_eb = get_handcrafted_file(seed, hft, /*extra_bytes =*/ 1);
struct gguf_context * gguf_ctx_eb = gguf_init_from_file_impl(file_eb, gguf_params);
printf("%s: - context_null_with_extra_bytes: ", __func__);
if (gguf_ctx_eb) {
printf("\033[1;31mFAIL\033[0m\n");
} else {
printf("\033[1;32mOK\033[0m\n");
npass++;
}
ntest++;
gguf_free(gguf_ctx_eb);
fclose(file_eb);
}
const bool alignment_defined = hft == HANDCRAFTED_TENSORS_CUSTOM_ALIGN || hft == HANDCRAFTED_DATA_CUSTOM_ALIGN;
if (expect_context_not_null(hft)) {
@@ -763,14 +785,15 @@ static std::pair<int, int> test_handcrafted_file(const unsigned int seed) {
ntest++;
}
fclose(file);
if (gguf_ctx) {
ggml_free(ctx);
gguf_free(gguf_ctx);
}
fclose(file);
printf("\n");
}
return std::make_pair(npass, ntest);
}
@@ -789,10 +812,6 @@ static struct random_gguf_context_result get_random_gguf_context(ggml_backend_t
const std::string key = "my_key_" + std::to_string(rng() % 1024);
const enum gguf_type type = gguf_type(rng() % GGUF_TYPE_COUNT);
if (type == GGUF_TYPE_STRING || type == GGUF_TYPE_ARRAY) {
continue; // FIXME memory leak
}
switch (type) {
case GGUF_TYPE_UINT8: gguf_set_val_u8 (gguf_ctx, key.c_str(), rng() % (1 << 7)); break;
case GGUF_TYPE_INT8: gguf_set_val_i8 (gguf_ctx, key.c_str(), rng() % (1 << 7) - (1 << 6)); break;
@@ -826,6 +845,9 @@ static struct random_gguf_context_result get_random_gguf_context(ggml_backend_t
std::vector<uint32_t> random_data((nbytes + sizeof(uint32_t) - 1) / sizeof(uint32_t));
for (size_t j = 0; j < random_data.size(); ++j) {
random_data[j] = rng();
if (type_arr == GGUF_TYPE_BOOL) {
random_data[j] &= 0x01010101; // the sanitizer complains if booleans are not 0 or 1
}
}
gguf_set_arr_data(gguf_ctx, key.c_str(), type_arr, random_data.data(), ne);
} break;
@@ -928,6 +950,17 @@ static bool all_kv_in_other(const gguf_context * ctx, const gguf_context * other
continue;
}
if (type_arr == GGUF_TYPE_BOOL) {
const int8_t * data = reinterpret_cast<const int8_t *>(gguf_get_arr_data(ctx, id));
const int8_t * data_other = reinterpret_cast<const int8_t *>(gguf_get_arr_data(other, idx_other));
for (int arr_i = 0; arr_i < arr_n; ++arr_i) {
if (bool(data[arr_i]) != bool(data_other[arr_i])) {
ok = false;
}
}
continue;
}
if (type_arr == GGUF_TYPE_STRING) {
for (int arr_i = 0; arr_i < arr_n; ++arr_i) {
const std::string str = gguf_get_arr_str(ctx, id, arr_i);
@@ -939,8 +972,8 @@ static bool all_kv_in_other(const gguf_context * ctx, const gguf_context * other
continue;
}
const char * data = reinterpret_cast<const char *>(gguf_get_arr_data(ctx, id));
const char * data_other = reinterpret_cast<const char *>(gguf_get_arr_data(other, idx_other));
const int8_t * data = reinterpret_cast<const int8_t *>(gguf_get_arr_data(ctx, id));
const int8_t * data_other = reinterpret_cast<const int8_t *>(gguf_get_arr_data(other, idx_other));
if (!std::equal(data, data + arr_n*gguf_type_size(type_arr), data_other)) {
ok = false;
}
@@ -1028,21 +1061,6 @@ static bool same_tensor_data(const struct ggml_context * orig, const struct ggml
}
static std::pair<int, int> test_roundtrip(ggml_backend_dev_t dev, const unsigned int seed, const bool only_meta) {
FILE * file = tmpfile();
#ifdef _WIN32
if (!file) {
printf("%s: failed to create tmpfile(), needs elevated privileges on Windows\n", __func__);
printf("%s: skipping tests\n", __func__);
return std::make_pair(0, 0);
}
#else
GGML_ASSERT(file);
#endif // _WIN32
if (ggml_backend_dev_type(dev) != GGML_BACKEND_DEVICE_TYPE_CPU) {
return std::make_pair(0, 0); // FIXME
}
ggml_backend_t backend = ggml_backend_dev_init(dev, nullptr);
printf("%s: device=%s, backend=%s, only_meta=%s\n",
__func__, ggml_backend_dev_description(dev), ggml_backend_name(backend), only_meta ? "yes" : "no");
@@ -1060,10 +1078,24 @@ static std::pair<int, int> test_roundtrip(ggml_backend_dev_t dev, const unsigned
bbuf = result.buffer;
}
struct gguf_buf gbuf = gguf_buf_init(16 * 1024);
gguf_write_to_buf(gguf_ctx_0, &gbuf, only_meta);
helper_write(gbuf.data, gbuf.offset, file);
rewind(file);
FILE * file = tmpfile();
#ifdef _WIN32
if (!file) {
printf("%s: failed to create tmpfile(), needs elevated privileges on Windows\n", __func__);
printf("%s: skipping tests\n", __func__);
return std::make_pair(0, 0);
}
#else
GGML_ASSERT(file);
#endif // _WIN32
{
std::vector<int8_t> buf;
gguf_write_to_buf(gguf_ctx_0, buf, only_meta);
GGML_ASSERT(fwrite(buf.data(), 1, buf.size(), file) == buf.size());
rewind(file);
}
struct ggml_context * ctx_1 = nullptr;
struct gguf_init_params gguf_params = {
@@ -1151,9 +1183,8 @@ static std::pair<int, int> test_roundtrip(ggml_backend_dev_t dev, const unsigned
ggml_free(ctx_1);
gguf_free(gguf_ctx_0);
gguf_free(gguf_ctx_1);
gguf_buf_free(gbuf);
ggml_backend_free(backend);
GGML_ASSERT(fclose(file) == 0);
fclose(file);
printf("\n");
return std::make_pair(npass, ntest);