Compare commits

..

30 Commits

Author SHA1 Message Date
Sigbjørn Skjæret 4b436e4e5e flake8 fix 2026-02-23 11:48:01 +01:00
Georgi Gerganov a6d3e9a239 ggml : relax asseerts for ggml_get_type_traits() 2026-02-22 21:37:58 +02:00
Georgi Gerganov 9c5d8dec37 gguf : add file size check for arrays 2026-02-22 21:36:56 +02:00
Georgi Gerganov c76408dbb9 gguf : add mem_size overflow test 2026-02-22 18:40:06 +02:00
Georgi Gerganov c79698f28a ggml : relax ggml_type asserts to debug-only 2026-02-22 16:32:39 +02:00
Georgi Gerganov 45250db0f8 ggml : remove deprecated ggml_type_sizef() 2026-02-22 16:23:57 +02:00
Georgi Gerganov dfac6caa40 ggml : print values when overflow 2026-02-22 16:09:53 +02:00
Georgi Gerganov 327e2ca6f2 gguf : minor print fix 2026-02-22 16:01:29 +02:00
Georgi Gerganov 09788740f3 gguf : fix ctx size for no_alloc == true 2026-02-22 15:54:44 +02:00
Georgi Gerganov 4e89ec67fa gguf : better name 2026-02-22 15:47:01 +02:00
Sigbjørn Skjæret 46a9a0656a enforce proper alignment in add_custom_alignment 2026-02-22 10:38:58 +01:00
Georgi Gerganov f2ac3ef57e py : restore tensor_fields 2026-02-22 10:54:14 +02:00
Georgi Gerganov 12c719b3f1 gguf-py : error on duplicate keys when reading 2026-02-22 09:48:16 +02:00
Georgi Gerganov 5d67acd422 ggml : check int overflow in ggml_new_tensor_impl and ggml_new_object 2026-02-22 09:48:16 +02:00
Georgi Gerganov d97dd299a0 py : assert that alignment is non-zero power of 2 2026-02-19 16:44:43 +02:00
Georgi Gerganov 2e23292cfe ggml : fix negative tensor type oob 2026-02-19 16:42:46 +02:00
Georgi Gerganov 7babe5fb13 gguf : prevent array elements exhaustion 2026-02-19 16:33:25 +02:00
Georgi Gerganov 357b8e50f1 gguf : prevent string exhaustion 2026-02-19 16:08:04 +02:00
Georgi Gerganov 69788e0d23 ggml : fix int overflows in ggml_new_object() 2026-02-19 15:59:09 +02:00
Georgi Gerganov 198f79d6c3 gguf : prevent integer overflow for ggml_context mem size 2026-02-19 15:51:00 +02:00
Georgi Gerganov da348c9dfb models : fix qwen3.5 beta/gate shapes (#19730)
* models : fix qwen3.5 beta/gate shapes

* cont : avoid extra reshapes
2026-02-19 15:19:53 +02:00
Saba Fallah e6267a9359 mtmd: build_attn modified, flash_attn on/off via ctx_params (#19729) 2026-02-19 13:50:29 +01:00
3 a l i 2bf318fd2f model : add JAIS-2 architecture support (#19488)
* model: add JAIS-2 architecture support

Add support for the JAIS-2 family of Arabic-English bilingual models
from Inception AI (https://huggingface.co/inceptionai/Jais-2-8B-Chat).

Architecture characteristics:
- LayerNorm (not RMSNorm) with biases
- ReLU² (ReLU squared) activation function
- Separate Q/K/V projections with biases
- Simple MLP without gate projection (up -> act -> down)
- RoPE positional embeddings
- GPT-2 BPE tokenizer

Supported model sizes:
- Jais-2-8B (32 layers, 26 heads, 3328 hidden)
- Jais-2-70B (68 layers, 56 heads, 7168 hidden)

Tested with quantizations: BF16, Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K_M, Q2_K

Note: JAIS-2 requires F32 precision accumulators for numerical stability
and uses standard attention (not flash attention) on CUDA backends.

* fix: run convert_hf_to_gguf_update.py for jais-2 tokenizer hash

* fix: use NEOX RoPE type for JAIS2

* fix: remove Q/K permutation (NEOX RoPE doesn't need it)

* fix: enable flash attention for JAIS2 (fixed by #19115)

* fix: add dedicated JAIS2 pre-tokenizer type and control vector support

- Add LLAMA_VOCAB_PRE_TYPE_JAIS2 with cascading whitespace regex
- Include original regex from tokenizer.json as comment
- Add build_cvec call for control vector support

* no longer necessary to override set_vocab

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-19 13:30:17 +01:00
Johannes Gäßler c78e682245 CUDA: fix kernel selection logic for tile FA (#19686)
* CUDA: fix kernel selection logic for tile FA

* add comment
2026-02-19 12:42:58 +01:00
Tarek Dakhran c5897995a7 mtmd : chat : Fix extra \n between text and media marker (#19595)
* mtmd : chat : Fix extra \n between text and media marker

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation.

However `llama-server` doesn't. I traced it down to extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.

PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change number of input tokens is identical to HF
implementation and as a result the output is also identical.

I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

* Refactor to use explicite per type ifs

* Update common/chat.cpp

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

* Update common_chat_templates_apply_legacy

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-19 12:18:57 +01:00
Aleksander Grygier 03fd9d3bb4 webui: Fix Attachments not being included in completion request (#19731)
* fix: Add missing argument

* chore: update webui build output
2026-02-19 10:27:38 +01:00
Tarek Dakhran 8004f3a8d1 model : add tokenizer from LFM2.5-Audio-1.5B (#19687)
* model : Add tokenizer from LFM2.5-Audio-1.5B

[LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) introduced lightweight audio tokenizer.

Tokenizer based on LFM2 architecture and acts as "embedding" model with
different input `n_embd` and output `n_embd_out`.

To be used in https://github.com/ggml-org/llama.cpp/pull/18641.

To convert use

```shell
python3 convert_hf_to_gguf.py /path/to/LFM2.5-Audio-1.5B/audio_detokenizer
```

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Formatting

* Rework check for attention layers

* Add LFM2 SWA model support

* Address PR feedback

* Set vocab to none

* Move helper function definitions to cpp file

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-19 09:54:48 +01:00
Daniel Bevenius eacb4b67a2 llama : use output_resolve_row() in get_logits_ith/get_embeddings_ith (#19663)
This commit updates get_logits_ith(), and get_embeddings_ith() to use
output_resolve_row() to resolve the batch index to output row index.

The motivation for this is to remove some code duplication between these
functions.
2026-02-19 09:48:08 +01:00
Ryan Mangeno c0d0430340 model : full modern bert support (#18330)
* full modern bert support

* added gelu op in rank pooling for modern bert

* still working on stuff, added mean calculation before classifier head

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* first layer is dense, as per modern bert research paper

* Update src/llama-graph.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fixed set input for mean pooling to check if pooling type is ranking since modern bert does mean & rank

* Update src/llama-graph.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-19 08:52:21 +01:00
shalinib-ibm 3bb2fcc856 llamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (#19709)
Avoid xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8
inputs to FP16 and using FP16×FP16→FP32 MMA. This removes
post-processing overhead and improves performance.

Performance Impact:
1.5 ~ 2x improvement in PP_Speed for Q4 and Q8 Models,
measured with llama-bench and llama-batched-bench.
Q8 Model: granite-4.0-h-micro-Q8_0.gguf (from huggingface)
Q4 Model: Meta-Llama3-8b Q4 model (generated with llama-quantize from
f32 model)

llama-bench Q8 Model Results:
 model                          	       size 	     params 	 backend    	 threads 	            test 	Base t/s	Patch t/s
 granitehybrid 3B Q8_0          	   3.16 GiB 	     3.19 B 	 CPU        	      10 	             pp8 	         64.48 ± 4.72 	         73.99 ± 0.27
 granitehybrid 3B Q8_0          	   3.16 GiB 	     3.19 B 	 CPU        	      10 	            pp16 	         80.11 ± 0.32 	        112.53 ± 0.40
 granitehybrid 3B Q8_0          	   3.16 GiB 	     3.19 B 	 CPU        	      10 	            pp32 	         89.10 ± 0.27 	        152.95 ± 0.68
 granitehybrid 3B Q8_0          	   3.16 GiB 	     3.19 B 	 CPU        	      10 	            pp64 	         93.65 ± 0.25 	        187.83 ± 0.83
 granitehybrid 3B Q8_0          	   3.16 GiB 	     3.19 B 	 CPU        	      10 	           pp128 	         99.93 ± 0.02 	        201.32 ± 0.11
 granitehybrid 3B Q8_0          	   3.16 GiB 	     3.19 B 	 CPU        	      10 	           pp256 	        102.32 ± 0.40 	        208.32 ± 0.41
 granitehybrid 3B Q8_0          	   3.16 GiB 	     3.19 B 	 CPU        	      10 	           pp512 	        103.42 ± 0.40 	        209.98 ± 0.14
 granitehybrid 3B Q8_0          	   3.16 GiB 	     3.19 B 	 CPU        	      10 	           tg128 	         20.35 ± 0.01 	         19.57 ± 0.01

llama-bench Q4 Model Results:
 model                          	       size 	     params 	 backend    	 threads 	            test 	              Base    t/s 	               Patch   t/s
 llama 8B Q4_0                  	   4.33 GiB 	     8.03 B 	 CPU        	      10 	             pp8 	         34.77 ± 0.10 	         41.23 ± 0.08
 llama 8B Q4_0                  	   4.33 GiB 	     8.03 B 	 CPU        	      10 	            pp16 	         40.81 ± 0.04 	         64.55 ± 0.15
 llama 8B Q4_0                  	   4.33 GiB 	     8.03 B 	 CPU        	      10 	            pp32 	         44.65 ± 0.05 	         90.84 ± 0.22
 llama 8B Q4_0                  	   4.33 GiB 	     8.03 B 	 CPU        	      10 	            pp64 	         47.49 ± 0.03 	        114.39 ± 0.11
 llama 8B Q4_0                  	   4.33 GiB 	     8.03 B 	 CPU        	      10 	           pp128 	         49.29 ± 0.24 	        120.13 ± 0.19
 llama 8B Q4_0                  	   4.33 GiB 	     8.03 B 	 CPU        	      10 	           pp256 	         49.77 ± 0.23 	        121.51 ± 0.11
 llama 8B Q4_0                  	   4.33 GiB 	     8.03 B 	 CPU        	      10 	           pp512 	         49.89 ± 0.23 	        117.52 ± 0.10
 llama 8B Q4_0                  	   4.33 GiB 	     8.03 B 	 CPU        	      10 	           tg128 	         13.40 ± 0.01 	         13.37 ± 0.00

Llama perplexity Results:

Model	                    Base Final PPL Estimate	Patch Final PPL Estimate
granite-4.0-h-micro-Q8_0    1.3862 +/- 0.04424	        1.3868 +/- 0.04432
Meta-Llama3-8b Q4	    1.3801 +/- 0.04116	        1.3803 +/- 0.04116

Signed-off-by: Shalini.Salomi.Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2026-02-19 14:28:53 +08:00
38 changed files with 1175 additions and 743 deletions
+15 -4
View File
@@ -65,14 +65,25 @@ json common_chat_msg::to_json_oaicompat(bool concat_typed_text) const {
} else if (!content_parts.empty()) {
if (concat_typed_text) {
std::string text;
bool last_was_media_marker = false;
// join parts with newline, do not add newline before or after media markers
for (const auto & part : content_parts) {
if (part.type != "text") {
bool add_new_line = true;
if (part.type == "text") {
add_new_line = !last_was_media_marker && !text.empty();
last_was_media_marker = false;
} else if (part.type == "media_marker") {
add_new_line = false;
last_was_media_marker = true;
} else {
LOG_WRN("Ignoring content part type: %s\n", part.type.c_str());
continue;
}
if (!text.empty()) {
if (add_new_line) {
text += '\n';
}
text += part.text;
}
jmsg["content"] = text;
@@ -319,7 +330,7 @@ std::vector<common_chat_msg> common_chat_msgs_parse_oaicompat(const json & messa
throw std::invalid_argument("Missing content part type: " + part.dump());
}
const auto & type = part.at("type");
if (type != "text") {
if (type != "text" && type != "media_marker") {
throw std::invalid_argument("Unsupported content part type: " + type.dump());
}
common_chat_msg_content_part msg_part;
@@ -3307,7 +3318,7 @@ static common_chat_params common_chat_templates_apply_legacy(
for (const auto & msg : inputs.messages) {
auto content = msg.content;
for (const auto & part : msg.content_parts) {
if (part.type != "text") {
if (part.type != "text" && part.type != "media_marker") {
LOG_WRN("Ignoring non-text content part: %s\n", part.type.c_str());
continue;
}
+45 -5
View File
@@ -1163,6 +1163,9 @@ class TextModel(ModelBase):
if chkhsh == "b53802fb28e26d645c3a310b34bfe07da813026ec7c7716883404d5e0f8b1901":
# ref: https://huggingface.co/core42/jais-13b
res = "jais"
if chkhsh == "bc5108ee1eb6a3d600cadd065f63190fbd0554dbc9e4bbd6a0d977970afc8d2a":
# ref: https://huggingface.co/inceptionai/Jais-2-8B-Chat
res = "jais-2"
if chkhsh == "7b3e7548e4308f52a76e8229e4e6cc831195d0d1df43aed21ac6c93da05fec5f":
# ref: https://huggingface.co/WisdomShell/CodeShell-7B
res = "codeshell"
@@ -8633,6 +8636,17 @@ class T5EncoderModel(TextModel):
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("Jais2ForCausalLM")
class Jais2Model(TextModel):
model_arch = gguf.MODEL_ARCH.JAIS2
def set_gguf_parameters(self):
super().set_gguf_parameters()
hparams = self.hparams
head_dim = hparams.get("head_dim", hparams["hidden_size"] // hparams["num_attention_heads"])
self.gguf_writer.add_rope_dimension_count(head_dim)
@ModelBase.register("JAISLMHeadModel")
class JaisModel(TextModel):
model_arch = gguf.MODEL_ARCH.JAIS
@@ -10726,7 +10740,7 @@ class LFM2Model(TextModel):
def set_gguf_parameters(self):
# set num_key_value_heads only for attention layers
self.hparams["num_key_value_heads"] = [
self.hparams["num_key_value_heads"] if layer_type == "full_attention" else 0
self.hparams["num_key_value_heads"] if layer_type != "conv" else 0
for layer_type in self.hparams["layer_types"]
]
@@ -10912,6 +10926,28 @@ class LFM2AudioModel(ConformerAudioModel):
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("Lfm25AudioTokenizer")
class LFM25AudioTokenizer(LFM2Model):
model_arch = gguf.MODEL_ARCH.LFM2
def set_vocab(self):
self._set_vocab_none()
def set_gguf_parameters(self):
super().set_gguf_parameters()
self.gguf_writer.add_sliding_window(self.hparams["sliding_window"])
self.gguf_writer.add_embedding_length_out(self.hparams["output_size"])
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name == "istft.window" or name.startswith("emb.emb"):
return
if name.startswith("lin"):
name = name.replace("lin", "dense_2_out")
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("SmallThinkerForCausalLM")
class SmallThinkerModel(TextModel):
model_arch = gguf.MODEL_ARCH.SMALLTHINKER
@@ -11003,13 +11039,17 @@ class ModernBertModel(BertModel):
self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# these layers act as MLM head, so we don't need them
if name.startswith("decoder."):
return
if name.startswith("model."):
name = name[6:]
if self.cls_out_labels:
# For BertForSequenceClassification (direct projection layer)
if name == "classifier.weight":
name = "classifier.out_proj.weight"
if name == "classifier.bias":
name = "classifier.out_proj.bias"
yield from super().modify_tensors(data_torch, name, bid)
+1
View File
@@ -114,6 +114,7 @@ models = [
{"name": "gemma", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/google/gemma-2b", },
{"name": "gemma-2", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/google/gemma-2-9b", },
{"name": "jais", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/core42/jais-13b", },
{"name": "jais-2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/inceptionai/Jais-2-8B-Chat", },
{"name": "t5", "tokt": TOKENIZER_TYPE.UGM, "repo": "https://huggingface.co/google-t5/t5-small", },
{"name": "codeshell", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/WisdomShell/CodeShell-7B", },
{"name": "tekken", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistralai/Mistral-Nemo-Base-2407", },
-4
View File
@@ -730,10 +730,6 @@ extern "C" {
GGML_API size_t ggml_type_size(enum ggml_type type); // size in bytes for all elements in a block
GGML_API size_t ggml_row_size (enum ggml_type type, int64_t ne); // size in bytes for all elements in a row
GGML_DEPRECATED(
GGML_API double ggml_type_sizef(enum ggml_type type), // ggml_type_size()/ggml_blck_size() as float
"use ggml_row_size() instead");
GGML_API const char * ggml_type_name(enum ggml_type type);
GGML_API const char * ggml_op_name (enum ggml_op op);
GGML_API const char * ggml_op_symbol(enum ggml_op op);
-333
View File
@@ -1,333 +0,0 @@
#pragma once
typedef vector unsigned char vec_t;
typedef __vector_quad acc_t;
template <typename TA>
class tinyBLAS_Q0_PPC {
public:
tinyBLAS_Q0_PPC(int64_t k,
const TA *A, int64_t lda,
const block_q8_0 *B, int64_t ldb,
float *C, int64_t ldc,
int ith, int nth);
void matmul(int64_t m, int64_t n);
void matmul_tiled_q0(int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
vec_t A_pack[mc*kc*2];
vec_t B_pack[nc*kc*2];
int comparray[mc*kc];
constexpr bool is_Ablock_q4 = std::is_same_v<TA, block_q4_0>;
int64_t ytiles = m / mc;
int64_t xtiles = n / nc;
int64_t tiles = xtiles * ytiles;
int64_t duty = (tiles + nth - 1) / nth;
int64_t start = duty * ith;
int64_t end = start + duty;
if (end > tiles) {
end = tiles;
}
for (int64_t job = start; job < end; ++job) {
int64_t ii = (job / xtiles) * mc;
int64_t jj = (job % xtiles) * nc;
for (int64_t kk = 0; kk < k; kk += kc) {
if constexpr(is_Ablock_q4) {
packNormalInt4_large(A + ii*lda + kk, lda, mc, 4, (int8_t*)A_pack, comparray);
} else {
packNormal_large<int8_t, vector signed char>(A + ii*lda + kk, lda, mc, 8, (int8_t*)A_pack, false, comparray);
}
packNormal_large<uint8_t, vector unsigned char>(B + jj*ldb + kk, ldb, nc, 8, (uint8_t*)B_pack, true);
KERNEL_Q0(ii, jj, mc, nc, kc, kk, A_pack, B_pack, comparray);
}
}
}
private:
inline void save_res(int ii, int jj, int idx, vector float* fin_res, int RM=4, int RN=4) {
for (int I = 0; I < RM; I++) {
for (int J = 0; J < RN; J++) {
*((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&fin_res[idx+I]+J);
}
}
}
inline void add_save_res(int ii, int jj, int idx, vector float* fin_res, int RM=4, int RN=4) {
for (int I = 0; I < RM; I++) {
for (int J = 0; J < RN; J++) {
float * c_ptr = (float *)(C+ii+((jj+J)*ldc)+I);
*c_ptr += *((float*)&fin_res[idx+I]+J);
}
}
}
template<typename ArrayType>
inline void compute(acc_t* ACC, int c_idx, int s_idx, ArrayType& comparray, vector float* vs, vector float* fin_res) {
vector signed int vec_C[4];
vector float CA[4] = {0};
vector float res[4] = {0};
__builtin_mma_disassemble_acc(vec_C, ACC);
for (int i = 0; i < 4; i++) {
CA[i] = vec_splats((float)(((double)comparray[c_idx+i]) * -128.0));
res[i] = vec_add(vec_ctf(vec_C[i], 0), CA[i]);
fin_res[s_idx+i] = vec_madd(res[i], vs[s_idx+i], fin_res[s_idx+i]);
}
}
inline void process_q4_elements(vector signed char (&c)[2], int* ca) {
const vector signed char lowMask = vec_splats((signed char)0xF);
const vector unsigned char v4 = vec_splats((unsigned char)0x4);
const vector signed char v8 = vec_splats((signed char)0x8);
vector signed int vsum = {0};
vector signed int vsum2 = {0};
c[0] = vec_and(c[1], lowMask);
c[1] = vec_sr(c[1], v4);
c[0] = vec_sub(c[0], v8);
c[1] = vec_sub(c[1], v8);
vsum = vec_sum4s(c[0], vsum);
vsum2 = vec_sum4s(c[1], vsum2);
vsum = vec_add(vsum, vsum2);
*(ca) = vsum[0] + vsum[1] + vsum[2] + vsum[3];
}
template <typename V1, typename V2>
inline void vector_permute_store(V2 &s1, V2 &s2, V2 &s3, V2 &s4, V1 *vecOffset, bool flip) {
vector unsigned char swiz1 = {0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23};
vector unsigned char swiz2 = {8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31};
vector unsigned char swiz3 = {0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27};
vector unsigned char swiz4 = {4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31};
V2 t1, t2, t3, t4, t5, t6, t7, t8;
vector unsigned char xor_vector;
uint8_t flip_vec = 0x80;
xor_vector = vec_splats(flip_vec);
t1 = vec_perm(s1, s2, swiz1);
t2 = vec_perm(s1, s2, swiz2);
t3 = vec_perm(s3, s4, swiz1);
t4 = vec_perm(s3, s4, swiz2);
t5 = vec_perm(t1, t3, swiz3);
t6 = vec_perm(t1, t3, swiz4);
t7 = vec_perm(t2, t4, swiz3);
t8 = vec_perm(t2, t4, swiz4);
if (flip == true) {
t5 = vec_xor(t5, xor_vector);
t6 = vec_xor(t6, xor_vector);
t7 = vec_xor(t7, xor_vector);
t8 = vec_xor(t8, xor_vector);
}
vec_xst(t5, 0, vecOffset);
vec_xst(t6, 0, vecOffset+16);
vec_xst(t7, 0, vecOffset+32);
vec_xst(t8, 0, vecOffset+48);
}
template<int RM, int RN>
inline void kernel(int64_t ii, int64_t jj) {
if constexpr(RM == 4 && RN == 8) {
KERNEL_4x8(ii,jj);
} else if constexpr(RM == 8 && RN == 4) {
KERNEL_8x4(ii,jj);
} else if constexpr(RM == 8 && RN == 8) {
KERNEL_8x8(ii,jj);
} else {
assert(false && "RN/RM values not supported");
}
}
template<int size>
void packNormalInt4(const TA* a, int64_t lda, int rows, int cols, int8_t* vec, std::array<int, size>& comparray);
template<typename VA, typename VB>
void packNormal(const block_q8_0* a, int64_t lda, int rows, int cols, VA* vec, bool flip);
void mnpack(int64_t m0, int64_t m, int64_t n0, int64_t n);
void KERNEL_4x8(int64_t ii, int64_t jj);
void KERNEL_8x4(int64_t ii, int64_t jj);
void KERNEL_8x8(int64_t ii, int64_t jj);
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n, int RM, int RN);
template <int RM, int RN>
void gemm(int64_t m0, int64_t m, int64_t n0, int64_t n);
void compute_scale(int64_t ii, int64_t jj, int blk, vector float* vs){
for (int I = 0; I<8; I++) {
float a_scale = unhalf((A+((ii+I)*lda)+blk)->d);
for (int J = 0; J<4; J++) {
*((float*)&vs[I]+J) = (a_scale * unhalf((B+((jj+J)*ldb)+blk)->d));
*((float*)&vs[I+8]+J) = (a_scale * unhalf((B+((jj+J+4)*ldb)+blk)->d));
}
}
}
inline void process_q8_elements(const int8_t *qs, int *ca) {
vector signed char c1 = vec_xl(0, qs);
vector signed char c2 = vec_xl(16, qs);
vector signed int vsum1 = {0};
vector signed int vsum2 = {0};
vsum1 = vec_sum4s(c1, vsum1);
vsum2 = vec_sum4s(c2, vsum2);
vector signed int vsum = vec_add(vsum1, vsum2);
*ca = vsum[0] + vsum[1] + vsum[2] + vsum[3];
}
template<typename VA, typename VB>
void packNormal_large(const block_q8_0* a, int64_t lda, int rows, int cols, VA* vec, bool flip, int* comparray=nullptr) {
int64_t i, j;
block_q8_0 *aoffset = NULL;
VA *vecOffset = NULL;
block_q8_0* aoffsets[8];
__vector_pair arr[8];
VB c[8][2] = {0};
VB c1[8] = {0}; VB c2[8] = {0};
aoffset = const_cast<block_q8_0*>(a);
vecOffset = vec;
j = (rows >> 3);
int index = 0;
if (j > 0) {
do {
for (int it = 0; it < 8; it++)
aoffsets[it] = aoffset + it*lda;
aoffset += 8 * lda;
for (int blk = 0; blk < kc; blk++) {
for (int it = 0; it < 8; it++) {
arr[it] = __builtin_vsx_lxvp(0, (__vector_pair*)(aoffsets[it]+blk)->qs);
__builtin_vsx_disassemble_pair(c[it], &arr[it]);
c1[it] = c[it][0];
c2[it] = c[it][1];
if (comparray){
process_q8_elements((aoffsets[it]+ blk)->qs, &comparray[index + 8*blk + it]);
}
}
vector_permute_store<VA, VB>(c1[0], c1[1], c1[2], c1[3], vecOffset, flip);
vector_permute_store<VA, VB>(c2[0], c2[1], c2[2], c2[3], vecOffset+64, flip);
vector_permute_store<VA, VB>(c1[4], c1[5], c1[6], c1[7], vecOffset+128, flip);
vector_permute_store<VA, VB>(c2[4], c2[5], c2[6], c2[7], vecOffset+192, flip);
vecOffset += 256;
}
j--;
index += 8*kc;
} while(j > 0);
}
}
void packNormalInt4_large(const TA* a, int64_t lda, int rows, int cols, int8_t* vec, int*comparray) {
int64_t i, j;
TA *aoffset = NULL;
int8_t *vecOffset = NULL;
TA *aoffset1 = NULL, *aoffset2 = NULL, *aoffset3 = NULL, *aoffset4 = NULL;
TA *aoffset5 = NULL, *aoffset6 = NULL, *aoffset7 = NULL, *aoffset8 = NULL;
vector signed char c1[2] = {0}, c2[2] = {0}, c3[2] = {0}, c4[2] = {0};
vector signed char c5[2] = {0}, c6[2] = {0}, c7[2] = {0}, c8[2] = {0};
aoffset = const_cast<TA*>(a);
vecOffset = vec;
int index = 0;
j = (rows >> 3);
if (j > 0) {
do {
aoffset1 = aoffset;
aoffset2 = aoffset1 + lda;
aoffset3 = aoffset2 + lda;
aoffset4 = aoffset3 + lda;
aoffset5 = aoffset4 + lda;
aoffset6 = aoffset5 + lda;
aoffset7 = aoffset6 + lda;
aoffset8 = aoffset7 + lda;
aoffset += 8 * lda;
for (int blk = 0; blk < kc; blk++) {
c1[1] = reinterpret_cast<vector signed char>(vec_xl(0, (aoffset1+blk)->qs));
c2[1] = reinterpret_cast<vector signed char>(vec_xl(0, (aoffset2+blk)->qs));
c3[1] = reinterpret_cast<vector signed char>(vec_xl(0, (aoffset3+blk)->qs));
c4[1] = reinterpret_cast<vector signed char>(vec_xl(0, (aoffset4+blk)->qs));
c5[1] = reinterpret_cast<vector signed char>(vec_xl(0, (aoffset5+blk)->qs));
c6[1] = reinterpret_cast<vector signed char>(vec_xl(0, (aoffset6+blk)->qs));
c7[1] = reinterpret_cast<vector signed char>(vec_xl(0, (aoffset7+blk)->qs));
c8[1] = reinterpret_cast<vector signed char>(vec_xl(0, (aoffset8+blk)->qs));
process_q4_elements(c1, &comparray[index + 8*blk+0]);
process_q4_elements(c2, &comparray[index + 8*blk+1]);
process_q4_elements(c3, &comparray[index + 8*blk+2]);
process_q4_elements(c4, &comparray[index + 8*blk+3]);
process_q4_elements(c5, &comparray[index + 8*blk+4]);
process_q4_elements(c6, &comparray[index + 8*blk+5]);
process_q4_elements(c7, &comparray[index + 8*blk+6]);
process_q4_elements(c8, &comparray[index + 8*blk+7]);
vector_permute_store<int8_t, vector signed char>(c1[0], c2[0], c3[0], c4[0], vecOffset, false);
vector_permute_store<int8_t, vector signed char>(c1[1], c2[1], c3[1], c4[1], vecOffset+64, false);
vector_permute_store<int8_t, vector signed char>(c5[0], c6[0], c7[0], c8[0], vecOffset+128, false);
vector_permute_store<int8_t, vector signed char>(c5[1], c6[1], c7[1], c8[1], vecOffset+192, false);
vecOffset += 256;
}
j--;
index += 8*kc;
} while (j > 0);
}
}
void KERNEL_Q0(int64_t ii, int64_t jj, int64_t mc, int64_t nc, int64_t kc, int64_t l, vec_t *vec_A, vec_t *vec_B, int *comparray) {
acc_t acc[8];
for (int i = 0; i < mc ; i += 8) {
for (int j = 0; j < nc; j += 8) {
vector float fin_res[16] = {0};
vector float vs[16] = {0};
for (int64_t kk = 0; kk < kc; kk+=2) {
for (int x = 0; x < 8; x++) {
__builtin_mma_xxsetaccz(&acc[x]);
}
int A_block_idx = (i/8)*(16*kc) + kk*16;
int B_block_idx = (j/8)*(16*kc)+ kk*16;
vec_t *A_block = &vec_A[A_block_idx];
vec_t *B_block = &vec_B[B_block_idx];
for (int x = 0; x < 8; x++) {
__builtin_mma_xvi8ger4pp(&acc[0], A_block[x], B_block[x]);
__builtin_mma_xvi8ger4pp(&acc[1], A_block[x + 8], B_block[x]);
__builtin_mma_xvi8ger4pp(&acc[2], A_block[x], B_block[x+8]);
__builtin_mma_xvi8ger4pp(&acc[3], A_block[x+8], B_block[x+8]);
}
compute_scale(ii+i, jj+j, l+kk, vs);
int c_index = (i/8)*(8*kc)+ kk*8;
int* c_block = &comparray[c_index];
compute(&acc[0], 0, 0, c_block, vs, fin_res);
compute(&acc[1], 4, 4, c_block, vs, fin_res);
compute(&acc[2], 0, 8, c_block, vs, fin_res);
compute(&acc[3], 4, 12, c_block, vs, fin_res);
A_block_idx = (i/8)*(16*kc) + (kk+1)*16;
B_block_idx = (j/8)*(16*kc)+ (kk+1)*16;
A_block = &vec_A[A_block_idx];
B_block = &vec_B[B_block_idx];
for (int x = 0; x < 8; x++) {
__builtin_mma_xvi8ger4pp(&acc[4], A_block[x], B_block[x]);
__builtin_mma_xvi8ger4pp(&acc[5], A_block[x + 8], B_block[x]);
__builtin_mma_xvi8ger4pp(&acc[6], A_block[x], B_block[x+8]);
__builtin_mma_xvi8ger4pp(&acc[7], A_block[x+8], B_block[x+8]);
}
compute_scale(ii+i, jj+j, l+kk+1, vs);
c_index = (i/8)*(8*kc)+ (kk+1)*8;
c_block = &comparray[c_index];
compute(&acc[4], 0, 0, c_block, vs, fin_res);
compute(&acc[5], 4, 4, c_block, vs, fin_res);
compute(&acc[6], 0, 8, c_block, vs, fin_res);
compute(&acc[7], 4, 12, c_block, vs, fin_res);
}
if (l == 0) {
save_res(ii+i, jj+j, 0, fin_res);
save_res(ii+i+4, jj+j, 4, fin_res);
save_res(ii+i, jj+j+4, 8, fin_res);
save_res(ii+i+4, jj+j+4, 12, fin_res);
} else {
add_save_res(ii+i, jj+j, 0, fin_res);
add_save_res(ii+i+4, jj+j, 4, fin_res);
add_save_res(ii+i, jj+j+4, 8, fin_res);
add_save_res(ii+i+4, jj+j+4, 12, fin_res);
}
}
}
}
const TA *const A;
const block_q8_0 *const B;
float *C;
const int64_t k;
int64_t kc;
const int64_t lda;
const int64_t ldb;
const int64_t ldc;
const int ith;
const int nth;
};
+507 -155
View File
@@ -121,7 +121,8 @@ inline float32x4_t mul(float32x4_t x, float32x4_t y) { return vec_mul(x, y); }
#endif
#if defined(__MMA__)
#include "sgemm-ppc.h"
typedef vector unsigned char vec_t;
typedef __vector_quad acc_t;
#endif
////////////////////////////////////////////////////////////////////////////////////////////////////
// VECTORIZED FUSED MULTIPLY ADD
@@ -2153,7 +2154,7 @@ class tinyBLAS_HP16_PPC {
packNormal((B+(jj*ldb)+l), ldb, 8, 4, (uint8_t*)vec_B);
for (int x = 0; x < 4; x++) {
mma_instr<TA>::outer_product(&acc_0, vec_A[x], vec_B[x]);
mma_instr<TA>::outer_product(&acc_1, vec_A[x], vec_B[x+4]);
mma_instr<TA>::outer_product(&acc_1, vec_A[x+4], vec_B[x]);
}
}
SAVE_ACC(&acc_0, ii, jj);
@@ -2301,43 +2302,299 @@ class tinyBLAS_HP16_PPC {
const int nth;
};
template <typename TA>
tinyBLAS_Q0_PPC<TA>::tinyBLAS_Q0_PPC(int64_t k,
const TA *A, int64_t lda,
const block_q8_0 *B, int64_t ldb,
float *C, int64_t ldc,
int ith, int nth)
template <typename TA>
class tinyBLAS_Q0_PPC {
public:
tinyBLAS_Q0_PPC(int64_t k,
const TA * A, int64_t lda,
const block_q8_0 * B, int64_t ldb,
float * C, int64_t ldc,
int ith, int nth)
: A(A), B(B), C(C), k(k), lda(lda), ldb(ldb), ldc(ldc), ith(ith), nth(nth) {
kc = 64;
}
template<typename TA>
void tinyBLAS_Q0_PPC<TA>::matmul(int64_t m, int64_t n) {
int mc = 64; int nc = 64;
if (n % 8 == 0 && n < nc) {
nc = n;
mc = 32 ;
kc = 32;
void matmul(int64_t m, int64_t n) {
const int64_t mc = 64;
const int64_t kc = 64;
int64_t nc = 64;
int64_t n_aligned = 0;
if (n % 64 == 0) {
n_aligned = n;
} else if (n == 4) {
n_aligned = 4;
} else if (n < 64) {
n_aligned = (n / 8) * 8;
} else {
n_aligned = (n / 64) * 64;
}
const bool is_aligned = ((m & (mc - 1)) == 0) & ((n & (nc - 1)) == 0) & ((k & (kc - 1)) == 0);
if (is_aligned) {
this->matmul_tiled_q0(m, n, mc, nc, kc);
if (n_aligned > 0) {
if (n_aligned % 64 == 0) nc = 64;
else if (n_aligned == n) nc = n;
else if (n_aligned % 32 == 0) nc = 32;
else if (n_aligned % 24 == 0) nc = 24;
else if (n_aligned % 16 == 0) nc = 16;
else nc = 8;
}
bool can_use_tiled = n_aligned > 0 && (m % mc == 0) && (k % kc == 0);
if (can_use_tiled) {
matmul_tiled(m, n_aligned, mc, nc, kc);
if (n > n_aligned) {
mnpack(0, m, n_aligned, n);
}
} else {
mnpack(0, m, 0, n);
}
}
template<typename TA>
template<int size>
void tinyBLAS_Q0_PPC<TA>::packNormalInt4(const TA* a, int64_t lda, int rows, int cols, int8_t* vec, std::array<int, size>& comparray) {
private:
inline void save_res(int ii, int jj, int idx, vector float * fin_res, int RM = 4, int RN = 4) {
for (int I = 0; I < RM; I++) {
for (int J = 0; J < RN; J++) {
*((float *)(C + ii + ((jj + J) * ldc) + I)) = *((float *)&fin_res[idx + I] + J);
}
}
}
inline void save_acc(acc_t * ACC, int64_t ii, int64_t jj) {
vec_t vec_C[4];
__builtin_mma_disassemble_acc(vec_C, ACC);
for (int I = 0; I < 4; I++) {
for (int J = 0; J < 4; J++) {
*((float *)(C + ii + ((jj + J) * ldc) + I)) = *((float *)&vec_C[I] + J);
}
}
}
inline void add_save_acc(acc_t * ACC, int64_t ii, int64_t jj) {
vec_t vec_C[4];
__builtin_mma_disassemble_acc(vec_C, ACC);
for (int I = 0; I < 4; I++) {
for (int J = 0; J < 4; J++) {
float * c_ptr = (float *)(C + ii+ ((jj + J) * ldc) + I);
*c_ptr += *((float *)&vec_C[I] + J);
}
}
}
template<typename ArrayType>
inline void compute(acc_t * ACC, int c_idx, int s_idx, ArrayType & comparray, vector float * vs, vector float * fin_res) {
vector signed int vec_C[4];
vector float CA[4] = {0};
vector float res[4] = {0};
__builtin_mma_disassemble_acc(vec_C, ACC);
for (int i = 0; i < 4; i++) {
CA[i] = vec_splats((float)(((double)comparray[c_idx + i]) * -128.0));
res[i] = vec_add(vec_ctf(vec_C[i], 0), CA[i]);
fin_res[s_idx + i] = vec_madd(res[i], vs[s_idx + i], fin_res[s_idx + i]);
}
}
inline void process_q4_elements(vector signed char (&c)[2], int * ca) {
const vector signed char lowMask = vec_splats((signed char)0xF);
const vector unsigned char v4 = vec_splats((unsigned char)0x4);
const vector signed char v8 = vec_splats((signed char)0x8);
vector signed int vsum = {0};
vector signed int vsum2 = {0};
c[0] = vec_and(c[1], lowMask);
c[1] = vec_sr(c[1], v4);
c[0] = vec_sub(c[0], v8);
c[1] = vec_sub(c[1], v8);
vsum = vec_sum4s(c[0], vsum);
vsum2 = vec_sum4s(c[1], vsum2);
vsum = vec_add(vsum, vsum2);
*(ca) = vsum[0] + vsum[1] + vsum[2] + vsum[3];
}
template <typename V1, typename V2>
inline void vector_permute_store(V2 & s1, V2 & s2, V2 & s3, V2 & s4, V1 * vecOffset, bool flip) {
vector unsigned char swiz1 = {0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23};
vector unsigned char swiz2 = {8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31};
vector unsigned char swiz3 = {0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27};
vector unsigned char swiz4 = {4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31};
V2 t1, t2, t3, t4, t5, t6, t7, t8;
vector unsigned char xor_vector;
uint8_t flip_vec = 0x80;
xor_vector = vec_splats(flip_vec);
t1 = vec_perm(s1, s2, swiz1);
t2 = vec_perm(s1, s2, swiz2);
t3 = vec_perm(s3, s4, swiz1);
t4 = vec_perm(s3, s4, swiz2);
t5 = vec_perm(t1, t3, swiz3);
t6 = vec_perm(t1, t3, swiz4);
t7 = vec_perm(t2, t4, swiz3);
t8 = vec_perm(t2, t4, swiz4);
if (flip == true) {
t5 = vec_xor(t5, xor_vector);
t6 = vec_xor(t6, xor_vector);
t7 = vec_xor(t7, xor_vector);
t8 = vec_xor(t8, xor_vector);
}
vec_xst(t5, 0, vecOffset);
vec_xst(t6, 0, vecOffset + 16);
vec_xst(t7, 0, vecOffset + 32);
vec_xst(t8, 0, vecOffset + 48);
}
inline void unpack_q4_to_q8(vector signed char packed, vector signed char & lo, vector signed char & hi) {
const vector signed char lowMask = vec_splats((signed char)0x0F);
const vector signed char v8 = vec_splats((signed char)0x08);
const vector unsigned char v4 = vec_splats((unsigned char)4);
lo = vec_and(packed, lowMask);
hi = vec_sr(packed, v4);
lo = vec_sub(lo, v8);
hi = vec_sub(hi, v8);
}
inline void vector_permute_store_fp16(vec_t * c, unsigned char * vecOffset) {
vec_t t[8], s[8];
vec_t swiz1 = {0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23};
vec_t swiz2 = {8, 9, 10, 11, 24, 25, 26, 27, 12, 13, 14, 15, 28, 29, 30, 31};
vec_t swiz3 = {0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23};
vec_t swiz4 = {8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31};
for (int i = 0; i < 4; i += 2) {
t[i + 0] = vec_perm(c[i + 0], c[i + 1], swiz1);
t[i + 1] = vec_perm(c[i + 0], c[i + 1], swiz2);
}
for (int i = 4; i < 8; i += 2) {
t[i + 0] = vec_perm(c[i + 0], c[i + 1], swiz1);
t[i + 1] = vec_perm(c[i + 0], c[i + 1], swiz2);
}
s[0] = vec_perm(t[0], t[2], swiz3);
s[1] = vec_perm(t[0], t[2], swiz4);
s[2] = vec_perm(t[1], t[3], swiz3);
s[3] = vec_perm(t[1], t[3], swiz4);
s[4] = vec_perm(t[4], t[6], swiz3);
s[5] = vec_perm(t[4], t[6], swiz4);
s[6] = vec_perm(t[5], t[7], swiz3);
s[7] = vec_perm(t[5], t[7], swiz4);
for (int i = 0; i < 8; ++i) {
vec_xst(s[i], 0, (vec_t *)(vecOffset + i * 16));
}
}
static inline void convert_and_scale_q8(vector signed char raw, vector float v_scale, vector unsigned short & out_hi, vector unsigned short & out_lo) {
vector signed short i16_hi = vec_unpackh(raw);
vector signed short i16_lo = vec_unpackl(raw);
vector float f_hi_h = vec_ctf(vec_unpackh(i16_hi), 0);
vector float f_hi_l = vec_ctf(vec_unpackl(i16_hi), 0);
vector float f_lo_h = vec_ctf(vec_unpackh(i16_lo), 0);
vector float f_lo_l = vec_ctf(vec_unpackl(i16_lo), 0);
out_hi = vec_pack_to_short_fp32(vec_mul(f_hi_h, v_scale), vec_mul(f_hi_l, v_scale));
out_lo = vec_pack_to_short_fp32(vec_mul(f_lo_h, v_scale), vec_mul(f_lo_l, v_scale));
}
void packNormal_q4_fp16(const block_q4_0 * a, int64_t lda, int rows, int blocks, unsigned char * vec) {
unsigned char * vecOffset = vec;
for (int i = 0; i < rows; i += 8) {
const block_q4_0 * rows_base[8];
for (int r = 0; r < 8; r++) {
rows_base[r] = a + (i + r) * lda;
}
for (int blk = 0; blk < blocks; blk++) {
vector unsigned short hp_res[8][4];
for (int r = 0; r < 8; r++) {
const block_q4_0 * current_blk = rows_base[r] + blk;
vector float v_scale = vec_extract_fp32_from_shorth(vec_splats(current_blk->d));
vector signed char v_qs = reinterpret_cast<vector signed char>(vec_xl(0, current_blk->qs));
vector signed char c1, c2;
unpack_q4_to_q8(v_qs, c1, c2);
convert_and_scale_q8(c1, v_scale, hp_res[r][0], hp_res[r][1]);
convert_and_scale_q8(c2, v_scale, hp_res[r][2], hp_res[r][3]);
}
for (int c = 0; c < 4; c++) {
vector unsigned char c_arr[8];
for (int r = 0; r < 8; r++) {
c_arr[r] = (vector unsigned char)hp_res[r][c];
}
vector_permute_store_fp16((vec_t *)c_arr, vecOffset);
vecOffset += 128;
}
}
}
}
template <int chunk_size>
static inline void pack_q8_block(const block_q8_0 * a, int64_t lda, int rows, int blocks, unsigned char * vec) {
unsigned char * vecOffset = vec;
const vec_t swiz1 = {0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23};
const vec_t swiz2 = {8, 9, 10, 11, 24, 25, 26, 27, 12, 13, 14, 15, 28, 29, 30, 31};
const vec_t swiz3 = {0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23};
const vec_t swiz4 = {8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31};
for (int i = 0; i < rows; i += chunk_size) {
const block_q8_0 * rows_base[chunk_size];
for (int r = 0; r < chunk_size; r++) {
rows_base[r] = a + (i + r) * lda;
}
for (int blk = 0; blk < blocks; blk++) {
vector unsigned short hp_res[chunk_size][4];
for (int r = 0; r < chunk_size; r++) {
const block_q8_0 * b = rows_base[r] + blk;
vector float v_scale = vec_extract_fp32_from_shorth(vec_splats(b->d));
vector signed char c[2];
__vector_pair pair = __builtin_vsx_lxvp(0, (__vector_pair *)b->qs);
__builtin_vsx_disassemble_pair(c, & pair);
convert_and_scale_q8(c[0], v_scale, hp_res[r][0], hp_res[r][1]);
convert_and_scale_q8(c[1], v_scale, hp_res[r][2], hp_res[r][3]);
}
for (int col = 0; col < 4; col++) {
if constexpr (chunk_size == 8) {
vec_t t[8];
t[0] = vec_perm((vec_t)hp_res[0][col], (vec_t)hp_res[1][col], swiz1);
t[1] = vec_perm((vec_t)hp_res[0][col], (vec_t)hp_res[1][col], swiz2);
t[2] = vec_perm((vec_t)hp_res[2][col], (vec_t)hp_res[3][col], swiz1);
t[3] = vec_perm((vec_t)hp_res[2][col], (vec_t)hp_res[3][col], swiz2);
t[4] = vec_perm((vec_t)hp_res[4][col], (vec_t)hp_res[5][col], swiz1);
t[5] = vec_perm((vec_t)hp_res[4][col], (vec_t)hp_res[5][col], swiz2);
t[6] = vec_perm((vec_t)hp_res[6][col], (vec_t)hp_res[7][col], swiz1);
t[7] = vec_perm((vec_t)hp_res[6][col], (vec_t)hp_res[7][col], swiz2);
vec_xst(vec_perm(t[0], t[2], swiz3), 0, (vec_t *)(vecOffset + 0));
vec_xst(vec_perm(t[0], t[2], swiz4), 0, (vec_t *)(vecOffset + 16));
vec_xst(vec_perm(t[1], t[3], swiz3), 0, (vec_t *)(vecOffset + 32));
vec_xst(vec_perm(t[1], t[3], swiz4), 0, (vec_t *)(vecOffset + 48));
vec_xst(vec_perm(t[4], t[6], swiz3), 0, (vec_t *)(vecOffset + 64));
vec_xst(vec_perm(t[4], t[6], swiz4), 0, (vec_t *)(vecOffset + 80));
vec_xst(vec_perm(t[5], t[7], swiz3), 0, (vec_t *)(vecOffset + 96));
vec_xst(vec_perm(t[5], t[7], swiz4), 0, (vec_t *)(vecOffset + 112));
vecOffset += 128;
} else {
vec_t t0 = vec_perm((vec_t)hp_res[0][col], (vec_t)hp_res[1][col], swiz1);
vec_t t1 = vec_perm((vec_t)hp_res[0][col], (vec_t)hp_res[1][col], swiz2);
vec_t t2 = vec_perm((vec_t)hp_res[2][col], (vec_t)hp_res[3][col], swiz1);
vec_t t3 = vec_perm((vec_t)hp_res[2][col], (vec_t)hp_res[3][col], swiz2);
vec_xst(vec_perm(t0, t2, swiz3), 0, (vec_t *)(vecOffset + 0));
vec_xst(vec_perm(t0, t2, swiz4), 0, (vec_t *)(vecOffset + 16));
vec_xst(vec_perm(t1, t3, swiz3), 0, (vec_t *)(vecOffset + 32));
vec_xst(vec_perm(t1, t3, swiz4), 0, (vec_t *)(vecOffset + 48));
vecOffset += 64;
}
}
}
}
}
void packNormal_q8_fp16(const block_q8_0 * a, int64_t lda, int rows, int blocks, unsigned char * vec) {
if (rows == 4) {
pack_q8_block<4>(a, lda, rows, blocks, vec);
} else {
pack_q8_block<8>(a, lda, rows, blocks, vec);
}
}
template<int size>
void packNormalInt4(const TA * a, int64_t lda, int rows, int cols, int8_t * vec, std::array<int, size> & comparray) {
int64_t i, j;
TA *aoffset = NULL;
int8_t *vecOffset = NULL;
TA *aoffset1 = NULL, *aoffset2 = NULL, *aoffset3 = NULL, *aoffset4 = NULL;
TA *aoffset5 = NULL, *aoffset6 = NULL, *aoffset7 = NULL, *aoffset8 = NULL;
TA * aoffset = NULL;
int8_t * vecOffset = NULL;
TA * aoffset1 = NULL, * aoffset2 = NULL, * aoffset3 = NULL, * aoffset4 = NULL;
TA * aoffset5 = NULL, * aoffset6 = NULL, * aoffset7 = NULL, * aoffset8 = NULL;
vector signed char c1[2] = {0}, c2[2] = {0}, c3[2] = {0}, c4[2] = {0};
vector signed char c5[2] = {0}, c6[2] = {0}, c7[2] = {0}, c8[2] = {0};
aoffset = const_cast<TA*>(a);
aoffset = const_cast<TA *>(a);
vecOffset = vec;
j = (rows >> 3);
if (j > 0) {
@@ -2363,18 +2620,18 @@ class tinyBLAS_HP16_PPC {
c7[1] = reinterpret_cast<vector signed char>(vec_xl(0, aoffset7->qs));
c8[1] = reinterpret_cast<vector signed char>(vec_xl(0, aoffset8->qs));
process_q4_elements(c1, &comparray[0]);
process_q4_elements(c2, &comparray[1]);
process_q4_elements(c3, &comparray[2]);
process_q4_elements(c4, &comparray[3]);
process_q4_elements(c5, &comparray[4]);
process_q4_elements(c6, &comparray[5]);
process_q4_elements(c7, &comparray[6]);
process_q4_elements(c8, &comparray[7]);
process_q4_elements(c1, & comparray[0]);
process_q4_elements(c2, & comparray[1]);
process_q4_elements(c3, & comparray[2]);
process_q4_elements(c4, & comparray[3]);
process_q4_elements(c5, & comparray[4]);
process_q4_elements(c6, & comparray[5]);
process_q4_elements(c7, & comparray[6]);
process_q4_elements(c8, & comparray[7]);
vector_permute_store<int8_t, vector signed char>(c1[0], c2[0], c3[0], c4[0], vecOffset, false);
vector_permute_store<int8_t, vector signed char>(c1[1], c2[1], c3[1], c4[1], vecOffset+64, false);
vector_permute_store<int8_t, vector signed char>(c5[0], c6[0], c7[0], c8[0], vecOffset+128, false);
vector_permute_store<int8_t, vector signed char>(c5[1], c6[1], c7[1], c8[1], vecOffset+192, false);
vector_permute_store<int8_t, vector signed char>(c1[1], c2[1], c3[1], c4[1], vecOffset + 64, false);
vector_permute_store<int8_t, vector signed char>(c5[0], c6[0], c7[0], c8[0], vecOffset + 128, false);
vector_permute_store<int8_t, vector signed char>(c5[1], c6[1], c7[1], c8[1], vecOffset + 192, false);
aoffset1 += lda;
aoffset2 += lda;
aoffset3 += lda;
@@ -2405,12 +2662,12 @@ class tinyBLAS_HP16_PPC {
c3[1] = reinterpret_cast<vector signed char>(vec_xl(0, aoffset3->qs));
c4[1] = reinterpret_cast<vector signed char>(vec_xl(0, aoffset4->qs));
process_q4_elements(c1, &comparray[0]);
process_q4_elements(c2, &comparray[1]);
process_q4_elements(c3, &comparray[2]);
process_q4_elements(c4, &comparray[3]);
process_q4_elements(c1, & comparray[0]);
process_q4_elements(c2, & comparray[1]);
process_q4_elements(c3, & comparray[2]);
process_q4_elements(c4, & comparray[3]);
vector_permute_store<int8_t, vector signed char>(c1[0], c2[0], c3[0], c4[0], vecOffset, false);
vector_permute_store<int8_t, vector signed char>(c1[1], c2[1], c3[1], c4[1], vecOffset+64, false);
vector_permute_store<int8_t, vector signed char>(c1[1], c2[1], c3[1], c4[1], vecOffset + 64, false);
aoffset1 += lda;
aoffset2 += lda;
aoffset3 += lda;
@@ -2434,12 +2691,12 @@ class tinyBLAS_HP16_PPC {
case 1: c1[1] = reinterpret_cast<vector signed char>(vec_xl(0, aoffset1->qs));
break;
}
process_q4_elements(c1, &comparray[0]);
process_q4_elements(c2, &comparray[1]);
process_q4_elements(c3, &comparray[2]);
process_q4_elements(c4, &comparray[3]);
process_q4_elements(c1, & comparray[0]);
process_q4_elements(c2, & comparray[1]);
process_q4_elements(c3, & comparray[2]);
process_q4_elements(c4, & comparray[3]);
vector_permute_store<int8_t, vector signed char>(c1[0], c2[0], c3[0], c4[0], vecOffset, false);
vector_permute_store<int8_t, vector signed char>(c1[1], c2[1], c3[1], c4[1], vecOffset+64, false);
vector_permute_store<int8_t, vector signed char>(c1[1], c2[1], c3[1], c4[1], vecOffset + 64, false);
aoffset1 += lda;
aoffset2 += lda;
aoffset3 += lda;
@@ -2450,39 +2707,38 @@ class tinyBLAS_HP16_PPC {
}
}
template<typename TA>
template<typename VA, typename VB>
void tinyBLAS_Q0_PPC<TA>::packNormal(const block_q8_0* a, int64_t lda, int rows, int cols, VA* vec, bool flip) {
void packNormal(const block_q8_0 * a, int64_t lda, int rows, int cols, VA * vec, bool flip) {
int64_t i, j;
block_q8_0 *aoffset = NULL;
VA *vecOffset = NULL;
block_q8_0* aoffsets[8];
block_q8_0 * aoffset = NULL;
VA * vecOffset = NULL;
block_q8_0 * aoffsets[8];
__vector_pair arr[8];
VB c[8][2] = {0};
VB c1[8] = {0}; VB c2[8] = {0};
aoffset = const_cast<block_q8_0*>(a);
aoffset = const_cast<block_q8_0 *>(a);
vecOffset = vec;
j = (rows >> 3);
if (j > 0) {
do {
aoffsets[0] = aoffset;
for (int it = 1; it < 8; it++)
aoffsets[it] = aoffsets[it-1] + lda;
aoffsets[it] = aoffsets[it - 1] + lda;
aoffset += 8 * lda;
i = (cols >> 3);
if (i > 0) {
do {
for (int it = 0; it < 8; it++) {
arr[it] = __builtin_vsx_lxvp(0, (__vector_pair*)aoffsets[it]->qs);
__builtin_vsx_disassemble_pair(c[it], &arr[it]);
arr[it] = __builtin_vsx_lxvp(0, (__vector_pair *)aoffsets[it]->qs);
__builtin_vsx_disassemble_pair(c[it], & arr[it]);
c1[it] = c[it][0];
c2[it] = c[it][1];
}
vector_permute_store<VA, VB>(c1[0], c1[1], c1[2], c1[3], vecOffset, flip);
vector_permute_store<VA, VB>(c2[0], c2[1], c2[2], c2[3], vecOffset+64, flip);
vector_permute_store<VA, VB>(c1[4], c1[5], c1[6], c1[7], vecOffset+128, flip);
vector_permute_store<VA, VB>(c2[4], c2[5], c2[6], c2[7], vecOffset+192, flip);
vector_permute_store<VA, VB>(c2[0], c2[1], c2[2], c2[3], vecOffset + 64, flip);
vector_permute_store<VA, VB>(c1[4], c1[5], c1[6], c1[7], vecOffset + 128, flip);
vector_permute_store<VA, VB>(c2[4], c2[5], c2[6], c2[7], vecOffset + 192, flip);
for (int it = 0; it < 8; it++)
aoffsets[it] += lda;
vecOffset += 256;
@@ -2501,13 +2757,13 @@ class tinyBLAS_HP16_PPC {
if (i > 0) {
do {
for (int it = 0; it < 4; it++) {
arr[it] = __builtin_vsx_lxvp(0, (__vector_pair*)aoffsets[it]->qs);
__builtin_vsx_disassemble_pair(c[it], &arr[it]);
arr[it] = __builtin_vsx_lxvp(0, (__vector_pair *)aoffsets[it]->qs);
__builtin_vsx_disassemble_pair(c[it], & arr[it]);
c1[it] = c[it][0];
c2[it] = c[it][1];
}
vector_permute_store<VA, VB>(c1[0], c1[1], c1[2], c1[3], vecOffset, flip);
vector_permute_store<VA, VB>(c2[0], c2[1], c2[2], c2[3], vecOffset+64, flip);
vector_permute_store<VA, VB>(c2[0], c2[1], c2[2], c2[3], vecOffset + 64, flip);
for (int it = 0; it < 4; it++) {
aoffsets[it] += lda;
}
@@ -2520,24 +2776,24 @@ class tinyBLAS_HP16_PPC {
if (rows & 3) {
aoffsets[0] = aoffset;
for (int it = 1; it < 3; it++ )
aoffsets[it] = aoffsets[it-1] + lda;
aoffsets[it] = aoffsets[it - 1] + lda;
i = (cols >> 3);
if (i > 0) {
do {
switch(rows) {
case 3: arr[2] = __builtin_vsx_lxvp(0, (__vector_pair*)aoffsets[2]->qs);
__builtin_vsx_disassemble_pair(c[2], &arr[2]);
case 3: arr[2] = __builtin_vsx_lxvp(0, (__vector_pair *)aoffsets[2]->qs);
__builtin_vsx_disassemble_pair(c[2], & arr[2]);
c1[2] = c[2][0]; c2[2] = c[2][1];
case 2: arr[1] = __builtin_vsx_lxvp(0, (__vector_pair*)aoffsets[1]->qs);
__builtin_vsx_disassemble_pair(c[1], &arr[1]);
case 2: arr[1] = __builtin_vsx_lxvp(0, (__vector_pair *)aoffsets[1]->qs);
__builtin_vsx_disassemble_pair(c[1], & arr[1]);
c1[1] = c[1][0]; c2[1] = c[1][1];
case 1: arr[0] = __builtin_vsx_lxvp(0, (__vector_pair*)aoffsets[0]->qs);
__builtin_vsx_disassemble_pair(c[0], &arr[0]);
case 1: arr[0] = __builtin_vsx_lxvp(0, (__vector_pair *)aoffsets[0]->qs);
__builtin_vsx_disassemble_pair(c[0], & arr[0]);
c1[0] = c[0][0]; c2[0] = c[0][1];
break;
}
vector_permute_store<VA, VB>(c1[0], c1[1], c1[2], c1[3], vecOffset, flip);
vector_permute_store<VA, VB>(c2[0], c2[1], c2[2], c2[3], vecOffset+64, flip);
vector_permute_store<VA, VB>(c2[0], c2[1], c2[2], c2[3], vecOffset + 64, flip);
for (int it = 0; it < 3; it++)
aoffsets[it] += lda;
vecOffset += 128;
@@ -2547,8 +2803,7 @@ class tinyBLAS_HP16_PPC {
}
}
template<typename TA>
void tinyBLAS_Q0_PPC<TA>::mnpack(int64_t m0, int64_t m, int64_t n0, int64_t n) {
void mnpack(int64_t m0, int64_t m, int64_t n0, int64_t n) {
int m_rem = MIN(m - m0, 16);
int n_rem = MIN(n - n0, 16);
@@ -2585,8 +2840,7 @@ class tinyBLAS_HP16_PPC {
}
template<typename TA>
void tinyBLAS_Q0_PPC<TA>::KERNEL_4x8(int64_t ii, int64_t jj) {
void KERNEL_4x8(int64_t ii, int64_t jj) {
vec_t vec_A[8], vec_B[16] = {0};
acc_t acc_0, acc_1;
std::array<int, 4> comparray {};
@@ -2594,26 +2848,26 @@ class tinyBLAS_HP16_PPC {
vector float vs[8] = {0};
bool isAblock_q4 = std::is_same_v<TA, block_q4_0>;
for (int l = 0; l < k; l++) {
__builtin_mma_xxsetaccz(&acc_0);
__builtin_mma_xxsetaccz(&acc_1);
__builtin_mma_xxsetaccz(& acc_0);
__builtin_mma_xxsetaccz(& acc_1);
if (std::is_same_v<TA, block_q4_0>) {
packNormalInt4<4>((A+(ii*lda)+l), lda, 4, 4, (int8_t*)vec_A, comparray);
packNormalInt4<4>((A + (ii * lda) + l), lda, 4, 4, (int8_t *)vec_A, comparray);
} else {
packNormal<int8_t, vector signed char>((const block_q8_0*)(A+(ii*lda)+l), lda, 4, 8, (int8_t*)vec_A, false);
packNormal<int8_t, vector signed char>((const block_q8_0 *)(A + (ii * lda) + l), lda, 4, 8, (int8_t *)vec_A, false);
}
packNormal<uint8_t, vector unsigned char>((B+(jj*ldb)+l), ldb, 8, 8, (uint8_t*)vec_B, true);
packNormal<uint8_t, vector unsigned char>((B + (jj * ldb) + l), ldb, 8, 8, (uint8_t *)vec_B, true);
for(int x = 0; x < 8; x++) {
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(&acc_1, vec_A[x], vec_B[x+8]);
__builtin_mma_xvi8ger4pp(& acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(& acc_1, vec_A[x], vec_B[x+8]);
}
for (int I = 0; I<4; I++) {
for (int J = 0; J<4; J++) {
*((float*)&vs[I]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J)*ldb)+l)->d));
*((float*)&vs[I+4]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J+4)*ldb)+l)->d));
*((float *)& vs[I] + J) = (unhalf((A + ((ii + I) * lda) + l)->d) * unhalf((B + ((jj + J) * ldb) + l)->d));
*((float *)& vs[I + 4] + J) = (unhalf((A +((ii + I) * lda) + l)->d) * unhalf((B + ((jj + J + 4) * ldb) + l)->d));
}
}
if (!isAblock_q4) {
auto aoffset = A+(ii*lda)+l;
auto aoffset = A + (ii * lda) + l;
for (int i = 0; i < 4; i++) {
comparray[i] = 0;
int ca = 0;
@@ -2624,15 +2878,14 @@ class tinyBLAS_HP16_PPC {
aoffset += lda;
}
}
compute(&acc_0, 0, 0, comparray, vs, fin_res);
compute(&acc_1, 0, 4, comparray, vs, fin_res);
compute(& acc_0, 0, 0, comparray, vs, fin_res);
compute(& acc_1, 0, 4, comparray, vs, fin_res);
}
save_res(ii, jj, 0, fin_res);
save_res(ii, jj+4, 4, fin_res);
save_res(ii, jj + 4, 4, fin_res);
}
template<typename TA>
void tinyBLAS_Q0_PPC<TA>::KERNEL_8x4(int64_t ii, int64_t jj) {
void KERNEL_8x4(int64_t ii, int64_t jj) {
vec_t vec_A[16], vec_B[8] = {0};
acc_t acc_0, acc_1;
std::array<int, 8> comparray {};
@@ -2640,25 +2893,25 @@ class tinyBLAS_HP16_PPC {
vector float vs[8] = {0};
bool isAblock_q4 = std::is_same_v<TA, block_q4_0>;
for (int l = 0; l < k; l++) {
__builtin_mma_xxsetaccz(&acc_0);
__builtin_mma_xxsetaccz(&acc_1);
__builtin_mma_xxsetaccz(& acc_0);
__builtin_mma_xxsetaccz(& acc_1);
if (std::is_same_v<TA, block_q4_0>) {
packNormalInt4<8>((A+(ii*lda)+l), lda, 8, 4, (int8_t*)vec_A, comparray);
packNormalInt4<8>((A + (ii * lda) + l), lda, 8, 4, (int8_t *)vec_A, comparray);
} else {
packNormal<int8_t, vector signed char>((const block_q8_0*)(A+(ii*lda)+l), lda, 8, 8, (int8_t*)vec_A, false);
packNormal<int8_t, vector signed char>((const block_q8_0 *)(A + (ii * lda) + l), lda, 8, 8, (int8_t *)vec_A, false);
}
packNormal<uint8_t, vector unsigned char>((B+(jj*ldb)+l), ldb, 4, 8, (uint8_t*)vec_B, true);
packNormal<uint8_t, vector unsigned char>((B + (jj * ldb) + l), ldb, 4, 8, (uint8_t *)vec_B, true);
for(int x = 0; x < 8; x++) {
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(&acc_1, vec_A[x+8], vec_B[x]);
__builtin_mma_xvi8ger4pp(& acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(& acc_1, vec_A[x + 8], vec_B[x]);
}
for (int I = 0; I<8; I++) {
for (int J = 0; J<4; J++) {
*((float*)&vs[I]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J)*ldb)+l)->d));
for (int I = 0; I < 8; I++) {
for (int J = 0; J < 4; J++) {
*((float *)&vs[I] + J) = (unhalf((A + ((ii + I) * lda) + l)->d) * unhalf((B + ((jj + J) * ldb) + l)->d));
}
}
if (!isAblock_q4) {
auto aoffset = A+(ii*lda)+l;
auto aoffset = A + (ii * lda) + l;
for (int i = 0; i < 8; i++) {
comparray[i] = 0;
int ca = 0;
@@ -2669,15 +2922,14 @@ class tinyBLAS_HP16_PPC {
aoffset += lda;
}
}
compute(&acc_0, 0, 0, comparray, vs, fin_res);
compute(&acc_1, 4, 4, comparray, vs, fin_res);
compute(& acc_0, 0, 0, comparray, vs, fin_res);
compute(& acc_1, 4, 4, comparray, vs, fin_res);
}
save_res(ii, jj, 0, fin_res);
save_res(ii+4, jj, 4, fin_res);
save_res(ii + 4, jj, 4, fin_res);
}
template<typename TA>
void tinyBLAS_Q0_PPC<TA>::KERNEL_8x8(int64_t ii, int64_t jj) {
void KERNEL_8x8(int64_t ii, int64_t jj) {
vec_t vec_A[16], vec_B[16] = {0};
acc_t acc_0, acc_1, acc_2, acc_3;
acc_t acc_4, acc_5, acc_6, acc_7;
@@ -2686,30 +2938,30 @@ class tinyBLAS_HP16_PPC {
vector float vs[16] = {0};
bool isAblock_q4 = std::is_same_v<TA, block_q4_0>;
for (int l = 0; l < k; l++) {
__builtin_mma_xxsetaccz(&acc_0);
__builtin_mma_xxsetaccz(&acc_1);
__builtin_mma_xxsetaccz(&acc_2);
__builtin_mma_xxsetaccz(&acc_3);
__builtin_mma_xxsetaccz(& acc_0);
__builtin_mma_xxsetaccz(& acc_1);
__builtin_mma_xxsetaccz(& acc_2);
__builtin_mma_xxsetaccz(& acc_3);
if (std::is_same_v<TA, block_q4_0>) {
packNormalInt4<8>((A+(ii*lda)+l), lda, 8, 4, (int8_t*)vec_A, comparray);
packNormalInt4<8>((A + (ii * lda) + l), lda, 8, 4, (int8_t *)vec_A, comparray);
} else {
packNormal<int8_t, vector signed char>((const block_q8_0*)(A+(ii*lda)+l), lda, 8, 8, (int8_t*)vec_A, false);
packNormal<int8_t, vector signed char>((const block_q8_0 *)(A + (ii * lda) + l), lda, 8, 8, (int8_t *)vec_A, false);
}
packNormal<uint8_t, vector unsigned char>((B+(jj*ldb)+l), ldb, 8, 8, (uint8_t*)vec_B, true);
packNormal<uint8_t, vector unsigned char>((B + (jj * ldb) + l), ldb, 8, 8, (uint8_t *)vec_B, true);
for(int x = 0; x < 8; x++) {
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(&acc_1, vec_A[x+8], vec_B[x]);
__builtin_mma_xvi8ger4pp(&acc_2, vec_A[x], vec_B[x+8]);
__builtin_mma_xvi8ger4pp(&acc_3, vec_A[x+8], vec_B[x+8]);
__builtin_mma_xvi8ger4pp(& acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(& acc_1, vec_A[x + 8], vec_B[x]);
__builtin_mma_xvi8ger4pp(& acc_2, vec_A[x], vec_B[x + 8]);
__builtin_mma_xvi8ger4pp(& acc_3, vec_A[x + 8], vec_B[x + 8]);
}
for (int I = 0; I<8; I++) {
for (int J = 0; J<4; J++) {
*((float*)&vs[I]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J)*ldb)+l)->d));
*((float*)&vs[I+8]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J+4)*ldb)+l)->d));
for (int I = 0; I < 8 ; I++) {
for (int J = 0; J < 4; J++) {
*((float *)& vs[I] + J) = (unhalf((A + ((ii + I) * lda) + l)->d) * unhalf((B + ((jj + J) * ldb) + l)->d));
*((float *)& vs[I + 8] + J) = (unhalf((A + ((ii + I) * lda) + l)->d) * unhalf((B + ((jj + J + 4) * ldb) + l)->d));
}
}
if (!isAblock_q4) {
auto aoffset = A+(ii*lda)+l;
auto aoffset = A + (ii * lda) + l;
for (int i = 0; i < 8; i++) {
comparray[i] = 0;
int ca = 0;
@@ -2720,19 +2972,99 @@ class tinyBLAS_HP16_PPC {
aoffset += lda;
}
}
compute(&acc_0, 0, 0, comparray, vs, fin_res);
compute(&acc_1, 4, 4, comparray, vs, fin_res);
compute(&acc_2, 0, 8, comparray, vs, fin_res);
compute(&acc_3, 4, 12, comparray, vs, fin_res);
compute(& acc_0, 0, 0, comparray, vs, fin_res);
compute(& acc_1, 4, 4, comparray, vs, fin_res);
compute(& acc_2, 0, 8, comparray, vs, fin_res);
compute(& acc_3, 4, 12, comparray, vs, fin_res);
}
save_res(ii, jj, 0, fin_res);
save_res(ii+4, jj, 4, fin_res);
save_res(ii, jj+4, 8, fin_res);
save_res(ii+4, jj+4, 12, fin_res);
save_res(ii + 4, jj, 4, fin_res);
save_res(ii, jj + 4, 8, fin_res);
save_res(ii + 4, jj + 4, 12, fin_res);
}
template<typename TA>
void tinyBLAS_Q0_PPC<TA>::gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n, int RM, int RN) {
void KERNEL_Q0(int64_t ii, int64_t jj, int64_t mc, int64_t nc, int64_t kc, int64_t l, vec_t * vec_A, vec_t * vec_B) {
acc_t acc[8];
for (int i = 0; i < mc ; i += 16) {
for (int j = 0; j < nc; j += 8) {
int A0_base = (i / 16) * (2 * 32 * kc);
int B0_base = (j / 8) * (32 * kc);
for (int x = 0; x < 8; x++) {
__builtin_mma_xxsetaccz(&acc[x]);
}
for (int64_t kk = 0; kk < kc; kk++) {
int A0_block_idx = A0_base + kk * 32;
int B0_block_idx = B0_base + kk * 32;
int A1_block_idx = A0_block_idx + 32 * kc;
int B1_block_idx = B0_block_idx + 32 * kc;
vec_t * A0_block = & vec_A[A0_block_idx];
vec_t * B0_block = & vec_B[B0_block_idx];
vec_t * A1_block = & vec_A[A1_block_idx];
for (int it = 0; it < 4; it++) {
for (int x = 0; x < 4; x++) {
__builtin_mma_xvf16ger2pp(& acc[0], A0_block[8 * it + x], B0_block[8 * it + x]);
__builtin_mma_xvf16ger2pp(& acc[1], A0_block[8 * it + x], B0_block[8 * it + x + 4]);
__builtin_mma_xvf16ger2pp(& acc[2], A0_block[8 * it + x + 4], B0_block[8 * it + x]);
__builtin_mma_xvf16ger2pp(& acc[3], A0_block[8 * it + x + 4], B0_block[8 * it + x + 4]);
__builtin_mma_xvf16ger2pp(& acc[4], A1_block[8 * it + x], B0_block[8 * it + x]);
__builtin_mma_xvf16ger2pp(& acc[5], A1_block[8 * it + x], B0_block[8 * it+ x + 4]);
__builtin_mma_xvf16ger2pp(& acc[6], A1_block[8 * it + x + 4], B0_block[8 * it + x]);
__builtin_mma_xvf16ger2pp(& acc[7], A1_block[8 * it + x + 4], B0_block[8 * it + x + 4]);
}
}
}
if (l == 0) {
save_acc(& acc[0], ii + i, jj + j);
save_acc(& acc[1], ii + i, jj + j + 4);
save_acc(& acc[2], ii + i + 4, jj + j);
save_acc(& acc[3], ii + i + 4, jj + j + 4);
save_acc(& acc[4], ii + i + 8, jj + j);
save_acc(& acc[5], ii + i + 8, jj + j + 4);
save_acc(& acc[6], ii + i + 12, jj + j);
save_acc(& acc[7], ii + i + 12, jj + j + 4);
} else {
add_save_acc(& acc[0], ii + i, jj + j);
add_save_acc(& acc[1], ii + i, jj + j + 4);
add_save_acc(& acc[2], ii + i + 4, jj + j);
add_save_acc(& acc[3], ii + i + 4, jj + j + 4);
add_save_acc(& acc[4], ii + i + 8, jj + j);
add_save_acc(& acc[5], ii + i + 8, jj + j + 4);
add_save_acc(& acc[6], ii + i + 12, jj + j);
add_save_acc(& acc[7], ii + i + 12, jj + j + 4);
}
}
}
}
void matmul_tiled(int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
vec_t A_pack[mc * kc * 4];
vec_t B_pack[nc * kc * 4];
constexpr bool is_Ablock_q4 = std::is_same_v<TA, block_q4_0>;
int64_t ytiles = m / mc;
int64_t xtiles = n / nc;
int64_t tiles = xtiles * ytiles;
int64_t duty = (tiles + nth - 1) / nth;
int64_t start = duty * ith;
int64_t end = start + duty;
if (end > tiles) {
end = tiles;
}
for (int64_t job = start; job < end; ++job) {
int64_t ii = (job / xtiles) * mc;
int64_t jj = (job % xtiles) * nc;
for (int64_t kk = 0; kk < k; kk += kc) {
if constexpr(is_Ablock_q4) {
packNormal_q4_fp16(A + ii * lda + kk, lda, mc, kc, (uint8_t *)A_pack);
} else {
packNormal_q8_fp16(A + ii * lda + kk, lda, mc, kc, (uint8_t *)A_pack);
}
packNormal_q8_fp16(B + jj * ldb + kk, ldb, nc, kc, (uint8_t *)B_pack);
KERNEL_Q0(ii, jj, mc, nc, kc, kk, A_pack, B_pack);
}
}
}
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n, int RM, int RN) {
int64_t ytiles = (m - m0) / RM;
int64_t xtiles = (n - n0) / RN;
int64_t tiles = xtiles * ytiles;
@@ -2754,32 +3086,32 @@ class tinyBLAS_HP16_PPC {
vector float fin_res[4] = {0};
vector float vs[4] = {0};
vector float CA[4] = {0};
__builtin_prefetch((A+(ii*lda)+0)->qs, 0, 1); // prefetch first value
__builtin_prefetch((B+(jj*ldb)+0)->qs, 0, 1); // prefetch first value
__builtin_prefetch((A + (ii * lda) + 0)->qs, 0, 1); // prefetch first value
__builtin_prefetch((B + (jj * ldb) + 0)->qs, 0, 1); // prefetch first value
for (int l = 0; l < k; l++) {
__builtin_prefetch((A+(ii*lda)+(l+1))->qs, 0, 1); // prefetch one loop ahead
__builtin_prefetch((B+(jj*ldb)+(l+1))->qs, 0, 1); // prefetch one loop ahead
__builtin_mma_xxsetaccz(&acc_0);
__builtin_prefetch((A + (ii * lda) + (l + 1))->qs, 0, 1); // prefetch one loop ahead
__builtin_prefetch((B + (jj * ldb) + (l + 1))->qs, 0, 1); // prefetch one loop ahead
__builtin_mma_xxsetaccz(& acc_0);
if (isAblock_q4) {
packNormalInt4<4>((A+(ii*lda)+l), lda, RM, 4, (int8_t*)vec_A, comparray);
packNormalInt4<4>((A + (ii * lda) + l), lda, RM, 4, (int8_t *)vec_A, comparray);
} else {
packNormal<int8_t, vector signed char>((const block_q8_0*)(A+(ii*lda)+l), lda, RM, 8, (int8_t*)vec_A, false);
packNormal<int8_t, vector signed char>((const block_q8_0 *)(A + (ii * lda) + l), lda, RM, 8, (int8_t *)vec_A, false);
}
packNormal<uint8_t, vector unsigned char>((B+(jj*ldb)+l), ldb, RN, 8, (uint8_t*)vec_B, true);
for(int x = 0; x < 8; x+=4) {
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x+1], vec_B[x+1]);
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x+2], vec_B[x+2]);
__builtin_mma_xvi8ger4pp(&acc_0, vec_A[x+3], vec_B[x+3]);
packNormal<uint8_t, vector unsigned char>((B + (jj * ldb) + l), ldb, RN, 8, (uint8_t *)vec_B, true);
for (int x = 0; x < 8; x += 4) {
__builtin_mma_xvi8ger4pp(& acc_0, vec_A[x], vec_B[x]);
__builtin_mma_xvi8ger4pp(& acc_0, vec_A[x + 1], vec_B[x + 1]);
__builtin_mma_xvi8ger4pp(& acc_0, vec_A[x + 2], vec_B[x + 2]);
__builtin_mma_xvi8ger4pp(& acc_0, vec_A[x + 3], vec_B[x + 3]);
}
for (int I = 0; I<RM; I++) {
for (int J = 0; J<RN; J++) {
*((float*)&vs[I]+J) = (unhalf((A+((ii+I)*lda)+l)->d) * unhalf((B+((jj+J)*ldb)+l)->d));
for (int I = 0; I < RM; I++) {
for (int J = 0; J < RN; J++) {
*((float*)&vs[I] + J) = (unhalf((A + ((ii + I) * lda) + l)->d) * unhalf((B + ((jj + J) * ldb) + l)->d));
}
}
__builtin_mma_disassemble_acc(vec_C, &acc_0);
__builtin_mma_disassemble_acc(vec_C, & acc_0);
if (!isAblock_q4) {
auto aoffset = A+(ii*lda)+l;
auto aoffset = A + (ii * lda) + l;
for (int i = 0; i < RM; i++) {
comparray[i] = 0;
int ca = 0;
@@ -2800,9 +3132,21 @@ class tinyBLAS_HP16_PPC {
}
}
template<typename TA>
template<int RM, int RN>
inline void kernel(int64_t ii, int64_t jj) {
if constexpr(RM == 4 && RN == 8) {
KERNEL_4x8(ii,jj);
} else if constexpr(RM == 8 && RN == 4) {
KERNEL_8x4(ii,jj);
} else if constexpr(RM == 8 && RN == 8) {
KERNEL_8x8(ii,jj);
} else {
assert(false && "RN/RM values not supported");
}
}
template <int RM, int RN>
NOINLINE void tinyBLAS_Q0_PPC<TA>::gemm(int64_t m0, int64_t m, int64_t n0, int64_t n) {
NOINLINE void gemm(int64_t m0, int64_t m, int64_t n0, int64_t n) {
int64_t ytiles = (m - m0) / RM;
int64_t xtiles = (n - n0) / RN;
int64_t tiles = xtiles * ytiles;
@@ -2814,12 +3158,20 @@ class tinyBLAS_HP16_PPC {
for (int64_t job = start; job < end; ++job) {
int64_t ii = m0 + job / xtiles * RM;
int64_t jj = n0 + job % xtiles * RN;
this->kernel<RM, RN>(ii, jj);
kernel<RM, RN>(ii, jj);
}
}
template class tinyBLAS_Q0_PPC<block_q4_0>;
template class tinyBLAS_Q0_PPC<block_q8_0>;
const TA * const A;
const block_q8_0 * const B;
float * C;
const int64_t k;
int64_t kc;
const int64_t lda;
const int64_t ldb;
const int64_t ldc;
const int ith;
const int nth;
};
class tinyBLAS_PPC {
public:
+3 -1
View File
@@ -1186,8 +1186,10 @@ static void launch_fattn_tile_switch_ncols2(ggml_backend_cuda_context & ctx, ggm
GGML_ASSERT(Q->ne[2] % K->ne[2] == 0);
const int gqa_ratio = Q->ne[2] / K->ne[2];
// On NVIDIA (Pascal and older) the GQA optimizations seem to be detrimental in some cases.
// However, for DKQ == 576, DV == 512 only the kernel variant with GQA optimizations is implemented.
const bool nvidia = GGML_CUDA_CC_IS_NVIDIA(ggml_cuda_info().devices[ggml_cuda_get_device()].cc);
const int gqa_limit = nvidia && gqa_ratio <= 4 ? 16 : INT_MAX;
const int gqa_limit = nvidia && gqa_ratio <= 4 && DV <= 256 ? 16 : INT_MAX;
const bool use_gqa_opt = mask && max_bias == 0.0f && Q->ne[1] <= gqa_limit && K->ne[1] % FATTN_KQ_STRIDE == 0;
if constexpr (DV == 512) {
+27 -6
View File
@@ -899,7 +899,8 @@ static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT] = {
};
const struct ggml_type_traits * ggml_get_type_traits(enum ggml_type type) {
GGML_ASSERT(type < GGML_TYPE_COUNT);
assert(type >= 0);
assert(type < GGML_TYPE_COUNT);
return &type_traits[type];
}
@@ -1265,27 +1266,33 @@ size_t ggml_nbytes_pad(const struct ggml_tensor * tensor) {
}
int64_t ggml_blck_size(enum ggml_type type) {
assert(type >= 0);
assert(type < GGML_TYPE_COUNT);
return type_traits[type].blck_size;
}
size_t ggml_type_size(enum ggml_type type) {
assert(type >= 0);
assert(type < GGML_TYPE_COUNT);
return type_traits[type].type_size;
}
size_t ggml_row_size(enum ggml_type type, int64_t ne) {
assert(type >= 0);
assert(type < GGML_TYPE_COUNT);
assert(ne % ggml_blck_size(type) == 0);
return ggml_type_size(type)*ne/ggml_blck_size(type);
}
double ggml_type_sizef(enum ggml_type type) {
return ((double)(type_traits[type].type_size))/type_traits[type].blck_size;
}
const char * ggml_type_name(enum ggml_type type) {
return type < GGML_TYPE_COUNT ? type_traits[type].type_name : "NONE";
assert(type >= 0);
assert(type < GGML_TYPE_COUNT);
return type_traits[type].type_name;
}
bool ggml_is_quantized(enum ggml_type type) {
assert(type >= 0);
assert(type < GGML_TYPE_COUNT);
return type_traits[type].is_quantized;
}
@@ -1629,11 +1636,23 @@ static struct ggml_object * ggml_new_object(struct ggml_context * ctx, enum ggml
const size_t cur_end = cur_offs + cur_size;
// align to GGML_MEM_ALIGN
GGML_ASSERT(size <= SIZE_MAX - (GGML_MEM_ALIGN - 1));
size_t size_needed = GGML_PAD(size, GGML_MEM_ALIGN);
char * const mem_buffer = ctx->mem_buffer;
struct ggml_object * const obj_new = (struct ggml_object *)(mem_buffer + cur_end);
// integer overflow checks
if (cur_end > SIZE_MAX - size_needed) {
GGML_LOG_WARN("%s: overflow detected in cur_end (%zu) + size_needed (%zu)\n", __func__, cur_end, size_needed);
return NULL;
}
if (cur_end + size_needed > SIZE_MAX - GGML_OBJECT_SIZE) {
GGML_LOG_WARN("%s: overflow detected in cur_end (%zu) + size_needed (%zu) + GGML_OBJECT_SIZE (%zu)\n", __func__,
cur_end, size_needed, (size_t) GGML_OBJECT_SIZE);
return NULL;
}
if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
GGML_LOG_WARN("%s: not enough space in the context's memory pool (needed %zu, available %zu)\n",
__func__, cur_end + size_needed + GGML_OBJECT_SIZE, ctx->mem_size);
@@ -1702,6 +1721,8 @@ static struct ggml_tensor * ggml_new_tensor_impl(
obj_alloc_size = data_size;
}
GGML_ASSERT(GGML_TENSOR_SIZE <= SIZE_MAX - obj_alloc_size);
struct ggml_object * const obj_new = ggml_new_object(ctx, GGML_OBJECT_TYPE_TENSOR, GGML_TENSOR_SIZE + obj_alloc_size);
GGML_ASSERT(obj_new);
+84 -7
View File
@@ -15,6 +15,9 @@
#include <string>
#include <vector>
#define GGUF_MAX_STRING_LENGTH (1024*1024*1024)
#define GGUF_MAX_ARRAY_ELEMENTS (1024*1024*1024)
template <typename T>
struct type_to_gguf_type;
@@ -228,6 +231,26 @@ struct gguf_reader {
template <typename T>
bool read(std::vector<T> & dst, const size_t n) const {
if (n > GGUF_MAX_ARRAY_ELEMENTS) {
return false;
}
const uint64_t nbytes = nbytes_remain();
if constexpr (std::is_same<T, std::string>::value) {
// strings are prefixed with their length, so we need to account for that
if (n > SIZE_MAX / sizeof(uint64_t)) {
return false;
}
if (nbytes < n * sizeof(uint64_t)) {
return false;
}
} else {
if (n > SIZE_MAX / sizeof(T)) {
return false;
}
if (nbytes < n * sizeof(T)) {
return false;
}
}
dst.resize(n);
for (size_t i = 0; i < dst.size(); ++i) {
if constexpr (std::is_same<T, bool>::value) {
@@ -277,13 +300,43 @@ struct gguf_reader {
if (!read(size)) {
return false;
}
dst.resize(size);
if (size > GGUF_MAX_STRING_LENGTH) {
GGML_LOG_ERROR("%s: string length %" PRIu64 " exceeds maximum %" PRIu64 "\n", __func__, size, (uint64_t) GGUF_MAX_STRING_LENGTH);
return false;
}
const uint64_t nbytes = nbytes_remain();
if (size > nbytes) {
GGML_LOG_ERROR("%s: string length %" PRIu64 " exceeds remaining file size %" PRIu64 " bytes\n", __func__, size, nbytes);
return false;
}
dst.resize(static_cast<size_t>(size));
return fread(dst.data(), 1, dst.length(), file) == dst.length();
}
bool read(void * dst, const size_t size) const {
return fread(dst, 1, size, file) == size;
}
// remaining bytes in the file
uint64_t nbytes_remain() const {
const long cur = ftell(file);
if (cur < 0) {
return 0;
}
if (fseek(file, 0, SEEK_END) != 0) {
fseek(file, cur, SEEK_SET);
return 0;
}
const long end = ftell(file);
if (end < 0) {
fseek(file, cur, SEEK_SET);
return 0;
}
fseek(file, cur, SEEK_SET);
return static_cast<uint64_t>(end - cur);
}
};
struct gguf_context * gguf_init_empty(void) {
@@ -568,8 +621,8 @@ struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_par
// check that tensor type is within defined range
if (info.t.type < 0 || info.t.type >= GGML_TYPE_COUNT) {
GGML_LOG_ERROR("%s: tensor '%s' has invalid ggml type %d (%s)\n",
__func__, info.t.name, info.t.type, ggml_type_name(info.t.type));
GGML_LOG_ERROR("%s: tensor '%s' has invalid ggml type %d. should be in [0, %d)\n",
__func__, info.t.name, info.t.type, GGML_TYPE_COUNT);
ok = false;
break;
}
@@ -657,10 +710,34 @@ struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_par
// the ggml_tensor structs to the appropriate locations in the binary blob
// compute the exact size needed for the new ggml_context
const size_t mem_size =
params.no_alloc ?
(n_tensors )*ggml_tensor_overhead() :
(n_tensors + 1)*ggml_tensor_overhead() + ctx->size;
size_t mem_size = 0;
if (params.no_alloc) {
if (n_tensors != 0 && SIZE_MAX / n_tensors < ggml_tensor_overhead()) {
GGML_LOG_ERROR("%s: memory size overflow while allocating ggml context\n", __func__);
gguf_free(ctx);
return nullptr;
}
const size_t overhead = n_tensors * ggml_tensor_overhead();
mem_size = overhead;
} else {
if ((n_tensors + 1) != 0 && SIZE_MAX / (n_tensors + 1) < ggml_tensor_overhead()) {
GGML_LOG_ERROR("%s: memory size overflow while allocating ggml context\n", __func__);
gguf_free(ctx);
return nullptr;
}
const size_t overhead = (n_tensors + 1) * ggml_tensor_overhead();
if (SIZE_MAX - overhead < ctx->size) {
GGML_LOG_ERROR("%s: memory size overflow while allocating ggml context\n", __func__);
gguf_free(ctx);
return nullptr;
}
mem_size = overhead + ctx->size;
}
struct ggml_init_params pdata = {
/*mem_size =*/ mem_size,
+18
View File
@@ -435,6 +435,7 @@ class MODEL_ARCH(IntEnum):
T5 = auto()
T5ENCODER = auto()
JAIS = auto()
JAIS2 = auto()
NEMOTRON = auto()
NEMOTRON_H = auto()
NEMOTRON_H_MOE = auto()
@@ -652,6 +653,7 @@ class MODEL_TENSOR(IntEnum):
ENC_OUTPUT_NORM = auto()
CLS = auto() # classifier
CLS_OUT = auto() # classifier output projection
CLS_NORM = auto()
CONV1D = auto()
CONVNEXT_DW = auto()
CONVNEXT_NORM = auto()
@@ -873,6 +875,7 @@ MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
MODEL_ARCH.T5: "t5",
MODEL_ARCH.T5ENCODER: "t5encoder",
MODEL_ARCH.JAIS: "jais",
MODEL_ARCH.JAIS2: "jais2",
MODEL_ARCH.NEMOTRON: "nemotron",
MODEL_ARCH.NEMOTRON_H: "nemotron_h",
MODEL_ARCH.NEMOTRON_H_MOE: "nemotron_h_moe",
@@ -1088,6 +1091,7 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
MODEL_TENSOR.ENC_OUTPUT_NORM: "enc.output_norm",
MODEL_TENSOR.CLS: "cls",
MODEL_TENSOR.CLS_OUT: "cls.output",
MODEL_TENSOR.CLS_NORM: "cls.norm",
MODEL_TENSOR.CONV1D: "conv1d",
MODEL_TENSOR.CONVNEXT_DW: "convnext.{bid}.dw",
MODEL_TENSOR.CONVNEXT_NORM: "convnext.{bid}.norm",
@@ -1507,6 +1511,7 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.CLS,
MODEL_TENSOR.CLS_OUT,
MODEL_TENSOR.CLS_NORM,
],
MODEL_ARCH.NOMIC_BERT: [
MODEL_TENSOR.TOKEN_EMBD,
@@ -2814,6 +2819,19 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.JAIS2: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.NEMOTRON: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
+7 -4
View File
@@ -175,6 +175,9 @@ class GGUFReader:
if new_align.types != [GGUFValueType.UINT32]:
raise ValueError('Bad type for general.alignment field')
self.alignment = new_align.parts[-1][0]
# Ensure alignment is a non-zero power of two
if self.alignment == 0 or (self.alignment & (self.alignment - 1)) != 0:
raise ValueError('Invalid alignment: must be a non-zero power of two')
padding = offs % self.alignment
if padding != 0:
offs += self.alignment - padding
@@ -202,11 +205,11 @@ class GGUFReader:
def _push_field(self, field: ReaderField, skip_sum: bool = False) -> int:
if field.name in self.fields:
# TODO: add option to generate error on duplicate keys
# raise KeyError(f'Duplicate {field.name} already in list at offset {field.offset}')
# TODO: add option to make this a warning and accept duplicate keys like below
raise KeyError(f'Duplicate {field.name} already in list at offset {field.offset}')
logger.warning(f'Duplicate key {field.name} at offset {field.offset}')
self.fields[field.name + '_{}'.format(field.offset)] = field
# logger.warning(f'Duplicate key {field.name} at offset {field.offset}')
# self.fields[field.name + '_{}'.format(field.offset)] = field
else:
self.fields[field.name] = field
return 0 if skip_sum else sum(int(part.nbytes) for part in field.parts)
+2
View File
@@ -501,6 +501,8 @@ class GGUFWriter:
self.add_uint32(Keys.General.QUANTIZATION_VERSION, quantization_version)
def add_custom_alignment(self, alignment: int) -> None:
if alignment <= 0 or (alignment & (alignment - 1)) != 0:
raise ValueError('Invalid alignment: must be a non-zero power of two')
self.data_alignment = alignment
self.add_uint32(Keys.General.ALIGNMENT, alignment)
+4
View File
@@ -1240,6 +1240,10 @@ class TensorNameMap:
MODEL_TENSOR.CLS_OUT: (
"classifier.out_proj", # roberta
),
MODEL_TENSOR.CLS_NORM: (
"head.norm", # modern-bert
),
#############################################################################
MODEL_TENSOR.CONVNEXT_DW: (
+1
View File
@@ -84,6 +84,7 @@ add_library(llama
models/hunyuan-moe.cpp
models/internlm2.cpp
models/jais.cpp
models/jais2.cpp
models/jamba.cpp
models/kimi-linear.cpp
models/lfm2.cpp
+18
View File
@@ -79,6 +79,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_T5, "t5" },
{ LLM_ARCH_T5ENCODER, "t5encoder" },
{ LLM_ARCH_JAIS, "jais" },
{ LLM_ARCH_JAIS2, "jais2" },
{ LLM_ARCH_NEMOTRON, "nemotron" },
{ LLM_ARCH_NEMOTRON_H, "nemotron_h" },
{ LLM_ARCH_NEMOTRON_H_MOE, "nemotron_h_moe" },
@@ -367,6 +368,7 @@ static const std::map<llm_tensor, const char *> LLM_TENSOR_NAMES = {
{ LLM_TENSOR_TOKEN_TYPES, "token_types" },
{ LLM_TENSOR_CLS, "cls" },
{ LLM_TENSOR_CLS_OUT, "cls.output" },
{ LLM_TENSOR_CLS_NORM, "cls.norm" },
{ LLM_TENSOR_ENC_OUTPUT_NORM, "enc.output_norm" },
{ LLM_TENSOR_FFN_GATE_INP_SHEXP, "blk.%d.ffn_gate_inp_shexp" },
{ LLM_TENSOR_SSM_A_NOSCAN, "blk.%d.ssm_a" },
@@ -828,6 +830,7 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
LLM_TENSOR_FFN_NORM,
LLM_TENSOR_CLS,
LLM_TENSOR_CLS_OUT,
LLM_TENSOR_CLS_NORM,
};
case LLM_ARCH_JINA_BERT_V2:
return {
@@ -1789,6 +1792,20 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
LLM_TENSOR_FFN_GATE,
LLM_TENSOR_FFN_DOWN,
};
case LLM_ARCH_JAIS2:
return {
LLM_TENSOR_TOKEN_EMBD,
LLM_TENSOR_OUTPUT_NORM,
LLM_TENSOR_OUTPUT,
LLM_TENSOR_ATTN_NORM,
LLM_TENSOR_ATTN_Q,
LLM_TENSOR_ATTN_K,
LLM_TENSOR_ATTN_V,
LLM_TENSOR_ATTN_OUT,
LLM_TENSOR_FFN_NORM,
LLM_TENSOR_FFN_UP,
LLM_TENSOR_FFN_DOWN,
};
case LLM_ARCH_NEMOTRON_H:
return {
LLM_TENSOR_TOKEN_EMBD,
@@ -2518,6 +2535,7 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
{LLM_TENSOR_OUTPUT, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}},
{LLM_TENSOR_CLS, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}},
{LLM_TENSOR_CLS_OUT, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}},
{LLM_TENSOR_CLS_NORM, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL}},
{LLM_TENSOR_DENSE_2_OUT, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}}, // Dense layer output
{LLM_TENSOR_DENSE_3_OUT, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}}, // Dense layer output
{LLM_TENSOR_OUTPUT_NORM, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL}},
+2
View File
@@ -83,6 +83,7 @@ enum llm_arch {
LLM_ARCH_T5,
LLM_ARCH_T5ENCODER,
LLM_ARCH_JAIS,
LLM_ARCH_JAIS2,
LLM_ARCH_NEMOTRON,
LLM_ARCH_NEMOTRON_H,
LLM_ARCH_NEMOTRON_H_MOE,
@@ -497,6 +498,7 @@ enum llm_tensor {
LLM_TENSOR_ENC_OUTPUT_NORM,
LLM_TENSOR_CLS,
LLM_TENSOR_CLS_OUT,
LLM_TENSOR_CLS_NORM,
LLM_TENSOR_CONV1D,
LLM_TENSOR_CONVNEXT_DW,
LLM_TENSOR_CONVNEXT_NORM,
+3 -44
View File
@@ -712,8 +712,6 @@ int64_t llama_context::output_resolve_row(int32_t i) const {
}
float * llama_context::get_logits_ith(int32_t i) {
int64_t j = -1;
output_reorder();
try {
@@ -721,26 +719,7 @@ float * llama_context::get_logits_ith(int32_t i) {
throw std::runtime_error("no logits");
}
// TODO: use output_resolve_row()
if (i < 0) {
j = n_outputs + i;
if (j < 0) {
throw std::runtime_error(format("negative index out of range [0, %d)", n_outputs));
}
} else if ((size_t) i >= output_ids.size()) {
throw std::runtime_error(format("out of range [0, %zu)", output_ids.size()));
} else {
j = output_ids[i];
}
if (j < 0) {
throw std::runtime_error(format("batch.logits[%d] != true", i));
}
if (j >= n_outputs) {
// This should not happen
throw std::runtime_error(format("corrupt output buffer (j=%" PRId64 ", n_outputs=%d)", j, n_outputs));
}
const int64_t j = output_resolve_row(i);
return logits.data + j*model.vocab.n_tokens();
} catch (const std::exception & err) {
LLAMA_LOG_ERROR("%s: invalid logits id %d, reason: %s\n", __func__, i, err.what());
@@ -763,8 +742,6 @@ llama_token * llama_context::get_sampled_tokens() const{
}
float * llama_context::get_embeddings_ith(int32_t i) {
int64_t j = -1;
output_reorder();
try {
@@ -772,26 +749,7 @@ float * llama_context::get_embeddings_ith(int32_t i) {
throw std::runtime_error("no embeddings");
}
// TODO: use output_resolve_row()
if (i < 0) {
j = n_outputs + i;
if (j < 0) {
throw std::runtime_error(format("negative index out of range [0, %d)", n_outputs));
}
} else if ((size_t) i >= output_ids.size()) {
throw std::runtime_error(format("out of range [0, %zu)", output_ids.size()));
} else {
j = output_ids[i];
}
if (j < 0) {
throw std::runtime_error(format("batch.logits[%d] != true", i));
}
if (j >= n_outputs) {
// This should not happen
throw std::runtime_error(format("corrupt output buffer (j=%" PRId64 ", n_outputs=%d)", j, n_outputs));
}
const int64_t j = output_resolve_row(i);
const uint32_t n_embd_out = model.hparams.n_embd_out();
return embd.data + j*n_embd_out;
} catch (const std::exception & err) {
@@ -2761,6 +2719,7 @@ void llama_context::opt_init(struct llama_model * model, struct llama_opt_params
llama_set_param(model->cls_b, param_filter, param_filter_ud);
llama_set_param(model->cls_out, param_filter, param_filter_ud);
llama_set_param(model->cls_out_b, param_filter, param_filter_ud);
llama_set_param(model->cls_norm, param_filter, param_filter_ud);
for (struct llama_layer & layer : model->layers) {
for (size_t i = 0; i < sizeof(layer)/sizeof(struct ggml_tensor *); ++i) {
+35 -11
View File
@@ -185,7 +185,10 @@ bool llm_graph_input_out_ids::can_reuse(const llm_graph_params & params) {
}
void llm_graph_input_mean::set_input(const llama_ubatch * ubatch) {
if (cparams.embeddings && cparams.pooling_type == LLAMA_POOLING_TYPE_MEAN) {
if (cparams.embeddings &&
(cparams.pooling_type == LLAMA_POOLING_TYPE_MEAN ||
cparams.pooling_type == LLAMA_POOLING_TYPE_RANK )) {
const int64_t n_tokens = ubatch->n_tokens;
const int64_t n_seq_tokens = ubatch->n_seq_tokens;
const int64_t n_seqs_unq = ubatch->n_seqs_unq;
@@ -1125,8 +1128,8 @@ ggml_tensor * llm_graph_context::build_ffn(
if (down) {
cur = build_lora_mm(down, cur);
if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE) {
// GLM4 and GLM4_MOE seem to have numerical issues with half-precision accumulators
if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE || arch == LLM_ARCH_JAIS2) {
// GLM4, GLM4_MOE, and JAIS2 seem to have numerical issues with half-precision accumulators
ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
}
}
@@ -1721,7 +1724,8 @@ ggml_tensor * llm_graph_context::build_attn_mha(
ggml_tensor * cur;
if (cparams.flash_attn && kq_b == nullptr) {
const bool use_flash_attn = cparams.flash_attn && kq_b == nullptr;
if (use_flash_attn) {
GGML_ASSERT(kq_b == nullptr && "Flash attention does not support KQ bias yet");
if (v_trans) {
@@ -1981,8 +1985,8 @@ ggml_tensor * llm_graph_context::build_attn(
if (wo) {
cur = build_lora_mm(wo, cur);
if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE) {
// GLM4 and GLM4_MOE seem to have numerical issues with half-precision accumulators
if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE || arch == LLM_ARCH_JAIS2) {
// GLM4, GLM4_MOE, and JAIS2 seem to have numerical issues with half-precision accumulators
ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
}
}
@@ -2414,8 +2418,9 @@ llm_graph_input_mem_hybrid_iswa * llm_graph_context::build_inp_mem_hybrid_iswa()
void llm_graph_context::build_dense_out(
ggml_tensor * dense_2,
ggml_tensor * dense_2_b,
ggml_tensor * dense_3) const {
if (!cparams.embeddings || !(dense_2 || dense_3)) {
if (!cparams.embeddings || !(dense_2 || dense_2_b || dense_3)) {
return;
}
ggml_tensor * cur = res->t_embd_pooled != nullptr ? res->t_embd_pooled : res->t_embd;
@@ -2424,6 +2429,9 @@ void llm_graph_context::build_dense_out(
if (dense_2) {
cur = ggml_mul_mat(ctx0, dense_2, cur);
}
if (dense_2_b) {
cur = ggml_add(ctx0, cur, dense_2_b);
}
if (dense_3) {
cur = ggml_mul_mat(ctx0, dense_3, cur);
}
@@ -2437,7 +2445,8 @@ void llm_graph_context::build_pooling(
ggml_tensor * cls,
ggml_tensor * cls_b,
ggml_tensor * cls_out,
ggml_tensor * cls_out_b) const {
ggml_tensor * cls_out_b,
ggml_tensor * cls_norm) const {
if (!cparams.embeddings) {
return;
}
@@ -2476,8 +2485,15 @@ void llm_graph_context::build_pooling(
} break;
case LLAMA_POOLING_TYPE_RANK:
{
ggml_tensor * inp_cls = build_inp_cls();
cur = ggml_get_rows(ctx0, inp, inp_cls);
if (arch == LLM_ARCH_MODERN_BERT) {
// modern bert gte reranker builds mean first then applies prediction head and classifier
// https://github.com/huggingface/transformers/blob/main/src/transformers/models/modernbert/modular_modernbert.py#L1404-1411
ggml_tensor * inp_mean = build_inp_mean();
cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean);
} else {
ggml_tensor * inp_cls = build_inp_cls();
cur = ggml_get_rows(ctx0, inp, inp_cls);
}
// classification head
// https://github.com/huggingface/transformers/blob/5af7d41e49bbfc8319f462eb45253dcb3863dfb7/src/transformers/models/roberta/modeling_roberta.py#L1566
@@ -2486,7 +2502,15 @@ void llm_graph_context::build_pooling(
if (cls_b) {
cur = ggml_add(ctx0, cur, cls_b);
}
cur = ggml_tanh(ctx0, cur);
if (arch == LLM_ARCH_MODERN_BERT) {
cur = ggml_gelu(ctx0, cur);
} else {
cur = ggml_tanh(ctx0, cur);
}
if (cls_norm) {
// head norm
cur = build_norm(cur, cls_norm, NULL, LLM_NORM, -1);
}
}
// some models don't have `cls_out`, for example: https://huggingface.co/jinaai/jina-reranker-v1-tiny-en
+3 -1
View File
@@ -1000,7 +1000,8 @@ struct llm_graph_context {
ggml_tensor * cls,
ggml_tensor * cls_b,
ggml_tensor * cls_out,
ggml_tensor * cls_out_b) const;
ggml_tensor * cls_out_b,
ggml_tensor * cls_norm) const;
//
// sampling (backend sampling)
@@ -1014,6 +1015,7 @@ struct llm_graph_context {
void build_dense_out(
ggml_tensor * dense_2,
ggml_tensor * dense_2_b,
ggml_tensor * dense_3) const;
};
+1
View File
@@ -271,6 +271,7 @@ void llama_model_saver::add_tensors_from_model() {
add_tensor(model.cls_b);
add_tensor(model.cls_out);
add_tensor(model.cls_out_b);
add_tensor(model.cls_norm);
for (const struct llama_layer & layer : model.layers) {
for (size_t i = 0; i < sizeof(layer)/sizeof(struct ggml_tensor *); ++i) {
+74 -8
View File
@@ -908,7 +908,7 @@ void llama_model::load_hparams(llama_model_loader & ml) {
ml.get_key(LLM_KV_ROPE_FREQ_BASE_SWA, hparams.rope_freq_base_train_swa);
ml.get_key_or_arr(LLM_KV_ATTENTION_SLIDING_WINDOW_PATTERN, swa_period, false);
hparams.set_swa_pattern(swa_period);
hparams.set_swa_pattern(swa_period, true);
} else {
hparams.swa_type = LLAMA_SWA_TYPE_NONE;
}
@@ -1937,6 +1937,16 @@ void llama_model::load_hparams(llama_model_loader & ml) {
default: type = LLM_TYPE_UNKNOWN;
}
} break;
case LLM_ARCH_JAIS2:
{
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_EPS, hparams.f_norm_eps);
switch (hparams.n_layer) {
case 32: type = LLM_TYPE_8B; break;
case 68: type = LLM_TYPE_70B; break;
default: type = LLM_TYPE_UNKNOWN;
}
} break;
case LLM_ARCH_NEMOTRON:
{
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_EPS, hparams.f_norm_eps);
@@ -2348,6 +2358,12 @@ void llama_model::load_hparams(llama_model_loader & ml) {
case 10752: type = LLM_TYPE_2_6B; break;
default: type = LLM_TYPE_UNKNOWN;
}
if (const auto is_swa = ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa, false); is_swa && hparams.n_swa > 0) {
hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;
for (uint32_t il = 0; il < hparams.n_layer; ++il) {
hparams.swa_layers[il] = !hparams.recurrent_layer_arr[il];
}
}
} break;
case LLM_ARCH_LFM2MOE:
{
@@ -3513,9 +3529,10 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
layer.ffn_norm = create_tensor(tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, 0);
}
cls = create_tensor(tn(LLM_TENSOR_CLS, "weight"), {n_embd, n_embd}, TENSOR_NOT_REQUIRED);
cls_out = create_tensor(tn(LLM_TENSOR_CLS_OUT, "weight"), {n_embd, hparams.n_cls_out}, TENSOR_NOT_REQUIRED);
cls_out_b = create_tensor(tn(LLM_TENSOR_CLS_OUT, "bias"), {hparams.n_cls_out}, TENSOR_NOT_REQUIRED);
cls_out = create_tensor(tn(LLM_TENSOR_CLS_OUT, "weight"), {n_embd, hparams.n_cls_out}, TENSOR_NOT_REQUIRED);
cls_out_b = create_tensor(tn(LLM_TENSOR_CLS_OUT, "bias"), {hparams.n_cls_out}, TENSOR_NOT_REQUIRED);
cls = create_tensor(tn(LLM_TENSOR_CLS, "weight"), {n_embd, n_embd}, TENSOR_NOT_REQUIRED);
cls_norm = create_tensor(tn(LLM_TENSOR_CLS_NORM, "weight"), {n_embd}, TENSOR_NOT_REQUIRED);
} break;
case LLM_ARCH_NEO_BERT:
@@ -5368,6 +5385,45 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
layer.ffn_up_b = create_tensor(tn(LLM_TENSOR_FFN_UP, "bias", i), {n_ff}, 0);
}
} break;
case LLM_ARCH_JAIS2:
{
tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);
// output
output_norm = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd}, 0);
output_norm_b = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "bias"), {n_embd}, 0);
output = create_tensor(tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab}, TENSOR_NOT_REQUIRED);
if (!output) {
output = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, TENSOR_DUPLICATED);
}
for (int i = 0; i < n_layer; ++i) {
auto & layer = layers[i];
layer.attn_norm = create_tensor(tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd}, 0);
layer.attn_norm_b = create_tensor(tn(LLM_TENSOR_ATTN_NORM, "bias", i), {n_embd}, 0);
layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "weight", i), {n_embd, n_embd_head_k * n_head}, 0);
layer.wk = create_tensor(tn(LLM_TENSOR_ATTN_K, "weight", i), {n_embd, n_embd_k_gqa}, 0);
layer.wv = create_tensor(tn(LLM_TENSOR_ATTN_V, "weight", i), {n_embd, n_embd_v_gqa}, 0);
layer.wo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "weight", i), {n_embd_head_k * n_head, n_embd}, 0);
// attention biases - all have shape n_embd (output dimension of projections)
layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", i), {n_embd}, 0);
layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", i), {n_embd}, 0);
layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", i), {n_embd}, 0);
layer.bo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd}, 0);
layer.ffn_norm = create_tensor(tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, 0);
layer.ffn_norm_b = create_tensor(tn(LLM_TENSOR_FFN_NORM, "bias", i), {n_embd}, 0);
// Jais-2 uses simple MLP (no gate) with biases
layer.ffn_up = create_tensor(tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff}, 0);
layer.ffn_up_b = create_tensor(tn(LLM_TENSOR_FFN_UP, "bias", i), {n_ff}, 0);
layer.ffn_down = create_tensor(tn(LLM_TENSOR_FFN_DOWN, "weight", i), {n_ff, n_embd}, 0);
layer.ffn_down_b = create_tensor(tn(LLM_TENSOR_FFN_DOWN, "bias", i), {n_embd}, 0);
}
} break;
case LLM_ARCH_CHATGLM:
{
tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);
@@ -6895,7 +6951,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
}
// for LFM2-ColBert-350M
dense_2_out_layers = create_tensor(tn(LLM_TENSOR_DENSE_2_OUT, "weight"), {n_embd, hparams.n_embd_out()}, TENSOR_NOT_REQUIRED);
dense_2_out_layers = create_tensor(tn(LLM_TENSOR_DENSE_2_OUT, "weight"), {n_embd, hparams.n_embd_out()}, TENSOR_NOT_REQUIRED);
dense_2_out_layers_b = create_tensor(tn(LLM_TENSOR_DENSE_2_OUT, "bias"), {hparams.n_embd_out() }, TENSOR_NOT_REQUIRED);
} break;
case LLM_ARCH_SMALLTHINKER:
{
@@ -8553,6 +8610,10 @@ ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
{
llm = std::make_unique<llm_build_jais>(*this, params);
} break;
case LLM_ARCH_JAIS2:
{
llm = std::make_unique<llm_build_jais2>(*this, params);
} break;
case LLM_ARCH_NEMOTRON:
{
llm = std::make_unique<llm_build_nemotron>(*this, params);
@@ -8671,7 +8732,11 @@ ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
case LLM_ARCH_LFM2:
case LLM_ARCH_LFM2MOE:
{
llm = std::make_unique<llm_build_lfm2>(*this, params);
if (hparams.swa_type == LLAMA_SWA_TYPE_STANDARD) {
llm = std::make_unique<llm_build_lfm2<true>>(*this, params);
} else {
llm = std::make_unique<llm_build_lfm2<false>>(*this, params);
}
} break;
case LLM_ARCH_SMALLTHINKER:
{
@@ -8734,7 +8799,7 @@ ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
}
// add on pooling layer
llm->build_pooling(cls, cls_b, cls_out, cls_out_b);
llm->build_pooling(cls, cls_b, cls_out, cls_out_b, cls_norm);
// add backend sampling layers (if any)
llm->build_sampling();
@@ -8743,7 +8808,7 @@ ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
// there will be two additional dense projection layers
// dense linear projections are applied after pooling
// TODO: move reranking logic here and generalize
llm->build_dense_out(dense_2_out_layers, dense_3_out_layers);
llm->build_dense_out(dense_2_out_layers, dense_2_out_layers_b, dense_3_out_layers);
llm->res->set_outputs();
@@ -8961,6 +9026,7 @@ llama_rope_type llama_model_rope_type(const llama_model * model) {
case LLM_ARCH_BAILINGMOE2:
case LLM_ARCH_DOTS1:
case LLM_ARCH_HUNYUAN_MOE:
case LLM_ARCH_JAIS2:
case LLM_ARCH_OPENAI_MOE:
case LLM_ARCH_HUNYUAN_DENSE:
case LLM_ARCH_LFM2:
+4 -2
View File
@@ -475,6 +475,7 @@ struct llama_model {
struct ggml_tensor * cls_b = nullptr;
struct ggml_tensor * cls_out = nullptr;
struct ggml_tensor * cls_out_b = nullptr;
struct ggml_tensor * cls_norm = nullptr;
struct ggml_tensor * conv1d = nullptr;
struct ggml_tensor * conv1d_b = nullptr;
@@ -491,8 +492,9 @@ struct llama_model {
//Dense linear projections for SentenceTransformers models like embeddinggemma
// For Sentence Transformers models structure see
// https://sbert.net/docs/sentence_transformer/usage/custom_models.html#structure-of-sentence-transformer-models
struct ggml_tensor * dense_2_out_layers = nullptr;
struct ggml_tensor * dense_3_out_layers = nullptr;
struct ggml_tensor * dense_2_out_layers = nullptr;
struct ggml_tensor * dense_2_out_layers_b = nullptr;
struct ggml_tensor * dense_3_out_layers = nullptr;
// gguf metadata
std::unordered_map<std::string, std::string> gguf_kv;
+13 -1
View File
@@ -289,6 +289,15 @@ struct llm_tokenizer_bpe : llm_tokenizer {
"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
};
break;
case LLAMA_VOCAB_PRE_TYPE_JAIS2:
regex_exprs = {
// original regex from tokenizer.json
//"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s{512}(?!\\S)|\\s{256}(?!\\S)|\\s{128}(?!\\S)|\\s{64}(?!\\S)|\\s{32}(?!\\S)|\\s{16}(?!\\S)|\\s{8}(?!\\S)|\\s{4}(?!\\S)|\\s{1,2}(?!\\S)|\\s{1}",
// adapted: same as llama3 but with cascading whitespace pattern
"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s{512}(?!\\S)|\\s{256}(?!\\S)|\\s{128}(?!\\S)|\\s{64}(?!\\S)|\\s{32}(?!\\S)|\\s{16}(?!\\S)|\\s{8}(?!\\S)|\\s{4}(?!\\S)|\\s{1,2}(?!\\S)|\\s{1}",
};
break;
case LLAMA_VOCAB_PRE_TYPE_DBRX:
case LLAMA_VOCAB_PRE_TYPE_SMAUG:
regex_exprs = {
@@ -1921,8 +1930,11 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
tokenizer_pre == "jina-v2-de" ||
tokenizer_pre == "a.x-4.0" ||
tokenizer_pre == "mellum" ||
tokenizer_pre == "modern-bert" ) {
tokenizer_pre == "modern-bert") {
pre_type = LLAMA_VOCAB_PRE_TYPE_GPT2;
} else if (
tokenizer_pre == "jais-2") {
pre_type = LLAMA_VOCAB_PRE_TYPE_JAIS2;
} else if (
tokenizer_pre == "jina-v1-en" ||
tokenizer_pre == "jina-v2-code" ||
+1
View File
@@ -57,6 +57,7 @@ enum llama_vocab_pre_type {
LLAMA_VOCAB_PRE_TYPE_QWEN35 = 46,
LLAMA_VOCAB_PRE_TYPE_TINY_AYA = 47,
LLAMA_VOCAB_PRE_TYPE_JOYAI_LLM = 48,
LLAMA_VOCAB_PRE_TYPE_JAIS2 = 49,
};
struct LLM_KV;
+123
View File
@@ -0,0 +1,123 @@
#include "models.h"
// JAIS-2 model graph builder
// Uses: LayerNorm (not RMSNorm), relu2 activation, separate Q/K/V, RoPE embeddings
llm_build_jais2::llm_build_jais2(const llama_model & model, const llm_graph_params & params) : llm_graph_context(params) {
const int64_t n_embd_head = hparams.n_embd_head_v;
GGML_ASSERT(n_embd_head == hparams.n_embd_head_k);
GGML_ASSERT(n_embd_head == hparams.n_rot);
ggml_tensor * cur;
ggml_tensor * inpL;
inpL = build_inp_embd(model.tok_embd);
// inp_pos - contains the positions
ggml_tensor * inp_pos = build_inp_pos();
// KV input for attention
auto * inp_attn = build_attn_inp_kv();
ggml_tensor * inp_out_ids = build_inp_out_ids();
for (int il = 0; il < n_layer; ++il) {
// Pre-attention LayerNorm
cur = build_norm(inpL,
model.layers[il].attn_norm,
model.layers[il].attn_norm_b,
LLM_NORM, il);
cb(cur, "attn_norm", il);
// Self-attention with separate Q, K, V projections
{
ggml_tensor * Qcur = build_lora_mm(model.layers[il].wq, cur);
cb(Qcur, "Qcur", il);
Qcur = ggml_add(ctx0, Qcur, model.layers[il].bq);
cb(Qcur, "Qcur_bias", il);
ggml_tensor * Kcur = build_lora_mm(model.layers[il].wk, cur);
cb(Kcur, "Kcur", il);
Kcur = ggml_add(ctx0, Kcur, model.layers[il].bk);
cb(Kcur, "Kcur_bias", il);
ggml_tensor * Vcur = build_lora_mm(model.layers[il].wv, cur);
cb(Vcur, "Vcur", il);
Vcur = ggml_add(ctx0, Vcur, model.layers[il].bv);
cb(Vcur, "Vcur_bias", il);
// Reshape for attention
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);
// Apply RoPE
Qcur = ggml_rope_ext(
ctx0, Qcur, inp_pos, nullptr,
n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow
);
Kcur = ggml_rope_ext(
ctx0, Kcur, inp_pos, nullptr,
n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow
);
cb(Qcur, "Qcur_rope", il);
cb(Kcur, "Kcur_rope", il);
cur = build_attn(inp_attn,
model.layers[il].wo, model.layers[il].bo,
Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, 1.0f/sqrtf(float(n_embd_head)), il);
}
if (il == n_layer - 1 && inp_out_ids) {
cur = ggml_get_rows(ctx0, cur, inp_out_ids);
inpL = ggml_get_rows(ctx0, inpL, inp_out_ids);
}
// Residual connection
ggml_tensor * ffn_inp = ggml_add(ctx0, cur, inpL);
cb(ffn_inp, "ffn_inp", il);
// Pre-FFN LayerNorm
cur = build_norm(ffn_inp,
model.layers[il].ffn_norm,
model.layers[il].ffn_norm_b,
LLM_NORM, il);
cb(cur, "ffn_norm", il);
// FFN with relu2 activation (ReLU squared) - no gate projection
// up -> relu2 -> down
cur = build_ffn(cur,
model.layers[il].ffn_up, model.layers[il].ffn_up_b, NULL,
NULL, NULL, NULL, // no gate
model.layers[il].ffn_down, model.layers[il].ffn_down_b, NULL,
NULL,
LLM_FFN_RELU_SQR, LLM_FFN_SEQ, il);
cb(cur, "ffn_out", il);
// Residual connection
inpL = ggml_add(ctx0, cur, ffn_inp);
inpL = build_cvec(inpL, il);
cb(inpL, "l_out", il);
}
// Final LayerNorm
cur = build_norm(inpL,
model.output_norm,
model.output_norm_b,
LLM_NORM, -1);
cb(cur, "result_norm", -1);
res->t_embd = cur;
// Output projection
cur = build_lora_mm(model.output, cur);
cb(cur, "result_output", -1);
res->t_logits = cur;
ggml_build_forward_expand(gf, cur);
}
+5 -7
View File
@@ -149,17 +149,19 @@ llm_build_kimi_linear::llm_build_kimi_linear(const llama_model & model, const ll
g1 = ggml_mul(ctx0, g1, A);
cb(g1, "kda_g1", il);
g1 = ggml_reshape_4d(ctx0, g1, head_dim, n_head, n_seq_tokens, n_seqs);
// Compute beta (mixing coefficient)
ggml_tensor * beta = ggml_mul_mat(ctx0, layer.ssm_beta, cur);
beta = ggml_reshape_4d(ctx0, beta, n_head, 1, n_seq_tokens, n_seqs);
beta = ggml_reshape_4d(ctx0, beta, 1, n_head, n_seq_tokens, n_seqs);
cb(beta, "kda_beta", il);
beta = ggml_sigmoid(ctx0, beta);
// Reshape for KDA recurrence
// {n_embd, n_tokens} -> {n_embd, n_seq_tokens, n_seqs}
cur = ggml_reshape_3d(ctx0, cur, cur->ne[0], n_seq_tokens, n_seqs);
g1 = ggml_reshape_4d(ctx0, g1, head_dim, n_head, n_seq_tokens, n_seqs);
// Get SSM state and compute KDA recurrence using ggml_kda_scan
ggml_tensor * ssm_states_all = mctx_cur->get_s_l(il);
ggml_tensor * state = build_rs(inp_rs, ssm_states_all, hparams.n_embd_s(), n_seqs);
@@ -169,10 +171,6 @@ llm_build_kimi_linear::llm_build_kimi_linear(const llama_model & model, const ll
Qcur = ggml_l2_norm(ctx0, Qcur, eps_norm);
Kcur = ggml_l2_norm(ctx0, Kcur, eps_norm);
beta = ggml_sigmoid(ctx0, beta);
beta = ggml_reshape_4d(ctx0, beta, 1, n_head, n_seq_tokens, n_seqs);
g1 = ggml_reshape_4d(ctx0, g1, head_dim, n_head, n_seq_tokens, n_seqs);
// Choose between build_delta_net_chunking and build_delta_net_recurrent based on n_tokens
std::pair<ggml_tensor *, ggml_tensor *> attn_out = n_seq_tokens == 1 ?
+138 -123
View File
@@ -1,18 +1,149 @@
#include "models.h"
#include "../llama-memory-hybrid-iswa.h"
#include "../llama-memory-hybrid.h"
template <bool iswa>
llm_build_lfm2<iswa>::llm_build_lfm2(const llama_model & model, const llm_graph_params & params) :
llm_graph_context(params) {
using inp_hybrid_type = std::conditional_t<iswa, llm_graph_input_mem_hybrid_iswa, llm_graph_input_mem_hybrid>;
using inp_attn_type = std::conditional_t<iswa, llm_graph_input_attn_kv_iswa, llm_graph_input_attn_kv>;
using mem_hybrid_ctx = std::conditional_t<iswa, llama_memory_hybrid_iswa_context, llama_memory_hybrid_context>;
llm_build_lfm2::llm_build_lfm2(const llama_model & model, const llm_graph_params & params) :
llm_graph_context(params),
model(model) {
// lambda helpers for readability
auto build_dense_feed_forward = [&model, this](ggml_tensor * cur, int il) -> ggml_tensor * {
GGML_ASSERT(!model.layers[il].ffn_up_b);
GGML_ASSERT(!model.layers[il].ffn_gate_b);
GGML_ASSERT(!model.layers[il].ffn_down_b);
return build_ffn(cur,
model.layers[il].ffn_up, NULL, NULL,
model.layers[il].ffn_gate, NULL, NULL,
model.layers[il].ffn_down, NULL, NULL,
NULL, LLM_FFN_SILU, LLM_FFN_PAR, il);
};
auto build_moe_feed_forward = [&model, this](ggml_tensor * cur, int il) -> ggml_tensor * {
return build_moe_ffn(cur,
model.layers[il].ffn_gate_inp, model.layers[il].ffn_up_exps,
model.layers[il].ffn_gate_exps, model.layers[il].ffn_down_exps,
model.layers[il].ffn_exp_probs_b, n_expert, n_expert_used, LLM_FFN_SILU, true, false, 0.0,
static_cast<llama_expert_gating_func_type>(hparams.expert_gating_func), il);
};
auto build_attn_block = [&model, this](ggml_tensor * cur,
ggml_tensor * inp_pos,
inp_attn_type * inp_attn,
int il) -> ggml_tensor * {
GGML_ASSERT(hparams.n_embd_v_gqa(il) == hparams.n_embd_k_gqa(il));
const auto n_embd_head = hparams.n_embd_head_v;
const auto n_head_kv = hparams.n_head_kv(il);
auto * q = build_lora_mm(model.layers[il].wq, cur);
cb(q, "model.layers.{}.self_attn.q_proj", il);
auto * k = build_lora_mm(model.layers[il].wk, cur);
cb(k, "model.layers.{}.self_attn.k_proj", il);
auto * v = build_lora_mm(model.layers[il].wv, cur);
cb(v, "model.layers.{}.self_attn.v_proj", il);
q = ggml_reshape_3d(ctx0, q, n_embd_head, n_head, n_tokens);
k = ggml_reshape_3d(ctx0, k, n_embd_head, n_head_kv, n_tokens);
v = ggml_reshape_3d(ctx0, v, n_embd_head, n_head_kv, n_tokens);
// qk norm
q = build_norm(q, model.layers[il].attn_q_norm, NULL, LLM_NORM_RMS, il);
cb(q, "model.layers.{}.self_attn.q_layernorm", il);
k = build_norm(k, model.layers[il].attn_k_norm, NULL, LLM_NORM_RMS, il);
cb(k, "model.layers.{}.self_attn.k_layernorm", il);
// RoPE
q = ggml_rope_ext(ctx0, q, inp_pos, nullptr, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, ext_factor,
attn_factor, beta_fast, beta_slow);
k = ggml_rope_ext(ctx0, k, inp_pos, nullptr, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, ext_factor,
attn_factor, beta_fast, beta_slow);
cur = build_attn(inp_attn,
model.layers[il].wo, NULL,
q, k, v, nullptr, nullptr, nullptr, 1.0f / sqrtf(float(n_embd_head)), il);
cb(cur, "model.layers.{}.self_attn.out_proj", il);
return cur;
};
auto build_shortconv_block = [&model, this](ggml_tensor * cur,
llm_graph_input_rs * inp_recr,
int il) -> ggml_tensor * {
const auto * mctx_cur = static_cast<const mem_hybrid_ctx *>(mctx)->get_recr();
const uint32_t kv_head = mctx_cur->get_head();
const int64_t n_seq_tokens = ubatch.n_seq_tokens;
const int64_t n_seqs = ubatch.n_seqs;
GGML_ASSERT(n_seqs != 0);
GGML_ASSERT(ubatch.equal_seqs());
GGML_ASSERT(ubatch.n_tokens == n_seq_tokens * n_seqs);
GGML_ASSERT(hparams.n_shortconv_l_cache > 1);
const uint32_t d_conv = hparams.n_shortconv_l_cache - 1;
// {n_embd, n_tokens} => {n_embd, n_seq_tokens, n_seqs}
cur = ggml_reshape_3d(ctx0, cur, cur->ne[0], n_seq_tokens, n_seqs);
auto * bcx = build_lora_mm(model.layers[il].shortconv.in_proj, cur);
cb(bcx, "model.layers.{}.conv.in_proj", il);
constexpr auto n_chunks = 3;
GGML_ASSERT(bcx->ne[0] % n_chunks == 0);
const auto chunk_size = bcx->ne[0] / n_chunks;
auto * b = ggml_view_3d(ctx0, bcx, chunk_size, bcx->ne[1], bcx->ne[2], bcx->nb[1], bcx->nb[2],
0 * chunk_size * ggml_element_size(bcx));
auto * c = ggml_view_3d(ctx0, bcx, chunk_size, bcx->ne[1], bcx->ne[2], bcx->nb[1], bcx->nb[2],
1 * chunk_size * ggml_element_size(bcx));
auto * x = ggml_view_3d(ctx0, bcx, chunk_size, bcx->ne[1], bcx->ne[2], bcx->nb[1], bcx->nb[2],
2 * chunk_size * ggml_element_size(bcx));
auto * bx = ggml_transpose(ctx0, ggml_mul(ctx0, b, x));
// read conv state
auto * conv_state = mctx_cur->get_r_l(il);
auto * conv_rs = build_rs(inp_recr, conv_state, hparams.n_embd_r(), n_seqs);
auto * conv = ggml_reshape_3d(ctx0, conv_rs, d_conv, hparams.n_embd, n_seqs);
bx = ggml_concat(ctx0, conv, bx, 0);
GGML_ASSERT(bx->ne[0] > conv->ne[0]);
// last d_conv columns is a new conv state
auto * new_conv = ggml_view_3d(ctx0, bx, conv->ne[0], bx->ne[1], bx->ne[2], bx->nb[1], bx->nb[2],
(bx->ne[0] - conv->ne[0]) * ggml_element_size(bx));
GGML_ASSERT(ggml_are_same_shape(conv, new_conv));
// write new conv conv state
ggml_build_forward_expand(gf, ggml_cpy(ctx0, new_conv,
ggml_view_1d(ctx0, conv_state, ggml_nelements(new_conv),
kv_head * d_conv * n_embd * ggml_element_size(new_conv))));
auto * conv_kernel = model.layers[il].shortconv.conv;
auto * conv_out = ggml_ssm_conv(ctx0, bx, conv_kernel);
cb(conv_out, "model.layers.{}.conv.conv", il);
auto * y = ggml_mul(ctx0, c, conv_out);
y = build_lora_mm(model.layers[il].shortconv.out_proj, y);
cb(y, "model.layers.{}.conv.out_proj", il);
// {n_embd, n_seq_tokens, n_seqs} => {n_embd, n_tokens}
y = ggml_reshape_2d(ctx0, y, y->ne[0], n_seq_tokens * n_seqs);
return y;
};
// actual graph construction starts here
ggml_tensor * cur = build_inp_embd(model.tok_embd);
cb(cur, "model.embed_tokens", -1);
ggml_build_forward_expand(gf, cur);
inp_hybrid_type * inp_hybrid = nullptr;
if constexpr (iswa) {
inp_hybrid = build_inp_mem_hybrid_iswa();
} else {
inp_hybrid = build_inp_mem_hybrid();
}
ggml_tensor * inp_pos = build_inp_pos();
auto * inp_hybrid = build_inp_mem_hybrid();
ggml_tensor * inp_out_ids = build_inp_out_ids();
for (int il = 0; il < n_layer; ++il) {
@@ -54,122 +185,6 @@ llm_build_lfm2::llm_build_lfm2(const llama_model & model, const llm_graph_params
ggml_build_forward_expand(gf, cur);
}
ggml_tensor * llm_build_lfm2::build_moe_feed_forward(ggml_tensor * cur, int il) const {
return build_moe_ffn(cur,
model.layers[il].ffn_gate_inp, model.layers[il].ffn_up_exps,
model.layers[il].ffn_gate_exps, model.layers[il].ffn_down_exps,
model.layers[il].ffn_exp_probs_b, n_expert, n_expert_used, LLM_FFN_SILU, true, false, 0.0,
static_cast<llama_expert_gating_func_type>(hparams.expert_gating_func), il);
}
ggml_tensor * llm_build_lfm2::build_dense_feed_forward(ggml_tensor * cur, int il) const {
GGML_ASSERT(!model.layers[il].ffn_up_b);
GGML_ASSERT(!model.layers[il].ffn_gate_b);
GGML_ASSERT(!model.layers[il].ffn_down_b);
return build_ffn(cur,
model.layers[il].ffn_up, NULL, NULL,
model.layers[il].ffn_gate, NULL, NULL,
model.layers[il].ffn_down, NULL, NULL,
NULL, LLM_FFN_SILU, LLM_FFN_PAR, il);
}
ggml_tensor * llm_build_lfm2::build_attn_block(ggml_tensor * cur,
ggml_tensor * inp_pos,
llm_graph_input_attn_kv * inp_attn,
int il) const {
GGML_ASSERT(hparams.n_embd_v_gqa(il) == hparams.n_embd_k_gqa(il));
const auto n_embd_head = hparams.n_embd_head_v;
const auto n_head_kv = hparams.n_head_kv(il);
auto * q = build_lora_mm(model.layers[il].wq, cur);
cb(q, "model.layers.{}.self_attn.q_proj", il);
auto * k = build_lora_mm(model.layers[il].wk, cur);
cb(k, "model.layers.{}.self_attn.k_proj", il);
auto * v = build_lora_mm(model.layers[il].wv, cur);
cb(v, "model.layers.{}.self_attn.v_proj", il);
q = ggml_reshape_3d(ctx0, q, n_embd_head, n_head, n_tokens);
k = ggml_reshape_3d(ctx0, k, n_embd_head, n_head_kv, n_tokens);
v = ggml_reshape_3d(ctx0, v, n_embd_head, n_head_kv, n_tokens);
// qk norm
q = build_norm(q, model.layers[il].attn_q_norm, NULL, LLM_NORM_RMS, il);
cb(q, "model.layers.{}.self_attn.q_layernorm", il);
k = build_norm(k, model.layers[il].attn_k_norm, NULL, LLM_NORM_RMS, il);
cb(k, "model.layers.{}.self_attn.k_layernorm", il);
// RoPE
q = ggml_rope_ext(ctx0, q, inp_pos, nullptr, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, ext_factor,
attn_factor, beta_fast, beta_slow);
k = ggml_rope_ext(ctx0, k, inp_pos, nullptr, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, ext_factor,
attn_factor, beta_fast, beta_slow);
cur = build_attn(inp_attn,
model.layers[il].wo, NULL,
q, k, v, nullptr, nullptr, nullptr, 1.0f / sqrtf(float(n_embd_head)), il);
cb(cur, "model.layers.{}.self_attn.out_proj", il);
return cur;
}
ggml_tensor * llm_build_lfm2::build_shortconv_block(ggml_tensor * cur, llm_graph_input_rs * inp_recr, int il) {
const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx)->get_recr();
const uint32_t kv_head = mctx_cur->get_head();
const int64_t n_seq_tokens = ubatch.n_seq_tokens;
const int64_t n_seqs = ubatch.n_seqs;
GGML_ASSERT(n_seqs != 0);
GGML_ASSERT(ubatch.equal_seqs());
GGML_ASSERT(ubatch.n_tokens == n_seq_tokens * n_seqs);
GGML_ASSERT(hparams.n_shortconv_l_cache > 1);
const uint32_t d_conv = hparams.n_shortconv_l_cache - 1;
// {n_embd, n_tokens} => {n_embd, n_seq_tokens, n_seqs}
cur = ggml_reshape_3d(ctx0, cur, cur->ne[0], n_seq_tokens, n_seqs);
auto * bcx = build_lora_mm(model.layers[il].shortconv.in_proj, cur);
cb(bcx, "model.layers.{}.conv.in_proj", il);
constexpr auto n_chunks = 3;
GGML_ASSERT(bcx->ne[0] % n_chunks == 0);
const auto chunk_size = bcx->ne[0] / n_chunks;
auto * b = ggml_view_3d(ctx0, bcx, chunk_size, bcx->ne[1], bcx->ne[2], bcx->nb[1], bcx->nb[2],
0 * chunk_size * ggml_element_size(bcx));
auto * c = ggml_view_3d(ctx0, bcx, chunk_size, bcx->ne[1], bcx->ne[2], bcx->nb[1], bcx->nb[2],
1 * chunk_size * ggml_element_size(bcx));
auto * x = ggml_view_3d(ctx0, bcx, chunk_size, bcx->ne[1], bcx->ne[2], bcx->nb[1], bcx->nb[2],
2 * chunk_size * ggml_element_size(bcx));
auto * bx = ggml_transpose(ctx0, ggml_mul(ctx0, b, x));
// read conv state
auto * conv_state = mctx_cur->get_r_l(il);
auto * conv_rs = build_rs(inp_recr, conv_state, hparams.n_embd_r(), n_seqs);
auto * conv = ggml_reshape_3d(ctx0, conv_rs, d_conv, hparams.n_embd, n_seqs);
bx = ggml_concat(ctx0, conv, bx, 0);
GGML_ASSERT(bx->ne[0] > conv->ne[0]);
// last d_conv columns is a new conv state
auto * new_conv = ggml_view_3d(ctx0, bx, conv->ne[0], bx->ne[1], bx->ne[2], bx->nb[1], bx->nb[2],
(bx->ne[0] - conv->ne[0]) * ggml_element_size(bx));
GGML_ASSERT(ggml_are_same_shape(conv, new_conv));
// write new conv conv state
ggml_build_forward_expand(gf, ggml_cpy(ctx0, new_conv,
ggml_view_1d(ctx0, conv_state, ggml_nelements(new_conv),
kv_head * d_conv * n_embd * ggml_element_size(new_conv))));
auto * conv_kernel = model.layers[il].shortconv.conv;
auto * conv_out = ggml_ssm_conv(ctx0, bx, conv_kernel);
cb(conv_out, "model.layers.{}.conv.conv", il);
auto * y = ggml_mul(ctx0, c, conv_out);
y = build_lora_mm(model.layers[il].shortconv.out_proj, y);
cb(y, "model.layers.{}.conv.out_proj", il);
// {n_embd, n_seq_tokens, n_seqs} => {n_embd, n_tokens}
y = ggml_reshape_2d(ctx0, y, y->ne[0], n_seq_tokens * n_seqs);
return y;
}
// Explicit template instantiations
template struct llm_build_lfm2<true>;
template struct llm_build_lfm2<false>;
+5 -7
View File
@@ -316,6 +316,10 @@ struct llm_build_jais : public llm_graph_context {
llm_build_jais(const llama_model & model, const llm_graph_params & params);
};
struct llm_build_jais2 : public llm_graph_context {
llm_build_jais2(const llama_model & model, const llm_graph_params & params);
};
struct llm_build_jamba : public llm_build_mamba_base {
llm_build_jamba(const llama_model & model, const llm_graph_params & params);
};
@@ -347,15 +351,9 @@ struct llm_build_kimi_linear : public llm_build_delta_net_base {
const llama_model & model;
};
template <bool iswa>
struct llm_build_lfm2 : public llm_graph_context {
const llama_model & model;
llm_build_lfm2(const llama_model & model, const llm_graph_params & params);
ggml_tensor * build_moe_feed_forward(ggml_tensor * cur, int il) const;
ggml_tensor * build_dense_feed_forward(ggml_tensor * cur, int il) const;
ggml_tensor * build_attn_block(ggml_tensor * cur, ggml_tensor * inp_pos, llm_graph_input_attn_kv * inp_attn, int il) const;
ggml_tensor * build_shortconv_block(ggml_tensor * cur, llm_graph_input_rs * inp_recr, int il);
};
struct llm_build_llada : public llm_graph_context {
-7
View File
@@ -104,13 +104,6 @@ llm_build_modern_bert::llm_build_modern_bert(const llama_model & model, const ll
LLM_NORM, -1);
cb(cur, "final_norm_out", -1);
if (hparams.pooling_type == LLAMA_POOLING_TYPE_CLS) {
// extracting cls token
cur = ggml_view_1d(ctx0, cur, hparams.n_embd, 0);
cb(cur, "cls_pooled_embd", -1);
}
cb(cur, "res_embd", -1);
res->t_embd = cur;
ggml_build_forward_expand(gf, cur);
}
+3 -1
View File
@@ -216,7 +216,7 @@ ggml_tensor * llm_build_qwen35::build_layer_attn_linear(
ggml_tensor * z = qkvz.second;
ggml_tensor * beta = build_lora_mm(model.layers[il].ssm_beta, cur);
beta = ggml_reshape_4d(ctx0, beta, num_v_heads, 1, n_seq_tokens, n_seqs);
beta = ggml_reshape_4d(ctx0, beta, 1, num_v_heads, n_seq_tokens, n_seqs);
cb(beta, "beta", il);
beta = ggml_sigmoid(ctx0, beta);
@@ -232,6 +232,8 @@ ggml_tensor * llm_build_qwen35::build_layer_attn_linear(
ggml_tensor * gate = ggml_mul(ctx0, alpha_softplus, model.layers[il].ssm_a); // -A_log.exp() * softplus
cb(gate, "gate", il);
gate = ggml_reshape_4d(ctx0, gate, 1, num_v_heads, n_seq_tokens, n_seqs);
// Get convolution states from cache
ggml_tensor * conv_states_all = mctx_cur->get_r_l(il);
ggml_tensor * ssm_states_all = mctx_cur->get_s_l(il);
+3 -1
View File
@@ -216,7 +216,7 @@ ggml_tensor * llm_build_qwen35moe ::build_layer_attn_linear(
ggml_tensor * z = qkvz.second;
ggml_tensor * beta = build_lora_mm(model.layers[il].ssm_beta, cur);
beta = ggml_reshape_4d(ctx0, beta, num_v_heads, 1, n_seq_tokens, n_seqs);
beta = ggml_reshape_4d(ctx0, beta, 1, num_v_heads, n_seq_tokens, n_seqs);
cb(beta, "beta", il);
beta = ggml_sigmoid(ctx0, beta);
@@ -232,6 +232,8 @@ ggml_tensor * llm_build_qwen35moe ::build_layer_attn_linear(
ggml_tensor * gate = ggml_mul(ctx0, alpha_softplus, model.layers[il].ssm_a); // -A_log.exp() * softplus
cb(gate, "gate", il);
gate = ggml_reshape_4d(ctx0, gate, 1, num_v_heads, n_seq_tokens, n_seqs);
// Get convolution states from cache
ggml_tensor * conv_states_all = mctx_cur->get_r_l(il);
ggml_tensor * ssm_states_all = mctx_cur->get_s_l(il);
+15 -1
View File
@@ -48,6 +48,7 @@ enum handcrafted_file_type {
HANDCRAFTED_DATA_NOT_ENOUGH_DATA = 10 + offset_has_data,
HANDCRAFTED_DATA_BAD_ALIGN = 15 + offset_has_data,
HANDCRAFTED_DATA_INCONSISTENT_ALIGN = 20 + offset_has_data,
HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW = 30 + offset_has_data,
HANDCRAFTED_DATA_SUCCESS = 800 + offset_has_data,
HANDCRAFTED_DATA_CUSTOM_ALIGN = 810 + offset_has_data,
};
@@ -84,6 +85,7 @@ static std::string handcrafted_file_type_name(const enum handcrafted_file_type h
case HANDCRAFTED_DATA_NOT_ENOUGH_DATA: return "DATA_NOT_ENOUGH_DATA";
case HANDCRAFTED_DATA_BAD_ALIGN: return "DATA_BAD_ALIGN";
case HANDCRAFTED_DATA_INCONSISTENT_ALIGN: return "DATA_INCONSISTENT_ALIGN";
case HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW: return "DATA_MEM_SIZE_OVERFLOW";
case HANDCRAFTED_DATA_SUCCESS: return "DATA_SUCCESS";
case HANDCRAFTED_DATA_CUSTOM_ALIGN: return "DATA_CUSTOM_ALIGN";
}
@@ -196,6 +198,13 @@ static FILE * get_handcrafted_file(const unsigned int seed, const enum handcraft
tensor_configs = get_tensor_configs(rng);
}
if (hft == HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW) {
tensor_configs.resize(2);
tensor_configs[0] = { GGML_TYPE_I8, { 0x7FFFFFFFFFFFFFC0, 1, 1, 1 } };
tensor_configs[1] = { GGML_TYPE_I8, { 0x7FFFFFFFFFFFFFC0, 1, 1, 1 } };
}
if (hft == HANDCRAFTED_HEADER_BAD_N_TENSORS) {
const uint64_t n_tensors = -1;
helper_write(file, n_tensors);
@@ -397,7 +406,8 @@ static FILE * get_handcrafted_file(const unsigned int seed, const enum handcraft
for (uint32_t i = 1; i < n_dims; ++i) {
ne *= shape[i];
}
offset += GGML_PAD(ggml_row_size(type, ne), alignment);
offset += GGML_PAD(ggml_row_size(type, ne), (uint64_t) alignment);
}
while (ftell(file) % alignment != 0) {
@@ -411,6 +421,9 @@ static FILE * get_handcrafted_file(const unsigned int seed, const enum handcraft
if (hft == HANDCRAFTED_DATA_NOT_ENOUGH_DATA) {
nbytes -= 1;
}
if (hft == HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW) {
nbytes = 32;
}
for (uint64_t i = 0; i < nbytes; ++i) {
const uint8_t random_byte = i % 256;
helper_write(file, random_byte);
@@ -704,6 +717,7 @@ static std::pair<int, int> test_handcrafted_file(const unsigned int seed) {
HANDCRAFTED_DATA_NOT_ENOUGH_DATA,
HANDCRAFTED_DATA_BAD_ALIGN,
HANDCRAFTED_DATA_INCONSISTENT_ALIGN,
HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW,
HANDCRAFTED_DATA_SUCCESS,
HANDCRAFTED_DATA_CUSTOM_ALIGN,
};
+1 -4
View File
@@ -628,9 +628,6 @@ ggml_tensor * clip_graph::build_attn(
ggml_tensor * v = ggml_permute(ctx0, v_cur, 1, 2, 0, 3);
v = ggml_cont(ctx0, v);
const auto n_tokens = q->ne[1];
const auto n_head = q->ne[2];
ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);
// F32 may not needed for vision encoders?
// ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
@@ -639,7 +636,7 @@ ggml_tensor * clip_graph::build_attn(
ggml_tensor * kqv = ggml_mul_mat(ctx0, v, kq);
cur = ggml_permute(ctx0, kqv, 0, 2, 1, 3);
cur = ggml_cont_2d(ctx0, cur, cur->ne[0]*n_head, n_tokens);
cur = ggml_cont_2d(ctx0, cur, cur->ne[0] * cur->ne[1], cur->ne[2] * cur->ne[3]);
}
cb(cur, "kqv_out", il);
+1 -1
View File
@@ -175,7 +175,7 @@ struct mtmd_context {
clip_context_params ctx_clip_params {
/* use_gpu */ ctx_params.use_gpu,
/* flash_attn_type */ CLIP_FLASH_ATTN_TYPE_AUTO,
/* flash_attn_type */ mtmd_get_clip_flash_attn_type(ctx_params.flash_attn_type),
/* image_min_tokens */ ctx_params.image_min_tokens,
/* image_max_tokens */ ctx_params.image_max_tokens,
/* warmup */ ctx_params.warmup,
+9
View File
@@ -28,6 +28,14 @@ if [ "${1:-}" = "huge" ]; then
echo "Include BIG and HUGE models..."
fi
# Check if the second argument is "flash", then enable flash attention
# This is useful to test if flash attention off works correctly
FLASH_ATTN="on"
if [ "${2:-}" = "flash_off" ] || [ "${1:-}" = "flash_off" ]; then
FLASH_ATTN="off"
echo "Flash attention disabled..."
fi
###############
arr_prefix=()
@@ -143,6 +151,7 @@ for i in "${!arr_hf[@]}"; do
-hf $(printf %q "$hf") \
--image $(printf %q "$SCRIPT_DIR/$inp_file") \
--temp 0 -n 128 \
--flash-attn $(printf %q "$FLASH_ATTN") \
${extra_args}"
# if extra_args does not contain -p, we add a default prompt
Binary file not shown.
+2 -4
View File
@@ -916,8 +916,7 @@ json oaicompat_chat_params_parse(
json image_url = json_value(p, "image_url", json::object());
handle_media(out_files, image_url, opt.media_path);
// replace this chunk with a marker
p["type"] = "text";
p["type"] = "media_marker";
p["text"] = mtmd_default_marker();
p.erase("image_url");
@@ -938,8 +937,7 @@ json oaicompat_chat_params_parse(
// TODO: add audio_url support by reusing handle_media()
// replace this chunk with a marker
p["type"] = "text";
p["type"] = "media_marker";
p["text"] = mtmd_default_marker();
p.erase("input_audio");
@@ -498,7 +498,8 @@ class ChatStore {
MessageRole.USER,
content,
MessageType.TEXT,
parentIdForUserMessage ?? '-1'
parentIdForUserMessage ?? '-1',
extras
);
if (isNewConversation && content)
await conversationsStore.updateConversationName(currentConv.id, content.trim());