common : avoid logging partial messages (which can contain broken UTF-8 sequences) (#14937 )

* bug-fix: don't attempt to log partial parsed messages to avoid crash due to unfinished UTF-8 sequences
CANN: Add ggml_set_rows (#14943 )
2026-07-02 10:37:43 +02:00 · 2025-07-29 17:05:38 +02:00 · 2025-07-29 22:36:43 +08:00 · 2025-07-29 14:22:03 +02:00 · 2025-07-29 10:40:50 +02:00 · 2025-07-29 14:45:18 +08:00
13 changed files with 8645 additions and 168 deletions
@@ -1944,6 +1944,8 @@ common_chat_msg common_chat_parse(const std::string & input, bool is_partial, co
        }
    }
    auto msg = builder.result();
-    LOG_DBG("Parsed message: %s\n", common_chat_msgs_to_json_oaicompat<json>({msg}).at(0).dump().c_str());
+    if (!is_partial) {
+        LOG_DBG("Parsed message: %s\n", common_chat_msgs_to_json_oaicompat<json>({msg}).at(0).dump().c_str());
+    }
    return msg;
 }
@@ -12,91 +12,91 @@ Legend:
 - 🟡 Partially supported by this backend
 - ❌ Not supported by this backend

-| Operation | BLAS | CPU | CUDA | Metal | SYCL | Vulkan |
-|-----------|------|------|------|------|------|------|
-|                              ABS | ❌ | ✅ | 🟡 | 🟡 | 🟡 | ❌ |
-|                              ACC | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                              ADD | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ |
-|                             ADD1 | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ |
-|                           ARANGE | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
-|                           ARGMAX | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                          ARGSORT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                            CLAMP | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 |
-|                           CONCAT | ❌ | ✅ | 🟡 | ✅ | 🟡 | ✅ |
-|                             CONT | ❌ | ✅ | ✅ | ✅ | 🟡 | 🟡 |
-|                          CONV_2D | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |
-|                       CONV_2D_DW | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
-|                CONV_TRANSPOSE_1D | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                CONV_TRANSPOSE_2D | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
-|                              COS | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 |
-|                      COUNT_EQUAL | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
-|                              CPY | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
-|               CROSS_ENTROPY_LOSS | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
-|          CROSS_ENTROPY_LOSS_BACK | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
-|                    DIAG_MASK_INF | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ |
-|                              DIV | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ |
-|                              DUP | ❌ | ✅ | 🟡 | 🟡 | ✅ | 🟡 |
-|                              ELU | ❌ | ✅ | 🟡 | 🟡 | 🟡 | ❌ |
-|                              EXP | ❌ | ✅ | 🟡 | 🟡 | 🟡 | ❌ |
-|                   FLASH_ATTN_EXT | ❌ | ✅ | 🟡 | 🟡 | ❌ | 🟡 |
-|                GATED_LINEAR_ATTN | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ |
-|                            GEGLU | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 |
-|                        GEGLU_ERF | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 |
-|                      GEGLU_QUICK | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 |
-|                             GELU | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 |
-|                         GELU_ERF | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 |
-|                       GELU_QUICK | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 |
-|                         GET_ROWS | ❌ | ✅ | 🟡 | ✅ | 🟡 | 🟡 |
-|                    GET_ROWS_BACK | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
-|                       GROUP_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                      HARDSIGMOID | ❌ | ✅ | 🟡 | 🟡 | 🟡 | ❌ |
-|                        HARDSWISH | ❌ | ✅ | 🟡 | 🟡 | 🟡 | ❌ |
-|                           IM2COL | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ |
-|                          L2_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                       LEAKY_RELU | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                              LOG | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ |
-|                             MEAN | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
-|                              MUL | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ |
-|                          MUL_MAT | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
-|                       MUL_MAT_ID | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ |
-|                              NEG | ❌ | ✅ | 🟡 | 🟡 | 🟡 | ❌ |
-|                             NORM | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 |
-|                   OPT_STEP_ADAMW | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
-|                         OUT_PROD | 🟡 | 🟡 | 🟡 | ❌ | 🟡 | ❌ |
-|                              PAD | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                   PAD_REFLECT_1D | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ |
-|                          POOL_2D | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                            REGLU | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 |
-|                             RELU | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 |
-|                           REPEAT | ❌ | ✅ | 🟡 | ✅ | ✅ | 🟡 |
-|                      REPEAT_BACK | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
-|                         RMS_NORM | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ |
-|                    RMS_NORM_BACK | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
-|                 RMS_NORM_MUL_ADD | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                             ROLL | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |
-|                             ROPE | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                        ROPE_BACK | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
-|                        RWKV_WKV6 | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                        RWKV_WKV7 | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                            SCALE | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                              SET | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ |
-|                         SET_ROWS | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
-|                              SGN | ❌ | ✅ | 🟡 | 🟡 | 🟡 | ❌ |
-|                          SIGMOID | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 |
-|                             SILU | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 |
-|                        SILU_BACK | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
-|                              SIN | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 |
-|                         SOFT_MAX | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ |
-|                    SOFT_MAX_BACK | ❌ | 🟡 | 🟡 | ❌ | ❌ | ✅ |
-|                              SQR | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 |
-|                             SQRT | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ |
-|                         SSM_CONV | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
-|                         SSM_SCAN | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
-|                             STEP | ❌ | ✅ | 🟡 | 🟡 | 🟡 | ❌ |
-|                              SUB | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ |
-|                              SUM | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ |
-|                         SUM_ROWS | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                           SWIGLU | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 |
-|                             TANH | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 |
-|               TIMESTEP_EMBEDDING | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
-|                          UPSCALE | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ |
+| Operation | BLAS | CPU | CUDA | Metal | OpenCL | SYCL | Vulkan |
+|-----------|------|------|------|------|------|------|------|
+|                              ABS | ❌ | ✅ | 🟡 | 🟡 | ❌ | 🟡 | ❌ |
+|                              ACC | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+|                              ADD | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ |
+|                             ADD1 | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ |
+|                           ARANGE | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
+|                           ARGMAX | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+|                          ARGSORT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+|                            CLAMP | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ | 🟡 |
+|                           CONCAT | ❌ | ✅ | 🟡 | ✅ | 🟡 | 🟡 | ✅ |
+|                             CONT | ❌ | ✅ | ✅ | ✅ | 🟡 | 🟡 | 🟡 |
+|                          CONV_2D | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ |
+|                       CONV_2D_DW | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
+|                CONV_TRANSPOSE_1D | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+|                CONV_TRANSPOSE_2D | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
+|                              COS | ❌ | ✅ | ✅ | 🟡 | ❌ | ✅ | 🟡 |
+|                      COUNT_EQUAL | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
+|                              CPY | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
+|               CROSS_ENTROPY_LOSS | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
+|          CROSS_ENTROPY_LOSS_BACK | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
+|                    DIAG_MASK_INF | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ |
+|                              DIV | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ |
+|                              DUP | ❌ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 |
+|                              ELU | ❌ | ✅ | 🟡 | 🟡 | ❌ | 🟡 | ❌ |
+|                              EXP | ❌ | ✅ | 🟡 | 🟡 | ❌ | 🟡 | ❌ |
+|                   FLASH_ATTN_EXT | ❌ | ✅ | 🟡 | 🟡 | ❌ | ❌ | 🟡 |
+|                GATED_LINEAR_ATTN | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ |
+|                            GEGLU | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 |
+|                        GEGLU_ERF | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 |
+|                      GEGLU_QUICK | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 |
+|                             GELU | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
+|                         GELU_ERF | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
+|                       GELU_QUICK | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
+|                         GET_ROWS | ❌ | ✅ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 |
+|                    GET_ROWS_BACK | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ | ❌ |
+|                       GROUP_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+|                      HARDSIGMOID | ❌ | ✅ | 🟡 | 🟡 | ❌ | 🟡 | ❌ |
+|                        HARDSWISH | ❌ | ✅ | 🟡 | 🟡 | ❌ | 🟡 | ❌ |
+|                           IM2COL | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ |
+|                          L2_NORM | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+|                       LEAKY_RELU | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+|                              LOG | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ |
+|                             MEAN | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
+|                              MUL | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ |
+|                          MUL_MAT | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
+|                       MUL_MAT_ID | ❌ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ |
+|                              NEG | ❌ | ✅ | 🟡 | 🟡 | ❌ | 🟡 | ❌ |
+|                             NORM | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 |
+|                   OPT_STEP_ADAMW | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
+|                         OUT_PROD | 🟡 | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ❌ |
+|                              PAD | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+|                   PAD_REFLECT_1D | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
+|                          POOL_2D | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+|                            REGLU | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 |
+|                             RELU | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
+|                           REPEAT | ❌ | ✅ | 🟡 | ✅ | 🟡 | ✅ | 🟡 |
+|                      REPEAT_BACK | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
+|                         RMS_NORM | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ |
+|                    RMS_NORM_BACK | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
+|                 RMS_NORM_MUL_ADD | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+|                             ROLL | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
+|                             ROPE | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+|                        ROPE_BACK | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
+|                        RWKV_WKV6 | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+|                        RWKV_WKV7 | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+|                            SCALE | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+|                              SET | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
+|                         SET_ROWS | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
+|                              SGN | ❌ | ✅ | 🟡 | 🟡 | ❌ | 🟡 | ❌ |
+|                          SIGMOID | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
+|                             SILU | ❌ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
+|                        SILU_BACK | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
+|                              SIN | ❌ | ✅ | ✅ | 🟡 | ❌ | ✅ | 🟡 |
+|                         SOFT_MAX | ❌ | ✅ | ✅ | ✅ | ✅ | 🟡 | ✅ |
+|                    SOFT_MAX_BACK | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ | ✅ |
+|                              SQR | ❌ | ✅ | ✅ | 🟡 | ❌ | ✅ | 🟡 |
+|                             SQRT | ❌ | ✅ | ✅ | 🟡 | ❌ | ✅ | ❌ |
+|                         SSM_CONV | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
+|                         SSM_SCAN | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
+|                             STEP | ❌ | ✅ | 🟡 | 🟡 | ❌ | 🟡 | ❌ |
+|                              SUB | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ |
+|                              SUM | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ |
+|                         SUM_ROWS | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+|                           SWIGLU | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 |
+|                             TANH | ❌ | ✅ | 🟡 | 🟡 | ✅ | 🟡 | 🟡 |
+|               TIMESTEP_EMBEDDING | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+|                          UPSCALE | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ |
@@ -68,6 +68,8 @@
 #include <aclnnop/aclnn_grouped_matmul_v3.h>
 #include <aclnnop/aclnn_fused_infer_attention_score_v2.h>
 #include <aclnnop/aclnn_zero.h>
+#include <aclnnop/aclnn_index_copy.h>
+#include <aclnnop/aclnn_index_select.h>
 #include <float.h>

 #include <cmath>
@@ -1614,50 +1616,97 @@ void ggml_cann_softmax(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
 }

 /**
- * @brief Performs embedding operation on a 4D tensor using the CANN backend.
+ * @brief Performs index select operation on a 4D tensor using the CANN backend.
 *
- * This function extracts slices from the source tensor (`src_buffer`),
- * index tensor (`index`), and destination tensor (`dst`), and performs an
- * embedding operation on them. The embedding operation is applied by iterating
- * over the last two dimensions of the source tensor, creating the necessary
- * tensors for the source, index, and output, and executing the embedding operation.
+ * This function applies the `IndexSelect` operation along a specific dimension
+ * of the source tensor (`src_buffer`) using the indices from the index tensor (`index`).
+ * It iterates over the last two dimensions of the source tensor, creates the corresponding
+ * CANN tensors for the source, index, and output slices, and executes the `IndexSelect`
+ * operation for each slice.
 *
 * @param ctx The context for CANN backend operations.
- * @param src_buffer The source buffer holding the data for the source tensor.
+ * @param src_buffer The source buffer containing the 4D input tensor data.
 * @param src_ne The dimensions of the source tensor.
 * @param src_nb The strides (byte offsets) of the source tensor.
- * @param index The index tensor used in the embedding operation.
- * @param dst The destination tensor where the result will be stored.
+ * @param dst_buffer The destination buffer where the output tensor data will be written.
+ * @param dst_ne The dimensions of the destination tensor.
+ * @param dst_nb The strides (byte offsets) of the destination tensor.
+ * @param index The index tensor specifying the indices to select from the source tensor.
+ * @param type The data type of the source and destination tensors.
 */
-static void aclnn_embedding_4d(ggml_backend_cann_context& ctx, void* src_buffer,
-                            int64_t* src_ne, size_t* src_nb, ggml_tensor* index,
-                            ggml_tensor* dst) {
+static void aclnn_index_select_4d(ggml_backend_cann_context& ctx,
+                                void* src_buffer,int64_t* src_ne, size_t* src_nb,
+                                void* dst_buffer, int64_t* dst_ne, size_t* dst_nb,
+                                ggml_tensor* index, ggml_type type) {
    for (int64_t i = 0; i < src_ne[3]; i++) {
        for (int64_t j = 0; j < src_ne[2]; j++) {
            // src
-            int64_t acl_src_ne[2] = {src_ne[0], src_ne[1]};
-            size_t acl_src_nb[2] = {src_nb[0], src_nb[1]};
            aclTensor* acl_src_tensor = ggml_cann_create_tensor(
                (char*)src_buffer + i * src_nb[3] + j * src_nb[2],
-                ggml_cann_type_mapping(dst->type), ggml_element_size(dst),
-                acl_src_ne, acl_src_nb, 2);
+                ggml_cann_type_mapping(type), ggml_type_size(type),
+                src_ne, src_nb, 2);

            // index
-            int64_t acl_index_ne[1] = {index->ne[0]};
-            size_t acl_index_nb[1] = {index->nb[0]};
            aclTensor* acl_index = ggml_cann_create_tensor(
-                (char*)index->data + i * index->nb[2] + j * index->nb[1],
+                (char*)index->data + (i % index->ne[2]) * index->nb[2] + (j % index->ne[1]) * index->nb[1],
                ggml_cann_type_mapping(index->type), ggml_element_size(index),
-                acl_index_ne, acl_index_nb, 1);
+                index->ne, index->nb, 1);

            // out
-            int64_t acl_out_ne[2] = {dst->ne[0], dst->ne[1]};
-            size_t acl_out_nb[2] = {dst->nb[0], dst->nb[1]};
            aclTensor* acl_out = ggml_cann_create_tensor(
-                (char*)dst->data + i * dst->nb[3] + j * dst->nb[2],
-                ggml_cann_type_mapping(dst->type), ggml_element_size(dst),
-                acl_out_ne, acl_out_nb, 2);
-            GGML_CANN_CALL_ACLNN_OP(ctx, Embedding, acl_src_tensor, acl_index, acl_out);
+                (char*)dst_buffer + i * dst_nb[3] + j * dst_nb[2],
+                ggml_cann_type_mapping(type), ggml_type_size(type),
+                dst_ne, dst_nb, 2);
+            GGML_CANN_CALL_ACLNN_OP(ctx, IndexSelect, acl_src_tensor, 0, acl_index, acl_out);
+            ggml_cann_release_resources(ctx, acl_src_tensor, acl_index, acl_out);
+        }
+    }
+}
+
+/**
+ * @brief Performs inplace index copy operation on a 4D tensor using the CANN backend.
+ *
+ * This function applies the `IndexCopy` operation along a specific dimension of the
+ * destination tensor (`dst_buffer`) by copying elements from the source tensor (`src_buffer`)
+ * to positions specified by the index tensor (`index`).
+ * It iterates over the last two dimensions of the tensors, creates the corresponding
+ * CANN tensors for source, index, and destination slices, and performs the index copy
+ * operation for each slice.
+ *
+ * @param ctx The context for CANN backend operations.
+ * @param src_buffer The source buffer containing the 4D input tensor data to be copied.
+ * @param src_ne The dimensions of the source tensor.
+ * @param src_nb The strides (byte offsets) of the source tensor.
+ * @param dst_buffer The destination buffer where values will be copied to.
+ * @param dst_ne The dimensions of the destination tensor.
+ * @param dst_nb The strides (byte offsets) of the destination tensor.
+ * @param index The index tensor specifying target positions in the destination tensor.
+ * @param type The data type of the source and destination tensors.
+ */
+static void aclnn_index_copy_4d(ggml_backend_cann_context& ctx,
+                                void* src_buffer,int64_t* src_ne, size_t* src_nb,
+                                void* dst_buffer, int64_t* dst_ne, size_t* dst_nb,
+                                ggml_tensor* index, ggml_type type) {
+    for (int64_t i = 0; i < src_ne[3]; i++) {
+        for (int64_t j = 0; j < src_ne[2]; j++) {
+            // src
+            aclTensor* acl_src_tensor = ggml_cann_create_tensor(
+                (char*)src_buffer + i * src_nb[3] + j * src_nb[2],
+                ggml_cann_type_mapping(type), ggml_type_size(type),
+                src_ne, src_nb, 2);
+
+            // index
+            aclTensor* acl_index = ggml_cann_create_tensor(
+                (char*)index->data + (i % index->ne[2]) * index->nb[2] + (j % index->ne[1]) * index->nb[1],
+                ggml_cann_type_mapping(index->type), ggml_element_size(index),
+                index->ne, index->nb, 1);
+
+            // out
+            aclTensor* acl_out = ggml_cann_create_tensor(
+                (char*)dst_buffer + i * dst_nb[3] + j * dst_nb[2],
+                ggml_cann_type_mapping(type), ggml_type_size(type),
+                dst_ne, dst_nb, 2);
+            GGML_CANN_CALL_ACLNN_OP(ctx, InplaceIndexCopy, acl_out, 0, acl_index, acl_src_tensor);
            ggml_cann_release_resources(ctx, acl_src_tensor, acl_index, acl_out);
        }
    }
@@ -1669,8 +1718,9 @@ void ggml_cann_get_rows(ggml_backend_cann_context& ctx, ggml_tensor* dst) {

    switch (src0->type) {
        case GGML_TYPE_F32: {
-            aclnn_embedding_4d(ctx, src0->data, src0->ne, src0->nb, src1,
-                                   dst);
+            aclnn_index_select_4d(ctx, src0->data, src0->ne, src0->nb,
+                                dst->data, dst->ne, dst->nb,
+                                src1, dst->type);
            break;
        }
        case GGML_TYPE_F16: {
@@ -1687,8 +1737,9 @@ void ggml_cann_get_rows(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
                src_trans_buffer, ACL_FLOAT, ggml_type_size(dst->type),
                src0->ne, src_trans_nb, GGML_MAX_DIMS);
            aclnn_cast(ctx, acl_src0, src_trans_tensor, ggml_cann_type_mapping(dst->type));
-            aclnn_embedding_4d(ctx, src_trans_buffer, src0->ne,
-                                   src_trans_nb, src1, dst);
+            aclnn_index_select_4d(ctx, src_trans_buffer, src0->ne, src_trans_nb,
+                                dst->data, dst->ne, dst->nb,
+                                src1, dst->type);
            ggml_cann_release_resources(ctx, acl_src0, src_trans_tensor);
            break;
        }
@@ -1748,8 +1799,10 @@ void ggml_cann_get_rows(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
                dequant_nb[i] = dequant_nb[i - 1] * src0->ne[i - 1];
            }

-            aclnn_embedding_4d(ctx, dequant_buffer_allocator.get(),
-                                   dequant_ne, dequant_nb, src1, dst);
+            aclnn_index_select_4d(ctx, dequant_buffer_allocator.get(),
+                                   dequant_ne, dequant_nb,
+                                   dst->data, dst->ne, dst->nb,
+                                   src1, dst->type);

            ggml_cann_release_resources(ctx, dequant_tensor);
            break;
@@ -1760,6 +1813,43 @@ void ggml_cann_get_rows(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
    }
 }

+void ggml_cann_set_rows(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
+    ggml_tensor* src0 = dst->src[0];  // src
+    ggml_tensor* src1 = dst->src[1];  // index
+
+    switch (dst->type) {
+        case GGML_TYPE_F32: {
+            aclnn_index_copy_4d(ctx, src0->data, src0->ne, src0->nb,
+                                dst->data, dst->ne, dst->nb,
+                                src1, dst->type);
+            break;
+        }
+        case GGML_TYPE_F16: {
+            aclTensor* acl_src0 = ggml_cann_create_tensor(src0);
+            ggml_cann_pool_alloc src_buffer_allocator(
+                ctx.pool(), ggml_nelements(src0) * sizeof(uint16_t));
+            void* src_trans_buffer = src_buffer_allocator.get();
+            size_t src_trans_nb[GGML_MAX_DIMS];
+            src_trans_nb[0] = sizeof(uint16_t);
+            for (int i = 1; i < GGML_MAX_DIMS; i++) {
+                src_trans_nb[i] = src_trans_nb[i - 1] * src0->ne[i - 1];
+            }
+            aclTensor* src_trans_tensor = ggml_cann_create_tensor(
+                src_trans_buffer, ACL_FLOAT16, ggml_type_size(dst->type),
+                src0->ne, src_trans_nb, GGML_MAX_DIMS);
+            aclnn_cast(ctx, acl_src0, src_trans_tensor, ggml_cann_type_mapping(dst->type));
+            aclnn_index_copy_4d(ctx, src_trans_buffer, src0->ne, src_trans_nb,
+                                dst->data, dst->ne, dst->nb,
+                                src1, dst->type);
+            ggml_cann_release_resources(ctx, acl_src0, src_trans_tensor);
+            break;
+        }
+        default:
+            GGML_ABORT("Unsupported tensor type for GGML_OP_SET_ROWS");
+            break;
+    }
+}
+
 /**
 * @brief Repeats elements of a tensor along a specified dimension.
 *
@@ -424,15 +424,25 @@ void ggml_cann_softmax(ggml_backend_cann_context& ctx, ggml_tensor* dst);
 *
 * @details This function retrieves rows from a source tensor src0 according to
 *          the indices provided in another tensor src1 and stores the result in
- *          a destination tensor (\p dst). It supports different data types
- *          including F32, F16, Q4_0, and Q8_0.
+ *          a destination tensor (\p dst).
 *
 * @param ctx The backend CANN context for executing operations.
 * @param dst The destination tensor where the extracted rows will be stored.
- *            dst->op is `GGML_OP_GET_ROWS`.
 */
 void ggml_cann_get_rows(ggml_backend_cann_context& ctx, ggml_tensor* dst);

+/**
+ * @brief   Writes specific rows into a tensor at positions specified by indices.
+ *
+ * @details This function copies rows from a source tensor into a destination
+ *          tensor (\p dst) at the positions indicated by the indices in another
+ *          tensor.
+ *
+ * @param ctx The backend CANN context for executing operations.
+ * @param dst The destination tensor where the specified rows will be updated.
+ */
+void ggml_cann_set_rows(ggml_backend_cann_context& ctx, ggml_tensor* dst);
+
 /**
 * @brief   Executes matrix multiplication for the given tensor.
 *
@@ -1659,6 +1659,9 @@ static bool ggml_cann_compute_forward(ggml_backend_cann_context& ctx,
        case GGML_OP_GET_ROWS:
            ggml_cann_get_rows(ctx, dst);
            break;
+        case GGML_OP_SET_ROWS:
+            ggml_cann_set_rows(ctx, dst);
+            break;
        case GGML_OP_DUP:
            ggml_cann_dup(ctx, dst);
            break;
@@ -2191,13 +2194,15 @@ static bool ggml_backend_cann_supports_op(ggml_backend_dev_t dev,
                    return false;
            }
        } break;
-        case GGML_OP_SET_ROWS:
-            {
-                // TODO: add support
-                // ref: https://github.com/ggml-org/llama.cpp/pull/14274
-#pragma message("TODO: implement F32, F16, BF16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, IQ4_NL support (https://github.com/ggml-org/llama.cpp/pull/14661)")
-                return false;
-            } break;
+        case GGML_OP_SET_ROWS: {
+            switch (op->type) {
+                case GGML_TYPE_F32:
+                case GGML_TYPE_F16:
+                    return true;
+                default:
+                    return false;
+            }
+        } break;
        case GGML_OP_CPY: {
            ggml_tensor *src = op->src[0];
            if ((op->type != GGML_TYPE_F32 && op->type != GGML_TYPE_F16) ||
@@ -31,7 +31,9 @@
 #include "ggml-cuda/pool2d.cuh"
 #include "ggml-cuda/quantize.cuh"
 #include "ggml-cuda/rope.cuh"
+#include "ggml-cuda/roll.cuh"
 #include "ggml-cuda/scale.cuh"
+#include "ggml-cuda/softcap.cuh"
 #include "ggml-cuda/softmax.cuh"
 #include "ggml-cuda/ssm-conv.cuh"
 #include "ggml-cuda/ssm-scan.cuh"
@@ -2419,6 +2421,9 @@ static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct gg
        case GGML_OP_ROPE_BACK:
            ggml_cuda_op_rope_back(ctx, dst);
            break;
+        case GGML_OP_ROLL:
+            ggml_cuda_op_roll(ctx, dst);
+            break;
        case GGML_OP_IM2COL:
            ggml_cuda_op_im2col(ctx, dst);
            break;
@@ -2766,7 +2771,12 @@ static void update_cuda_graph_executable(ggml_backend_cuda_context * cuda_ctx) {
 }
 #endif

-static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph, int node_idx, std::initializer_list<enum ggml_op> ops) {
+static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph, int node_idx, std::initializer_list<enum ggml_op> ops, std::initializer_list<enum ggml_unary_op> unary_ops) {
+#ifndef NDEBUG
+    const size_t num_unary = std::count(ops.begin(), ops.end(), GGML_OP_UNARY);
+    GGML_ASSERT(unary_ops.size() == num_unary);
+#endif
+
    if (!ggml_can_fuse(cgraph, node_idx, ops)) {
        return false;
    }
@@ -2794,9 +2804,32 @@ static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph, int node_idx,
        if (!ggml_is_contiguous_rows(mul->src[0]) || !ggml_is_contiguous_rows(mul->src[1])) {
            return false;
        }
+
+        return true;
    }

-    return true;
+    if (ops.size() == 3 && ops.begin()[0] == GGML_OP_SCALE && ops.begin()[1] == GGML_OP_UNARY && ops.begin()[2] == GGML_OP_SCALE
+     && unary_ops.size() == 1 && unary_ops.begin()[0] == GGML_UNARY_OP_TANH) {
+        const ggml_tensor *scale  = cgraph->nodes[node_idx];
+        const ggml_tensor *tanh   = cgraph->nodes[node_idx+1];
+        const ggml_tensor *scale2 = cgraph->nodes[node_idx+2];
+
+        GGML_ASSERT(scale->src[0]->type == GGML_TYPE_F32);
+        GGML_ASSERT(scale->type == GGML_TYPE_F32);
+
+        if (ggml_get_unary_op(tanh) != GGML_UNARY_OP_TANH) {
+            return false;
+        }
+
+        // Check for bias
+        if (ggml_get_op_params_f32(scale, 1) != 0.0f || ggml_get_op_params_f32(scale2, 1) != 0.0f) {
+            return false;
+        }
+
+        return true;
+    }
+
+    return false;
 }

 static void evaluate_and_capture_cuda_graph(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph * cgraph,
@@ -2817,10 +2850,18 @@ static void evaluate_and_capture_cuda_graph(ggml_backend_cuda_context * cuda_ctx
                }

                static bool disable_fusion = (getenv("GGML_CUDA_DISABLE_FUSION") != nullptr);
-                if (!disable_fusion && ggml_cuda_can_fuse(cgraph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL })) {
-                    ggml_cuda_op_rms_norm_fused(*cuda_ctx, node, cgraph->nodes[i+1]);
-                    i++;
-                    continue;
+                if (!disable_fusion) {
+                    if (ggml_cuda_can_fuse(cgraph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL }, {})) {
+                        ggml_cuda_op_rms_norm_fused(*cuda_ctx, node, cgraph->nodes[i+1]);
+                        i++;
+                        continue;
+                    }
+
+                    if (ggml_cuda_can_fuse(cgraph, i, { GGML_OP_SCALE, GGML_OP_UNARY, GGML_OP_SCALE }, { GGML_UNARY_OP_TANH })) {
+                        i += 2;
+                        ggml_cuda_op_softcap(*cuda_ctx, cgraph->nodes[i], node);
+                        continue;
+                    }
                }
 #ifndef NDEBUG
                assert(node->buffer->buft == ggml_backend_cuda_buffer_type(cuda_ctx->device));
@@ -3411,6 +3452,11 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
            memcpy(&max_bias, (const float *) op->op_params + 1, sizeof(float));
            return max_bias == 0.0f;
        }
+        case GGML_OP_ROLL:
+            if(op->src[0]->type == GGML_TYPE_F32) {
+                return true;
+            }
+            return false;
        case GGML_OP_ROPE:
        case GGML_OP_ROPE_BACK: {
            return op->src[0]->nb[0] == ggml_type_size(op->src[0]->type) && ggml_is_contiguous_2(op->src[0]);
@@ -0,0 +1,67 @@
+#include "ggml-cuda/common.cuh"
+#include "roll.cuh"
+
+static __forceinline__ __device__ int64_t wrap_index(const int64_t idx, const int64_t ne) {
+    if (idx < 0) {
+        return idx + ne;
+    }
+    if (idx >= ne) {
+        return idx - ne;
+    }
+    return idx;
+}
+
+static __global__ void roll_f32_cuda(const float * __restrict__ src,
+                                     float * __restrict__ dst,
+                                     const int64_t ne00,
+                                     const int64_t ne01,
+                                     const int64_t ne02,
+                                     const int64_t ne03,
+                                     const int     s0,
+                                     const int     s1,
+                                     const int     s2,
+                                     const int     s3) {
+    const int64_t idx        = int64_t(blockDim.x) * blockIdx.x + threadIdx.x;
+    const int64_t n_elements = ne00 * ne01 * ne02 * ne03;
+
+    if (idx >= n_elements) {
+        return;
+    }
+
+    const int64_t i0 = idx % ne00;
+    const int64_t i1 = (idx / ne00) % ne01;
+    const int64_t i2 = (idx / (ne00 * ne01)) % ne02;
+    const int64_t i3 = (idx / (ne00 * ne01 * ne02)) % ne03;
+
+    const int64_t d0 = wrap_index(i0 - s0, ne00);
+    const int64_t d1 = wrap_index(i1 - s1, ne01);
+    const int64_t d2 = wrap_index(i2 - s2, ne02);
+    const int64_t d3 = wrap_index(i3 - s3, ne03);
+
+    dst[i3 * (ne00 * ne01 * ne02) + i2 * (ne01 * ne00) + i1 * ne00 + i0] =
+        src[d3 * (ne00 * ne01 * ne02) + d2 * (ne01 * ne00) + d1 * ne00 + d0];
+}
+
+void ggml_cuda_op_roll(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+    int s0 = dst->op_params[0];
+    int s1 = dst->op_params[1];
+    int s2 = dst->op_params[2];
+    int s3 = dst->op_params[3];
+
+    const ggml_tensor * src0   = dst->src[0];
+    const float *       src0_d = (const float *) dst->src[0]->data;
+    float *             dst_d  = (float *) dst->data;
+
+    GGML_TENSOR_UNARY_OP_LOCALS;
+
+    GGML_ASSERT(dst->src[0]->type == GGML_TYPE_F32);
+    GGML_ASSERT(ggml_are_same_shape(dst->src[0], dst));
+
+    cudaStream_t stream = ctx.stream();
+
+    int64_t sz         = (ne00 * ne01 * ne02 * ne03);
+    int64_t num_blocks = (sz + CUDA_ROLL_BLOCK_SIZE - 1) / CUDA_ROLL_BLOCK_SIZE;
+
+    roll_f32_cuda<<<num_blocks, CUDA_ROLL_BLOCK_SIZE, 0, stream>>>(
+        src0_d, dst_d, ne00, ne01, ne02, ne03, s0, s1, s2, s3);
+}
@@ -0,0 +1,5 @@
+#include "common.cuh"
+
+#define CUDA_ROLL_BLOCK_SIZE 256
+
+void ggml_cuda_op_roll(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
@@ -0,0 +1,34 @@
+#include "softcap.cuh"
+
+static __global__ void softcap_f32(const float * x, float * dst, const float scale, const float softcap, const int k) {
+    const int i = blockDim.x*blockIdx.x + threadIdx.x;
+
+    if (i >= k) {
+        return;
+    }
+
+    dst[i] = tanhf(scale * x[i]) * softcap;
+}
+
+static void softcap_f32_cuda(const float * x, float * dst, const float scale, const float softcap, const int k, cudaStream_t stream) {
+    const int num_blocks = (k + CUDA_SOFTCAP_BLOCK_SIZE - 1) / CUDA_SOFTCAP_BLOCK_SIZE;
+    softcap_f32<<<num_blocks, CUDA_SOFTCAP_BLOCK_SIZE, 0, stream>>>(x, dst, scale, softcap, k);
+}
+
+// fused GGML_OP_SCALE + GGML_UNARY_OP_TANH + GGML_OP_SCALE
+void ggml_cuda_op_softcap(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * src) {
+    const ggml_tensor * src0 = src->src[0];
+    const float * src0_d = (const float *)src0->data;
+    float * dst_d = (float *)dst->data;
+    cudaStream_t stream = ctx.stream();
+
+    GGML_ASSERT(src0->type == GGML_TYPE_F32);
+    GGML_ASSERT( dst->type == GGML_TYPE_F32);
+
+    float scale;
+    float softcap;
+    memcpy(&scale,   (float *) src->op_params + 0, sizeof(float));
+    memcpy(&softcap, (float *) dst->op_params + 0, sizeof(float));
+
+    softcap_f32_cuda(src0_d, dst_d, scale, softcap, ggml_nelements(src0), stream);
+}
@@ -0,0 +1,5 @@
+#include "common.cuh"
+
+#define CUDA_SOFTCAP_BLOCK_SIZE 256
+
+void ggml_cuda_op_softcap(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * src);
@@ -32,11 +32,12 @@ def get_prompts_text(dataset_name: str, n_prompts: int) -> Optional[list[str]]:
    return ret


-def get_prompt_lengths_rng(n_prompts: int, prompt_length_min: int, prompt_length_max: int) -> list[int]:
+def get_prompt_lengths_rng(n_prompts: int, prompt_length_min: int, prompt_length_max: int, seed_offset: int) -> list[int]:
    assert n_prompts >= 0
    ret: list[int] = []
    for i in range(n_prompts):
-        random.seed(13 * i + 0)
+        if seed_offset >= 0:
+            random.seed(3 * (seed_offset + 1000 * i) + 0)
        ret.append(random.randint(prompt_length_min, prompt_length_max))
    return ret

@@ -46,12 +47,20 @@ def get_prompts_rng(prompt_lengths: list[int]) -> list[list[int]]:


 def get_server(path_server: str, path_log: Optional[str]) -> dict:
-    logger.info("Starting the llama.cpp server...")
-    hostname: str = os.environ.get("LLAMA_ARG_HOST", "127.0.0.1")
-    port: str = os.environ.get("LLAMA_ARG_PORT", "8080")
+    if os.environ.get("LLAMA_ARG_HOST") is None:
+        logger.info("LLAMA_ARG_HOST not explicitly set, using 127.0.0.1")
+        os.environ["LLAMA_ARG_HOST"] = "127.0.0.1"
+    if os.environ.get("LLAMA_ARG_PORT") is None:
+        logger.info("LLAMA_ARG_PORT not explicitly set, using 8080")
+        os.environ["LLAMA_ARG_PORT"] = "8080"
+    hostname: Optional[str] = os.environ.get("LLAMA_ARG_HOST")
+    port: Optional[str] = os.environ.get("LLAMA_ARG_PORT")
+    assert hostname is not None
+    assert port is not None
    address: str = f"http://{hostname}:{port}"
+    logger.info(f"Starting the llama.cpp server under {address}...")

-    fout = open(path_log, "w") if path_log is not None else subprocess.DEVNULL
+    fout = open(path_log.format(port=port), "w") if path_log is not None else subprocess.DEVNULL
    process = subprocess.Popen([path_server], stdout=fout, stderr=subprocess.STDOUT)

    n_failures: int = 0
@@ -60,7 +69,7 @@ def get_server(path_server: str, path_log: Optional[str]) -> dict:
            sleep(1.0)
            exit_code = process.poll()
            if exit_code is not None:
-                raise RuntimeError(f"llama.cpp server exited unexpectedly with exit code {exit_code}, see {path_log}")
+                raise RuntimeError(f"llama.cpp server exited unexpectedly with exit code {exit_code}{path_log and f', see {path_log.format(port=port)}' or ''}")
            response = requests.get(f"{address}/health")
            if response.status_code == 200:
                break
@@ -128,7 +137,7 @@ def send_prompt(data: dict) -> tuple[float, list[float]]:
    return (t_submit, token_arrival_times)


-def benchmark(path_server: str, path_log: Optional[str], prompt_source: str, n_prompts: int, n_predict: int, n_predict_min: int):
+def benchmark(path_server: str, path_log: Optional[str], prompt_source: str, n_prompts: int, n_predict: int, n_predict_min: int, seed_offset: int):
    if os.environ.get("LLAMA_ARG_N_PARALLEL") is None:
        logger.info("LLAMA_ARG_N_PARALLEL not explicitly set, using 32")
        os.environ["LLAMA_ARG_N_PARALLEL"] = "32"
@@ -139,7 +148,7 @@ def benchmark(path_server: str, path_log: Optional[str], prompt_source: str, n_p
        logger.info("LLAMA_ARG_FLASH_ATTN not explicitly set, using 'true'")
        os.environ["LLAMA_ARG_FLASH_ATTN"] = "true"

-    parallel: int = int(os.environ.get("LLAMA_ARG_N_PARALLEL", 1))
+    parallel: int = int(os.environ.get("LLAMA_ARG_N_PARALLEL")) # type: ignore
    prompts: Union[None, list[str], list[list[int]]] = get_prompts_text(prompt_source, n_prompts)
    synthetic_prompts: bool = prompts is None
    prompt_n = []
@@ -151,7 +160,7 @@ def benchmark(path_server: str, path_log: Optional[str], prompt_source: str, n_p
        prompt_length_min: int = int(prompt_source_split[1])
        prompt_length_max: int = int(prompt_source_split[2])
        logger.info("Generating random prompts...")
-        prompt_n = get_prompt_lengths_rng(n_prompts, prompt_length_min, prompt_length_max)
+        prompt_n = get_prompt_lengths_rng(n_prompts, prompt_length_min, prompt_length_max, seed_offset)
        prompts = get_prompts_rng(prompt_n)
    else:
        n_predict_min = n_predict
@@ -176,10 +185,11 @@ def benchmark(path_server: str, path_log: Optional[str], prompt_source: str, n_p
        data: list[dict] = []

        for i, p in enumerate(prompts):
-            random.seed(13 * i + 1)
+            if seed_offset >= 0:
+                random.seed(3 * (seed_offset + 1000 * i) + 1)
            data.append({
                "session": session, "server_address": server_address, "prompt": p, "synthetic_prompt": synthetic_prompts,
-                "n_predict": random.randint(n_predict_min, n_predict), "seed": 13 * i + 2})
+                "n_predict": random.randint(n_predict_min, n_predict), "seed": (3 * (seed_offset + 1000 * i) + 2) if seed_offset >= 0 else -1})

        if not synthetic_prompts:
            logger.info("Getting the prompt lengths...")
@@ -251,7 +261,7 @@ if __name__ == "__main__":
        "Results are printed to console and visualized as plots (saved to current working directory). "
        "To pass arguments such as the model path to the server, set the corresponding environment variables (see llama-server --help).")
    parser.add_argument("--path_server", type=str, default="llama-server", help="Path to the llama.cpp server binary")
-    parser.add_argument("--path_log", type=str, default="server-bench.log", help="Path to the model to use for the benchmark")
+    parser.add_argument("--path_log", type=str, default="server-bench-{port}.log", help="Path to the model to use for the benchmark")
    parser.add_argument(
        "--prompt_source", type=str, default="rng-1024-2048",
        help="How to get the prompts for the benchmark, either 'mmlu' for MMLU questions or "
@@ -261,5 +271,7 @@ if __name__ == "__main__":
    parser.add_argument(
        "--n_predict_min", type=int, default=1024,
        help="Min. number of tokens to predict per prompt (supported for synthetic prompts only)")
+    parser.add_argument("--seed_offset", type=int, default=0, help="Offset for determining the seeds for pseudorandom prompt/generation lengths. "
+                        "Corelations between seeds can occur when set >= 1000. Negative values mean no seed.")
    args = parser.parse_args()
    benchmark(**vars(args))
@@ -35,6 +35,7 @@
 #include <random>
 #include <regex>
 #include <string>
+#include <string_view>
 #include <thread>
 #include <vector>

@@ -1047,7 +1048,37 @@ struct test_case {
        return t;
    }

-    bool eval(ggml_backend_t backend1, ggml_backend_t backend2, const char * op_name, printer * output_printer) {
+    // Checks an op against the test filter, which is a comma separated list of OP names or specific variations
+    bool matches_filter(ggml_tensor * op, const char * op_names_filter) {
+        if (op_names_filter) {
+            const auto op_name = op_desc(op);
+            const auto op_full_name = op_name + "(" + vars() + ")";
+            std::string_view filter(op_names_filter);
+            while (!filter.empty()) {
+                auto comma_pos = filter.find_first_of(',');
+                const auto lparen_pos = filter.find_first_of('(');
+                if (lparen_pos < comma_pos) {
+                    auto rparen_pos = filter.find_first_of(')');
+                    comma_pos = filter.find_first_of(',', rparen_pos);
+                    const auto op_filter = filter.substr(0, comma_pos);
+                    if (op_filter == op_full_name) {
+                        return true;
+                    }
+                } else {
+                    const auto op_filter = filter.substr(0, comma_pos);
+                    if (op_filter == op_name) {
+                        return true;
+                    }
+                }
+                filter = comma_pos != std::string_view::npos ? filter.substr(comma_pos + 1) : "";
+            }
+            return false;
+        } else {
+            return true;
+        }
+    }
+
+    bool eval(ggml_backend_t backend1, ggml_backend_t backend2, const char * op_names_filter, printer * output_printer) {
        mode = MODE_TEST;

        ggml_init_params params = {
@@ -1065,7 +1096,7 @@ struct test_case {

        ggml_tensor * out = build_graph(ctx);
        std::string current_op_name = op_desc(out);
-        if (op_name != nullptr && current_op_name != op_name) {
+        if (!matches_filter(out, op_names_filter)) {
            //printf("  %s: skipping\n", op_desc(out).c_str());
            ggml_free(ctx);
            return true;
@@ -1212,7 +1243,7 @@ struct test_case {
        return test_passed;
    }

-    bool eval_perf(ggml_backend_t backend, const char * op_name, printer * output_printer) {
+    bool eval_perf(ggml_backend_t backend, const char * op_names_filter, printer * output_printer) {
        mode = MODE_PERF;

        static const size_t graph_nodes = 8192;
@@ -1227,7 +1258,7 @@ struct test_case {

        ggml_tensor * out             = build_graph(ctx.get());
        std::string   current_op_name = op_desc(out);
-        if (op_name != nullptr && current_op_name != op_name) {
+        if (!matches_filter(out, op_names_filter)) {
            //printf("  %s: skipping\n", op_desc(out).c_str());
            return true;
        }
@@ -1342,7 +1373,7 @@ struct test_case {
        return true;
    }

-    bool eval_support(ggml_backend_t backend, const char * op_name, printer * output_printer) {
+    bool eval_support(ggml_backend_t backend, const char * op_names_filter, printer * output_printer) {
        mode = MODE_SUPPORT;

        static const size_t graph_nodes = 8192;
@@ -1357,7 +1388,7 @@ struct test_case {

        ggml_tensor * out             = build_graph(ctx.get());
        std::string   current_op_name = op_desc(out);
-        if (op_name != nullptr && current_op_name != op_name) {
+        if (!matches_filter(out, op_names_filter)) {
            return true;
        }

@@ -1374,7 +1405,7 @@ struct test_case {
        return true;
    }

-    bool eval_grad(ggml_backend_t backend, const char * op_name, printer * output_printer) {
+    bool eval_grad(ggml_backend_t backend, const char * op_names_filter, printer * output_printer) {
        mode = MODE_GRAD;
        const std::vector<float> expect = grad_expect();

@@ -1391,7 +1422,7 @@ struct test_case {

        ggml_tensor * out = build_graph(ctx.get());

-        if ((op_name != nullptr && op_desc(out) != op_name) || out->op == GGML_OP_OPT_STEP_ADAMW) {
+        if (!matches_filter(out, op_names_filter) || out->op == GGML_OP_OPT_STEP_ADAMW) {
            return true;
        }

@@ -2514,6 +2545,41 @@ struct test_scale : public test_case {
    }
 };

+// GGML_OP_SCALE + GGML_UNARY_OP_TANH + GGML_OP_SCALE
+struct test_softcap : public test_case {
+    const ggml_type type;
+    const std::array<int64_t, 4> ne;
+    float softcap;
+
+    std::string op_desc(ggml_tensor * t) override {
+        GGML_UNUSED(t);
+        return "SOFTCAP";
+    }
+
+    bool run_whole_graph() override { return true; }
+
+    std::string vars() override {
+        return VARS_TO_STR3(type, ne, softcap);
+    }
+
+    test_softcap(ggml_type type = GGML_TYPE_F32,
+            std::array<int64_t, 4> ne = {10, 10, 10, 10},
+            float softcap = 30.0f)
+        : type(type), ne(ne), softcap(softcap) {}
+
+    ggml_tensor * build_graph(ggml_context * ctx) override {
+        ggml_tensor * a = ggml_new_tensor(ctx, type, 4, ne.data());
+
+        ggml_set_param(a);
+        ggml_set_name(a, "a");
+
+        ggml_tensor * out = ggml_scale(ctx, ggml_tanh(ctx, ggml_scale(ctx, a, 1.0f / softcap)), softcap);
+        ggml_set_name(out, "out");
+
+        return out;
+    }
+};
+
 // GGML_OP_SILU_BACK
 struct test_silu_back : public test_case {
    const ggml_type type;
@@ -5390,6 +5456,7 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
    test_cases.emplace_back(new test_add1());
    test_cases.emplace_back(new test_scale());
    test_cases.emplace_back(new test_scale(GGML_TYPE_F32, {10, 10, 10, 10}, 2.0f, 1.0f));
+    test_cases.emplace_back(new test_softcap(GGML_TYPE_F32, {10, 10, 10, 10}, 50.0f));
    test_cases.emplace_back(new test_silu_back());

    for (float eps : {0.0f, 1e-6f, 1e-4f, 1e-1f}) {
@@ -5922,7 +5989,7 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
    return test_cases;
 }

-static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op_name, const char * params_filter,
+static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op_names_filter, const char * params_filter,
                         printer * output_printer) {
    auto filter_test_cases = [](std::vector<std::unique_ptr<test_case>> & test_cases, const char * params_filter) {
        if (params_filter == nullptr) {
@@ -5954,7 +6021,7 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op

        size_t n_ok = 0;
        for (auto & test : test_cases) {
-            if (test->eval(backend, backend_cpu, op_name, output_printer)) {
+            if (test->eval(backend, backend_cpu, op_names_filter, output_printer)) {
                n_ok++;
            }
        }
@@ -5970,7 +6037,7 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
        filter_test_cases(test_cases, params_filter);
        size_t n_ok = 0;
        for (auto & test : test_cases) {
-            if (test->eval_grad(backend, op_name, output_printer)) {
+            if (test->eval_grad(backend, op_names_filter, output_printer)) {
                n_ok++;
            }
        }
@@ -5983,7 +6050,7 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
        auto test_cases = make_test_cases_perf();
        filter_test_cases(test_cases, params_filter);
        for (auto & test : test_cases) {
-            test->eval_perf(backend, op_name, output_printer);
+            test->eval_perf(backend, op_names_filter, output_printer);
        }
        return true;
    }
@@ -5992,7 +6059,7 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
        auto test_cases = make_test_cases_eval();
        filter_test_cases(test_cases, params_filter);
        for (auto & test : test_cases) {
-            test->eval_support(backend, op_name, output_printer);
+            test->eval_support(backend, op_names_filter, output_printer);
        }
        return true;
    }
@@ -6001,20 +6068,21 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
 }

 static void usage(char ** argv) {
-    printf("Usage: %s [mode] [-o <op>] [-b <backend>] [-p <params regex>] [--output <console|sql|csv>]\n", argv[0]);
+    printf("Usage: %s [mode] [-o <op,..>] [-b <backend>] [-p <params regex>] [--output <console|sql|csv>]\n", argv[0]);
    printf("    valid modes:\n");
    printf("      - test (default, compare with CPU backend for correctness)\n");
    printf("      - grad (compare gradients from backpropagation with method of finite differences)\n");
    printf("      - perf (performance evaluation)\n");
    printf("      - support (probe backend operation support)\n");
-    printf("    op names for -o are as given by ggml_op_desc() (e.g. ADD, MUL_MAT, etc)\n");
+    printf("    op names for -o are as given by ggml_op_desc() (e.g. ADD, MUL_MAT, etc),\n");
+    printf("        optionally including the full test case string (e.g. \"ADD(type=f16,ne=[1,1,8,1],nr=[1,1,1,1],nf=1)\")\n");
    printf("    --output specifies output format (default: console, options: console, sql, csv)\n");
 }

 int main(int argc, char ** argv) {
    test_mode mode = MODE_TEST;
    output_formats output_format = CONSOLE;
-    const char * op_name_filter = nullptr;
+    const char * op_names_filter = nullptr;
    const char * backend_filter = nullptr;
    const char * params_filter = nullptr;

@@ -6029,7 +6097,7 @@ int main(int argc, char ** argv) {
            mode = MODE_SUPPORT;
        } else if (strcmp(argv[i], "-o") == 0) {
            if (i + 1 < argc) {
-                op_name_filter = argv[++i];
+                op_names_filter = argv[++i];
            } else {
                usage(argv);
                return 1;
@@ -6110,7 +6178,7 @@ int main(int argc, char ** argv) {
                                                             false, "", ggml_backend_dev_description(dev),
                                                             total / 1024 / 1024, free / 1024 / 1024, true));

-        bool ok = test_backend(backend, mode, op_name_filter, params_filter, output_printer.get());
+        bool ok = test_backend(backend, mode, op_names_filter, params_filter, output_printer.get());

        if (ok) {
            n_ok++;
Author	SHA1	Message	Date
kallewoof	1a67fcc306	common : avoid logging partial messages (which can contain broken UTF-8 sequences) (#14937 ) * bug-fix: don't attempt to log partial parsed messages to avoid crash due to unfinished UTF-8 sequences	2025-07-29 17:05:38 +02:00
hipudding	204f2cf168	CANN: Add ggml_set_rows (#14943 )	2025-07-29 22:36:43 +08:00
Sigbjørn Skjæret	138b288b59	cuda : add softcap fusion (#14907 )	2025-07-29 14:22:03 +02:00
Johannes Gäßler	bbd0f91779	server-bench: make seed choice configurable (#14929 ) * server-bench: make seed choice configurable * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix error formatting * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-07-29 10:40:50 +02:00
Aman Gupta	0a5036bee9	CUDA: add roll (#14919 ) * CUDA: add roll * Make everything const, use __restrict__	2025-07-29 14:45:18 +08:00
lhez	8ad7b3e65b	opencl : add ops docs (#14910 )	2025-07-28 18:50:17 +02:00
Leonard Mosescu	bda62193b2	test-backend-ops : extend test case filtering (#14865 ) * Extend test case filtering 1. Allow passing multiple (comma-separated?) ops to test-backend-ops. This can be convenient when working on a set of ops, when you'd want to test them together (but without having to run every single op). For example: `test-backend-ops.exe test -o "ADD,RMS_NORM,ROPE,SILU,SOFT_MAX"` 2. Support full test-case variation string in addition to basic op names. This would make it easy to select a single variation, either for testing or for benchmarking. It can be particularly useful for profiling a particular variation (ex. a CUDA kernel), for example: `test-backend-ops.exe perf -b CUDA0 -o "MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=2)"` These two can be combined. As the current `-o`, this change doesn't try to detect/report an error if an filter doesn't name existing ops (ex. misspelled) * Updating the usage help text * Update tests/test-backend-ops.cpp	2025-07-28 18:04:27 +02:00