rpc : proper handling of data pointers to CPU buffers (#21030)

The compute graph may contain tensors pointing to CPU buffers. In these cases the buffer address is serialized as 0 and sent over the wire. However, the data pointer is serialized as-is, which prevents proper validation on the server side. This patch fixes the problem by serializing the data pointer as 0 for non-RPC buffers and validating it on the server side. closes: #21006
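The rule described in this commit message can be illustrated with a minimal sketch. The types `wire_tensor` and `local_tensor` and the functions `serialize` and `is_valid` are hypothetical, simplified stand-ins introduced only for illustration; they are not the actual `rpc_tensor` or `ggml_tensor` definitions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the tensor representation sent over the wire.
struct wire_tensor {
    uint64_t buffer; // remote buffer handle, 0 when the tensor is not in an RPC buffer
    uint64_t data;   // data pointer, 0 when the tensor is not in an RPC buffer
};

// Hypothetical, simplified stand-in for a client-side tensor.
struct local_tensor {
    bool     in_rpc_buffer; // true if the tensor lives in an RPC buffer
    uint64_t remote_ptr;    // handle of the buffer on the server
    uint64_t data;          // data pointer as seen by the client
};

// Client side: only tensors in RPC buffers carry a meaningful buffer handle
// and data pointer; everything else (e.g. CPU buffers) is sent as 0 / 0.
inline wire_tensor serialize(const local_tensor & t) {
    wire_tensor w{};
    if (t.in_rpc_buffer) {
        w.buffer = t.remote_ptr;
        w.data   = t.data;
    } else {
        w.buffer = 0;
        w.data   = 0;
    }
    return w;
}

// Server side: a tensor without a buffer must not carry a data pointer.
inline bool is_valid(const wire_tensor & w) {
    return !(w.buffer == 0 && w.data != 0);
}
```

With this rule a tensor from a CPU buffer arrives as `buffer == 0, data == 0` and passes validation, while a message with `buffer == 0` and a non-zero `data` can be rejected as malformed.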
completion : session_tokens insert range in completion tool (no-op → correct) (#20917)
2026-07-01 10:07:44 +02:00 · 2026-03-27 10:59:35 +02:00 · 2026-03-27 09:25:58 +01:00 · 2026-03-27 10:01:13 +02:00 · 2026-03-27 08:17:35 +01:00 · 2026-03-27 09:05:21 +02:00
19 changed files with 380 additions and 75 deletions
@@ -4,7 +4,7 @@

 # Define the CANN base image for easier version updates later
 ARG CHIP_TYPE=910b
-ARG CANN_BASE_IMAGE=quay.io/ascend/cann:8.3.rc2-${CHIP_TYPE}-openeuler24.03-py3.11
+ARG CANN_BASE_IMAGE=quay.io/ascend/cann:8.5.0-${CHIP_TYPE}-openeuler24.03-py3.11

 # ==============================================================================
 # BUILD STAGE
@@ -1,4 +1,4 @@
-ARG ASCEND_VERSION=8.1.RC1.alpha001-910b-openeuler22.03-py3.10
+ARG ASCEND_VERSION=8.5.0-910b-openeuler22.03-py3.10

 FROM ascendai/cann:$ASCEND_VERSION AS build

@@ -63,7 +63,7 @@ jobs:
      - name: Set container image
        id: cann-image
        run: |
-          image="ascendai/cann:${{ matrix.chip_type == '910b' &&  '8.3.rc2-910b-openeuler24.03-py3.11' || '8.3.rc2-310p-openeuler24.03-py3.11' }}"
+          image="ascendai/cann:${{ matrix.chip_type == '910b' &&  '8.5.0-910b-openeuler24.03-py3.11' || '8.5.0-310p-openeuler24.03-py3.11' }}"
          echo "image=${image}" >> "${GITHUB_OUTPUT}"

      - name: Pull container image
@@ -907,7 +907,7 @@ jobs:
      - name: Set container image
        id: cann-image
        run: |
-          image="ascendai/cann:${{ matrix.chip_type == '910b' &&  '8.3.rc2-910b-openeuler24.03-py3.11' || '8.3.rc2-310p-openeuler24.03-py3.11' }}"
+          image="ascendai/cann:${{ matrix.chip_type == '910b' &&  '8.5.0-910b-openeuler24.03-py3.11' || '8.5.0-310p-openeuler24.03-py3.11' }}"
          echo "image=${image}" >> "${GITHUB_OUTPUT}"

      - name: Pull container image
@@ -42,12 +42,22 @@ The llama.cpp CANN backend is designed to support Ascend NPU. It utilize the abi

 ### Ascend NPU

-**Verified devices**
+You can retrieve your Ascend device IDs using the following command:

-| Ascend NPU                    | Status  |
-|:-----------------------------:|:-------:|
-| Atlas 300T A2                 | Support |
-| Atlas 300I Duo                | Support |
+```sh
+lspci -n | grep -Eo '19e5:d[0-9a-f]{3}' | cut -d: -f2
+```
+
+**Devices**
+
+| Device Id | Product Series | Product Models | Chip Model | Verified Status |
+|:---------:|----------------|----------------|:----------:|:---------------:|
+|    d803   | Atlas A3 Train |                |    910C    |                 |
+|    d803   | Atlas A3 Infer |                |    910C    |                 |
+|    d802   | Atlas A2 Train |                |    910B    |                 |
+|    d802   | Atlas A2 Infer | Atlas 300I A2  |    910B    |     Support     |
+|    d801   | Atlas Train    |                |     910    |                 |
+|    d500   | Atlas Infer    | Atlas 300I Duo |    310P    |     Support     |

 *Notes:*

@@ -57,6 +67,9 @@ The llama.cpp CANN backend is designed to support Ascend NPU. It utilize the abi

 ## Model Supports

+<details>
+<summary>Text-only</summary>
+
 | Model Name                  | FP16  | Q4_0 | Q8_0 |
 |:----------------------------|:-----:|:----:|:----:|
 | Llama-2                     |   √   |   √  |   √  |
@@ -118,8 +131,11 @@ The llama.cpp CANN backend is designed to support Ascend NPU. It utilize the abi
 | Trillion-7B-preview         |   √   |   √  |   √  |
 | Ling models                 |   √   |   √  |   √  |

+</details>
+
+<details>
+<summary>Multimodal</summary>

-**Multimodal**
 | Model Name                  | FP16  | Q4_0 | Q8_0 |
 |:----------------------------|:-----:|:----:|:----:|
 | LLaVA 1.5 models, LLaVA 1.6 models      |   x   |   x  |   x  |
@@ -134,15 +150,22 @@ The llama.cpp CANN backend is designed to support Ascend NPU. It utilize the abi
 |  GLM-EDGE                   |   √   |   √  |   √  |
 |  Qwen2-VL                   |   √   |   √  |   √  |

+</details>
+


 ## DataType Supports

-| DataType               | Status  |
-|:----------------------:|:-------:|
-| FP16                   | Support |
-| Q8_0                   | Support |
-| Q4_0                   | Support |
+| DataType               | 910B    | 310P    |
+|:----------------------:|:-------:|:-------:|
+| FP16                   | Support | Support |
+| Q8_0                   | Support | Partial |
+| Q4_0                   | Support | Partial |
+| BF16                   | Support |         |
+
+> **310P note**
+> - `Q8_0`: data transform / buffer path is implemented, and `GET_ROWS` is supported, but quantized `MUL_MAT` / `MUL_MAT_ID` are not supported.
+> - `Q4_0`: data transform / buffer path is implemented, but quantized `MUL_MAT` / `MUL_MAT_ID` are not supported.

 ## Docker

@@ -160,7 +183,20 @@ npu-smi info

 # Select the cards that you want to use, make sure these cards are not used by someone.
 # Following using cards of device0.
-docker run --name llamacpp --device /dev/davinci0  --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info -v /PATH_TO_YOUR_MODELS/:/app/models -it llama-cpp-cann -m /app/models/MODEL_PATH -ngl 32 -p "Building a website can be done in 10 simple steps:"
+docker run --name llamacpp \
+  --device /dev/davinci0 \
+  --device /dev/davinci_manager \
+  --device /dev/devmm_svm \
+  --device /dev/hisi_hdc \
+  -v /usr/local/dcmi:/usr/local/dcmi \
+  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+  -v /PATH_TO_YOUR_MODELS/:/app/models \
+  -it llama-cpp-cann \
+  -m /app/models/MODEL_PATH \
+  -ngl 32 \
+  -p "Building a website can be done in 10 simple steps:"
 ```

 *Notes:*
@@ -171,69 +207,57 @@ docker run --name llamacpp --device /dev/davinci0  --device /dev/davinci_manager

 ### I. Setup Environment

-1. **Install Ascend Driver and firmware**
+1. **Configure Ascend user and group**

    ```sh
-    # create driver running user.
-    sudo groupadd -g HwHiAiUser
+    sudo groupadd HwHiAiUser
    sudo useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash
    sudo usermod -aG HwHiAiUser $USER
-
-    # download driver from https://www.hiascend.com/hardware/firmware-drivers/community according to your system
-    # and install driver.
-    sudo sh Ascend-hdk-910b-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all
    ```

-    Once installed, run `npu-smi info` to check whether driver is installed successfully.
+2. **Install dependencies**
+
+    **Ubuntu/Debian:**
    ```sh
-    +-------------------------------------------------------------------------------------------+
-    | npu-smi 24.1.rc2               Version: 24.1.rc2                                          |
-    +----------------------+---------------+----------------------------------------------------+
-    | NPU   Name           | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
-    | Chip                 | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
-    +======================+===============+====================================================+
-    | 2     xxx            | OK            | 64.4        51                15   / 15            |
-    | 0                    | 0000:01:00.0  | 0           1873 / 15077      0    / 32768         |
-    +======================+===============+====================================================+
-    | 5     xxx            | OK            | 64.0        52                15   / 15            |
-    | 0                    | 0000:81:00.0  | 0           1874 / 15077      0    / 32768         |
-    +======================+===============+====================================================+
-    | No running processes found in NPU 2                                                       |
-    +======================+===============+====================================================+
-    | No running processes found in NPU 5                                                       |
-    +======================+===============+====================================================+
+    sudo apt-get update
+    sudo apt-get install -y gcc python3 python3-pip linux-headers-$(uname -r)
    ```

-2. **Install Ascend Firmware**
+    **RHEL/CentOS:**
    ```sh
-    # download driver from https://www.hiascend.com/hardware/firmware-drivers/community according to your system
-    # and install driver.
-    sudo sh Ascend-hdk-910b-npu-firmware_x.x.x.x.X.run --full
+    sudo yum makecache
+    sudo yum install -y gcc python3 python3-pip kernel-headers-$(uname -r) kernel-devel-$(uname -r)
    ```
-    If the following message appears, firmware is installed successfully.
+
+3. **Install CANN (driver + toolkit)**
+
+    > The `Ascend-cann` package includes both the driver and toolkit.
+    > `$ARCH` can be `x86_64` or `aarch64`, `$CHIP` can be `910b` or `310p`.
+
    ```sh
-    Firmware package installed successfully!
+    wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.5.T63/Ascend-cann_8.5.0_linux-$ARCH.run
+    sudo bash ./Ascend-cann_8.5.0_linux-$ARCH.run --install
+
+    wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.5.T63/Ascend-cann-$CHIP-ops_8.5.0_linux-$ARCH.run
+    sudo bash ./Ascend-cann-$CHIP-ops_8.5.0_linux-$ARCH.run --install
    ```

+4. **Verify installation**

-3. **Install CANN toolkit and kernels**
-
-    CANN toolkit and kernels can be obtained from the official [CANN Toolkit](https://www.hiascend.com/zh/developer/download/community/result?module=cann) page.
-
-    Please download the corresponding version that satified your system. The minimum version required is 8.0.RC2.alpha002 and here is the install command.
    ```sh
-    pip3 install attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
-    sh Ascend-cann-toolkit_8.0.RC2.alpha002_linux-aarch64.run --install
-    sh Ascend-cann-kernels-910b_8.0.RC2.alpha002_linux.run --install
+    npu-smi info
    ```

-    Set Ascend Variables:
+    If device information is displayed correctly, the driver is functioning properly.
+
    ```sh
-    echo "source ~/Ascend/ascend-toolkit/set_env.sh" >> ~/.bashrc
-    source ~/.bashrc
+    # Set environment variables (adjust path if needed)
+    source /usr/local/Ascend/cann/set_env.sh
+
+    python3 -c "import acl; print(acl.get_soc_name())"
    ```

-Upon a successful installation, CANN is enabled for the available ascend devices.
+    If the command outputs the chip model, the installation was successful.

 ### II. Build llama.cpp

@@ -690,7 +690,7 @@ ggml_metal_device_t ggml_metal_device_init(int device) {
                    "    auto tB = B.slice((int)tgid.x, 0); \n"
                    " \n"
                    "    matmul2d< \n"
-                    "        matmul2d_descriptor(8, 8, dynamic_extent), \n"
+                    "        matmul2d_descriptor(16, 16, dynamic_extent), \n"
                    "        execution_simdgroups<4>> mm; \n"
                    " \n"
                    "    auto cT = mm.get_destination_cooperative_tensor<decltype(tA), decltype(tB), float>(); \n"
@@ -740,7 +740,7 @@ ggml_metal_device_t ggml_metal_device_init(int device) {
                    "    auto tB = B.slice((int)tgid.x, 0); \n"
                    " \n"
                    "    matmul2d< \n"
-                    "        matmul2d_descriptor(8, 8, dynamic_extent), \n"
+                    "        matmul2d_descriptor(16, 16, dynamic_extent), \n"
                    "        execution_simdgroups<4>> mm; \n"
                    " \n"
                    "    auto cT = mm.get_destination_cooperative_tensor<decltype(tA), decltype(tB), float>(); \n"
@@ -589,8 +589,10 @@ static rpc_tensor serialize_tensor(const ggml_tensor * tensor) {
        ggml_backend_buffer_t buffer = tensor->buffer;
        ggml_backend_rpc_buffer_context * ctx = (ggml_backend_rpc_buffer_context *)buffer->context;
        result.buffer = ctx != nullptr ? ctx->remote_ptr : 0;
+        result.data = reinterpret_cast<uint64_t>(tensor->data);
    } else {
        result.buffer = 0;
+        result.data   = 0;
    }
    for (uint32_t i = 0; i < GGML_MAX_DIMS; i++) {
        result.ne[i] = tensor->ne[i];
@@ -606,7 +608,6 @@ static rpc_tensor serialize_tensor(const ggml_tensor * tensor) {
    }
    result.view_src = reinterpret_cast<uint64_t>(tensor->view_src);
    result.view_offs = tensor->view_offs;
-    result.data = reinterpret_cast<uint64_t>(tensor->data);

    // Avoid sending uninitialized data over the wire
    memset(result.name, 0, sizeof(result.name));
@@ -1443,9 +1444,11 @@ ggml_tensor * rpc_server::create_node(uint64_t id,
    const rpc_tensor * tensor = it_ptr->second;

    struct ggml_tensor * result = deserialize_tensor(ctx, tensor);
-    if (result == nullptr || result->buffer == nullptr) {
-        GGML_LOG_ERROR("[%s] invalid tensor: null %s (id=%" PRIu64 ")\n",
-                       __func__, result == nullptr ? "tensor" : "buffer", id);
+    if (result == nullptr) {
+        return nullptr;
+    }
+    if (result->buffer == nullptr && result->data != nullptr) {
+        GGML_LOG_ERROR("[%s] invalid data ptr", __func__);
        return nullptr;
    }
    tensor_map[id] = result;
@@ -146,13 +146,19 @@ int main(int argc, char ** argv) {

    ctx   = llama_init->context();
    model = llama_init->model();
-    smpl  = llama_init->sampler(0);

    if (ctx == NULL) {
        LOG_ERR("%s: error: unable to create context\n", __func__);
        return 1;
    }

+    if (model == NULL) {
+        LOG_ERR("%s: error: unable to load model\n", __func__);
+        return 1;
+    }
+
+    smpl = llama_init->sampler(0);
+
    llama_memory_t mem = llama_get_memory(ctx);
    const llama_vocab * vocab = llama_model_get_vocab(model);

@@ -695,7 +701,7 @@ int main(int argc, char ** argv) {
                if (!common_prompt_batch_decode(ctx, embd, n_past, params.n_batch, path_session, save_now)) {
                    return 1;
                }
-                session_tokens.insert(session_tokens.end(), embd.begin(), embd.begin());
+                session_tokens.insert(session_tokens.end(), embd.begin(), embd.end());
                n_session_consumed = session_tokens.size();
                session_do_save = false;

@@ -296,6 +296,11 @@
 					label: 'Disable reasoning content parsing',
 					type: SettingsFieldType.CHECKBOX
 				},
+				{
+					key: SETTINGS_KEYS.EXCLUDE_REASONING_FROM_CONTEXT,
+					label: 'Exclude reasoning from context',
+					type: SettingsFieldType.CHECKBOX
+				},
 				{
 					key: SETTINGS_KEYS.SHOW_RAW_OUTPUT_SWITCH,
 					label: 'Enable raw output toggle',
@@ -50,6 +50,8 @@ export const AGENTIC_REGEX = {
 	PARTIAL_MARKER: /<<<[A-Za-z_]*$/,
 	// Matches reasoning content blocks (including tags)
 	REASONING_BLOCK: /<<<reasoning_content_start>>>[\s\S]*?<<<reasoning_content_end>>>/g,
+	// Captures the reasoning text between start/end tags
+	REASONING_EXTRACT: /<<<reasoning_content_start>>>([\s\S]*?)<<<reasoning_content_end>>>/,
 	// Matches an opening reasoning tag and any remaining content (unterminated)
 	REASONING_OPEN: /<<<reasoning_content_start>>>[\s\S]*$/,
 	// Matches a complete agentic tool call display block (start to end marker)
@@ -10,6 +10,7 @@ export const SETTING_CONFIG_DEFAULT: Record<string, string | number | boolean |
 	theme: ColorMode.SYSTEM,
 	showThoughtInProgress: false,
 	disableReasoningParsing: false,
+	excludeReasoningFromContext: false,
 	showRawOutputSwitch: false,
 	keepStatsVisible: false,
 	showMessageStats: true,
@@ -106,6 +107,8 @@ export const SETTING_CONFIG_INFO: Record<string, string> = {
 	showThoughtInProgress: 'Expand thought process by default when generating messages.',
 	disableReasoningParsing:
 		'Send reasoning_format=none to prevent server-side extraction of reasoning tokens into separate field',
+	excludeReasoningFromContext:
+		'Strip reasoning content from previous messages before sending to the model. When unchecked, reasoning is sent back via the reasoning_content field so the model can see its own chain-of-thought across turns.',
 	showRawOutputSwitch:
 		'Show toggle button to display messages as plain text instead of Markdown-formatted content',
 	keepStatsVisible: 'Keep processing statistics visible after generation finishes.',
@@ -54,6 +54,7 @@ export const SETTINGS_KEYS = {
 	SHOW_TOOL_CALL_IN_PROGRESS: 'showToolCallInProgress',
 	// Developer
 	DISABLE_REASONING_PARSING: 'disableReasoningParsing',
+	EXCLUDE_REASONING_FROM_CONTEXT: 'excludeReasoningFromContext',
 	SHOW_RAW_OUTPUT_SWITCH: 'showRawOutputSwitch',
 	CUSTOM: 'custom'
 } as const;
@@ -57,6 +57,46 @@ export class ChatService {
 	 *
 	 */

+	/**
+	 * Extracts reasoning text from content that contains internal reasoning tags.
+	 * Returns the concatenated reasoning content or undefined if none found.
+	 */
+	private static extractReasoningFromContent(
+		content: ApiChatMessageData['content'] | null | undefined
+	): string | undefined {
+		if (!content) return undefined;
+
+		const extractFromString = (text: string): string => {
+			const parts: string[] = [];
+			// Use a fresh regex instance to avoid shared lastIndex state
+			const re = new RegExp(AGENTIC_REGEX.REASONING_EXTRACT.source);
+			let match = re.exec(text);
+			while (match) {
+				parts.push(match[1]);
+				// advance past the matched portion and retry
+				text = text.slice(match.index + match[0].length);
+				match = re.exec(text);
+			}
+			return parts.join('');
+		};
+
+		if (typeof content === 'string') {
+			const result = extractFromString(content);
+			return result || undefined;
+		}
+
+		if (!Array.isArray(content)) return undefined;
+
+		const parts: string[] = [];
+		for (const part of content) {
+			if (part.type === ContentPartType.TEXT && part.text) {
+				const result = extractFromString(part.text);
+				if (result) parts.push(result);
+			}
+		}
+		return parts.length > 0 ? parts.join('') : undefined;
+	}
+
 	/**
 	 * Sends a chat completion request to the llama.cpp server.
 	 * Supports both streaming and non-streaming responses with comprehensive parameter configuration.
@@ -111,7 +151,8 @@ export class ChatService {
 			custom,
 			timings_per_token,
 			// Config options
-			disableReasoningParsing
+			disableReasoningParsing,
+			excludeReasoningFromContext
 		} = options;

 		const normalizedMessages: ApiChatMessageData[] = messages
@@ -159,14 +200,24 @@ export class ChatService {
 		}

 		const requestBody: ApiChatCompletionRequest = {
-			messages: normalizedMessages.map((msg: ApiChatMessageData) => ({
-				role: msg.role,
-				// Strip reasoning tags/content from the prompt to avoid polluting KV cache.
-				// TODO: investigate backend expectations for reasoning tags and add a toggle if needed.
-				content: ChatService.stripReasoningContent(msg.content),
-				tool_calls: msg.tool_calls,
-				tool_call_id: msg.tool_call_id
-			})),
+			messages: normalizedMessages.map((msg: ApiChatMessageData) => {
+				// Always strip internal reasoning/agentic tags from content
+				const cleanedContent = ChatService.stripReasoningContent(msg.content);
+				const mapped: ApiChatCompletionRequest['messages'][0] = {
+					role: msg.role,
+					content: cleanedContent,
+					tool_calls: msg.tool_calls,
+					tool_call_id: msg.tool_call_id
+				};
+				// When preserving reasoning, extract it from raw content and send as separate field
+				if (!excludeReasoningFromContext) {
+					const reasoning = ChatService.extractReasoningFromContent(msg.content);
+					if (reasoning) {
+						mapped.reasoning_content = reasoning;
+					}
+				}
+				return mapped;
+			}),
 			stream,
 			return_progress: stream ? true : undefined,
 			tools: tools && tools.length > 0 ? tools : undefined
@@ -227,6 +227,12 @@ export const SYNCABLE_PARAMETERS: SyncableParameter[] = [
 		serverKey: 'alwaysShowAgenticTurns',
 		type: SyncableParameterType.BOOLEAN,
 		canSync: true
+	},
+	{
+		key: 'excludeReasoningFromContext',
+		serverKey: 'excludeReasoningFromContext',
+		type: SyncableParameterType.BOOLEAN,
+		canSync: true
 	}
 ];

@@ -1479,6 +1479,8 @@ class ChatStore {

 		if (currentConfig.disableReasoningParsing) apiOptions.disableReasoningParsing = true;

+		if (currentConfig.excludeReasoningFromContext) apiOptions.excludeReasoningFromContext = true;
+
 		if (hasValue(currentConfig.temperature))
 			apiOptions.temperature = Number(currentConfig.temperature);

@@ -45,6 +45,7 @@ export interface ApiErrorResponse {
 export interface ApiChatMessageData {
 	role: ChatRole;
 	content: string | ApiChatMessageContentPart[];
+	reasoning_content?: string;
 	tool_calls?: ApiChatCompletionToolCall[];
 	tool_call_id?: string;
 	timestamp?: number;
@@ -201,6 +202,9 @@ export interface ApiChatCompletionRequest {
 	messages: Array<{
 		role: ChatRole;
 		content: string | ApiChatMessageContentPart[];
+		reasoning_content?: string;
+		tool_calls?: ApiChatCompletionToolCall[];
+		tool_call_id?: string;
 	}>;
 	stream?: boolean;
 	model?: string;
@@ -24,6 +24,8 @@ export interface SettingsChatServiceOptions {
 	systemMessage?: string;
 	// Disable reasoning parsing (use 'none' instead of 'auto')
 	disableReasoningParsing?: boolean;
+	// Strip reasoning content from context before sending
+	excludeReasoningFromContext?: boolean;
 	tools?: OpenAIToolDefinition[];
 	// Generation parameters
 	temperature?: number;
@@ -0,0 +1,196 @@
+import { describe, it, expect } from 'vitest';
+import { AGENTIC_REGEX, REASONING_TAGS } from '$lib/constants/agentic';
+import { ContentPartType } from '$lib/enums';
+
+// Replicate ChatService.extractReasoningFromContent (private static)
+function extractReasoningFromContent(
+	content: string | Array<{ type: string; text?: string }> | null | undefined
+): string | undefined {
+	if (!content) return undefined;
+
+	const extractFromString = (text: string): string => {
+		const parts: string[] = [];
+		const re = new RegExp(AGENTIC_REGEX.REASONING_EXTRACT.source);
+		let match = re.exec(text);
+		while (match) {
+			parts.push(match[1]);
+			text = text.slice(match.index + match[0].length);
+			match = re.exec(text);
+		}
+		return parts.join('');
+	};
+
+	if (typeof content === 'string') {
+		const result = extractFromString(content);
+		return result || undefined;
+	}
+
+	if (!Array.isArray(content)) return undefined;
+
+	const parts: string[] = [];
+	for (const part of content) {
+		if (part.type === ContentPartType.TEXT && part.text) {
+			const result = extractFromString(part.text);
+			if (result) parts.push(result);
+		}
+	}
+	return parts.length > 0 ? parts.join('') : undefined;
+}
+
+// Replicate ChatService.stripReasoningContent (private static)
+function stripReasoningContent(
+	content: string | Array<{ type: string; text?: string }> | null | undefined
+): typeof content {
+	if (!content) return content;
+
+	if (typeof content === 'string') {
+		return content
+			.replace(AGENTIC_REGEX.REASONING_BLOCK, '')
+			.replace(AGENTIC_REGEX.REASONING_OPEN, '')
+			.replace(AGENTIC_REGEX.AGENTIC_TOOL_CALL_BLOCK, '')
+			.replace(AGENTIC_REGEX.AGENTIC_TOOL_CALL_OPEN, '');
+	}
+
+	if (!Array.isArray(content)) return content;
+
+	return content.map((part) => {
+		if (part.type !== ContentPartType.TEXT || !part.text) return part;
+		return {
+			...part,
+			text: part.text
+				.replace(AGENTIC_REGEX.REASONING_BLOCK, '')
+				.replace(AGENTIC_REGEX.REASONING_OPEN, '')
+				.replace(AGENTIC_REGEX.AGENTIC_TOOL_CALL_BLOCK, '')
+				.replace(AGENTIC_REGEX.AGENTIC_TOOL_CALL_OPEN, '')
+		};
+	});
+}
+
+// Simulate the message mapping logic from ChatService.sendMessage
+function buildApiMessage(
+	content: string,
+	excludeReasoningFromContext: boolean
+): { role: string; content: string; reasoning_content?: string } {
+	const cleaned = stripReasoningContent(content) as string;
+	const mapped: { role: string; content: string; reasoning_content?: string } = {
+		role: 'assistant',
+		content: cleaned
+	};
+	if (!excludeReasoningFromContext) {
+		const reasoning = extractReasoningFromContent(content);
+		if (reasoning) {
+			mapped.reasoning_content = reasoning;
+		}
+	}
+	return mapped;
+}
+
+// Helper: wrap reasoning the same way the chat store does during streaming
+function wrapReasoning(reasoning: string, content: string): string {
+	return `${REASONING_TAGS.START}${reasoning}${REASONING_TAGS.END}${content}`;
+}
+
+describe('reasoning content extraction', () => {
+	it('extracts reasoning from tagged string content', () => {
+		const input = wrapReasoning('step 1, step 2', 'The answer is 42.');
+		const result = extractReasoningFromContent(input);
+		expect(result).toBe('step 1, step 2');
+	});
+
+	it('returns undefined when no reasoning tags present', () => {
+		expect(extractReasoningFromContent('Just a normal response.')).toBeUndefined();
+	});
+
+	it('returns undefined for null/empty input', () => {
+		expect(extractReasoningFromContent(null)).toBeUndefined();
+		expect(extractReasoningFromContent(undefined)).toBeUndefined();
+		expect(extractReasoningFromContent('')).toBeUndefined();
+	});
+
+	it('extracts reasoning from content part arrays', () => {
+		const input = [
+			{
+				type: ContentPartType.TEXT,
+				text: wrapReasoning('thinking hard', 'result')
+			}
+		];
+		expect(extractReasoningFromContent(input)).toBe('thinking hard');
+	});
+
+	it('handles multiple reasoning blocks', () => {
+		const input =
+			REASONING_TAGS.START +
+			'block1' +
+			REASONING_TAGS.END +
+			'middle' +
+			REASONING_TAGS.START +
+			'block2' +
+			REASONING_TAGS.END +
+			'end';
+		expect(extractReasoningFromContent(input)).toBe('block1block2');
+	});
+
+	it('ignores non-text content parts', () => {
+		const input = [{ type: 'image_url', text: wrapReasoning('hidden', 'img') }];
+		expect(extractReasoningFromContent(input)).toBeUndefined();
+	});
+});
+
+describe('strip reasoning content', () => {
+	it('removes reasoning tags from string content', () => {
+		const input = wrapReasoning('internal thoughts', 'visible answer');
+		expect(stripReasoningContent(input)).toBe('visible answer');
+	});
+
+	it('removes reasoning from content part arrays', () => {
+		const input = [
+			{
+				type: ContentPartType.TEXT,
+				text: wrapReasoning('thoughts', 'answer')
+			}
+		];
+		const result = stripReasoningContent(input) as Array<{ type: string; text?: string }>;
+		expect(result[0].text).toBe('answer');
+	});
+});
+
+describe('API message building with reasoning preservation', () => {
+	const storedContent = wrapReasoning('Let me think: 2+2=4, basic arithmetic.', 'The answer is 4.');
+
+	it('preserves reasoning_content when excludeReasoningFromContext is false', () => {
+		const msg = buildApiMessage(storedContent, false);
+		expect(msg.content).toBe('The answer is 4.');
+		expect(msg.reasoning_content).toBe('Let me think: 2+2=4, basic arithmetic.');
+		// no internal tags leak into either field
+		expect(msg.content).not.toContain('<<<');
+		expect(msg.reasoning_content).not.toContain('<<<');
+	});
+
+	it('strips reasoning_content when excludeReasoningFromContext is true', () => {
+		const msg = buildApiMessage(storedContent, true);
+		expect(msg.content).toBe('The answer is 4.');
+		expect(msg.reasoning_content).toBeUndefined();
+	});
+
+	it('handles content with no reasoning in both modes', () => {
+		const plain = 'No reasoning here.';
+		const msgPreserve = buildApiMessage(plain, false);
+		const msgExclude = buildApiMessage(plain, true);
+		expect(msgPreserve.content).toBe(plain);
+		expect(msgPreserve.reasoning_content).toBeUndefined();
+		expect(msgExclude.content).toBe(plain);
+		expect(msgExclude.reasoning_content).toBeUndefined();
+	});
+
+	it('cleans agentic tool call blocks from content even when preserving reasoning', () => {
+		const input =
+			wrapReasoning('plan', 'text') +
+			'\n\n<<<AGENTIC_TOOL_CALL_START>>>\n' +
+			'<<<TOOL_NAME:bash>>>\n' +
+			'<<<TOOL_ARGS_START>>>\n{}\n<<<TOOL_ARGS_END>>>\nout\n' +
+			'<<<AGENTIC_TOOL_CALL_END>>>\n';
+		const msg = buildApiMessage(input, false);
+		expect(msg.content).not.toContain('<<<');
+		expect(msg.reasoning_content).toBe('plan');
+	});
+});
Author	SHA1	Message	Date
Radoslav Gerganov	ba38f3becc	rpc : proper handling of data pointers to CPU buffers (#21030) The compute graph may contain tensors pointing to CPU buffers. In these cases the buffer address is serialized as 0 and sent over the wire. However, the data pointer is serialized as-is, which prevents proper validation on the server side. This patch fixes the problem by serializing the data pointer as 0 for non-RPC buffers and validating it on the server side. closes: #21006	2026-03-27 10:59:35 +02:00
mtmcp	37f230dd7c	completion : session_tokens insert range in completion tool (no-op → correct) (#20917) The embd.begin(), embd.begin() range is empty and inserts nothing, so session_tokens never gets updated after decoding. Should be embd.begin(), embd.end(). Introduced in commit `2b6dfe8`.	2026-03-27 09:25:58 +01:00
mtmcp	a308e584ca	completion : Fix segfault on model load failure (#21049)	2026-03-27 10:01:13 +02:00
Pascal	d0fa2c9fbb	Send reasoning content back to the model across turns via the reasoning_content API field (#21036) * webui: send reasoning_content back to model in context Preserve assistant reasoning across turns by extracting it from internal tags and sending it as a separate reasoning_content field in the API payload. The server and Jinja templates handle native formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...). Adds "Exclude reasoning from context" toggle in Settings > Developer (off by default, so reasoning is preserved). Includes unit tests. * webui: add syncable parameter for excludeReasoningFromContext * chore: update webui build output	2026-03-27 08:17:35 +01:00
ren	9bcb4eff4d	metal : Fix dimension constraint violation in matmul2d descriptor (#21048) Updates Metal tensor API test probe to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).	2026-03-27 09:05:21 +02:00
KokerZhou	6861f6509a	CANN: update docker images to 8.5.0 and improve CANN.md (#20801) * cann: update docker images to 8.5.0 - bump CANN base image from 8.3.rc2 to 8.5.0 - bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0 Move to newer stable releases. * cann: update CANN.md * Update CANN.md to include BF16 support Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions. * Fix formatting issues in CANN.md Fix 234: Trailing whitespace	2026-03-27 08:53:00 +08:00
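
The `session_tokens` fix in commit 37f230dd7c above comes down to how `std::vector::insert` treats an iterator range. The following is a small standalone sketch of that behavior; the helper names `append_empty_range` and `append_full_range` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <vector>

// Mirrors the buggy call: [begin, begin) is an empty range, so nothing is inserted.
inline void append_empty_range(std::vector<int> & dst, const std::vector<int> & src) {
    dst.insert(dst.end(), src.begin(), src.begin());
}

// Mirrors the corrected call: [begin, end) covers every element of src.
inline void append_full_range(std::vector<int> & dst, const std::vector<int> & src) {
    dst.insert(dst.end(), src.begin(), src.end());
}
```

Both calls compile and run without error, which is likely why the empty-range version went unnoticed: it leaves the destination unchanged instead of failing.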