Sign intermediate messages for model visibility #31

Merged
code-server merged 2 commits from feat/message-visibility-signing into main 2026-03-09 18:08:37 +01:00
Collaborator

Summary

  • Intermediate assistant messages (with tool_calls) and tool results are never sent to the user but stay in the model's context, causing the model to reference content the user never saw
  • Adds _hidden_sig field to intermediate messages at creation time in context.py
  • Applies [HIDDEN:{sig}] prefix at read time in session.get_history() so the model sees which messages were hidden
  • Storing signature separately from content preserves Anthropic prompt caching — same prefixed string produced every turn

Files changed

  • nanobot/agent/visibility.py — new compute_signature() function (returns hex only)
  • nanobot/agent/context.pyadd_assistant_message() and add_tool_result() store _hidden_sig
  • nanobot/session/manager.pyget_history() applies [HIDDEN:sig] prefix at read time

Test plan

  • All 277 tests pass
  • Deploy to staging and verify model stops referencing unseen messages
  • Verify prompt cache hit rate is not degraded
## Summary - Intermediate assistant messages (with tool_calls) and tool results are never sent to the user but stay in the model's context, causing the model to reference content the user never saw - Adds `_hidden_sig` field to intermediate messages at creation time in `context.py` - Applies `[HIDDEN:{sig}]` prefix at read time in `session.get_history()` so the model sees which messages were hidden - Storing signature separately from content preserves Anthropic prompt caching — same prefixed string produced every turn ## Files changed - `nanobot/agent/visibility.py` — new `compute_signature()` function (returns hex only) - `nanobot/agent/context.py` — `add_assistant_message()` and `add_tool_result()` store `_hidden_sig` - `nanobot/session/manager.py` — `get_history()` applies `[HIDDEN:sig]` prefix at read time ## Test plan - [x] All 277 tests pass - [ ] Deploy to staging and verify model stops referencing unseen messages - [ ] Verify prompt cache hit rate is not degraded
code-server added 1 commit 2026-03-09 07:56:19 +01:00
feat: sign intermediate messages so model knows what user didn't see
Build Nanobot OAuth / build (pull_request) Successful in 5m56s
Build Nanobot OAuth / cleanup (pull_request) Has been skipped
32faed5c1c
Intermediate assistant messages (with tool_calls) and tool result messages
are never sent to the user but remain in the model's context. This causes
the model to refer to content the user never saw.

Add _hidden_sig field at message creation time (context.py), then apply
[HIDDEN:sig] prefix at read time (session get_history) so the model sees
which messages were hidden. Storing the signature separately from content
preserves Anthropic prompt caching — the same prefixed string is produced
every turn.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
code-server force-pushed feat/message-visibility-signing from 32faed5c1c to d90c3b4a24 2026-03-09 15:17:39 +01:00 Compare
Author
Collaborator

Code Review

Design

The split between write-time signature storage (_hidden_sig) and read-time prefix application (get_history()) is a good call for prompt caching — content bytes in the session file stay stable, and Anthropic's cache sees identical prefixes across turns.

No double-signing risk: suppress_output messages go through sign_content (no tool_calls → no _hidden_sig), while intermediate messages go through _hidden_sig (never through sign_content). The two paths don't overlap.

Issues to fix before merge

1. compute_signature duplicates sign_content internals

sign_content() (visibility.py:31-35) has its own inline HMAC computation identical to the new compute_signature(). Two copies of the same HMAC logic will drift. Refactor:

def sign_content(content: str) -> str:
    sig = compute_signature(content)
    return f"[HIDDEN:{sig}] {content}"

2. No tests for the actual changes in this PR

All 15 existing tests cover the pre-existing sign_content/suppress_output system. None test what this PR changes:

  • compute_signature() directly
  • _hidden_sig being added by add_tool_result() and add_assistant_message(tool_calls=...)
  • get_history() applying [HIDDEN:{sig}] prefix when _hidden_sig is present
  • Messages WITHOUT _hidden_sig NOT getting prefixed
  • Round-trip: write to session JSONL → reload → get_history() still produces correct prefix

Optional improvements

3. _hidden_sig not verified at read time

In get_history() (manager.py:73-75), sig from _hidden_sig is interpolated directly into the prefix without checking it matches the content. In contrast, the suppress_output path verifies via has_forged_marker(). Low priority since session files are local, but worth noting the asymmetry.

4. All tool results get _hidden_sig unconditionally

add_tool_result() always adds _hidden_sig. When a user says "read file X", the tool result is marked [HIDDEN] even though the user explicitly requested that action. If intentional, the system prompt visibility section should clarify: "[HIDDEN] means the raw result wasn't sent verbatim, not that the user is unaware of the action."

5. Minor: Tuple import (pre-existing)

visibility.py:7 uses from typing import Tuple — project convention is tuple lowercase (Python 3.11+). Not introduced by this PR but worth cleaning up while touching the file.

## Code Review ### Design The split between write-time signature storage (`_hidden_sig`) and read-time prefix application (`get_history()`) is a good call for prompt caching — content bytes in the session file stay stable, and Anthropic's cache sees identical prefixes across turns. No double-signing risk: suppress_output messages go through `sign_content` (no tool_calls → no `_hidden_sig`), while intermediate messages go through `_hidden_sig` (never through `sign_content`). The two paths don't overlap. ### Issues to fix before merge **1. `compute_signature` duplicates `sign_content` internals** `sign_content()` (visibility.py:31-35) has its own inline HMAC computation identical to the new `compute_signature()`. Two copies of the same HMAC logic will drift. Refactor: ```python def sign_content(content: str) -> str: sig = compute_signature(content) return f"[HIDDEN:{sig}] {content}" ``` **2. No tests for the actual changes in this PR** All 15 existing tests cover the pre-existing `sign_content`/suppress_output system. None test what this PR changes: - `compute_signature()` directly - `_hidden_sig` being added by `add_tool_result()` and `add_assistant_message(tool_calls=...)` - `get_history()` applying `[HIDDEN:{sig}]` prefix when `_hidden_sig` is present - Messages WITHOUT `_hidden_sig` NOT getting prefixed - Round-trip: write to session JSONL → reload → `get_history()` still produces correct prefix ### Optional improvements **3. `_hidden_sig` not verified at read time** In `get_history()` (manager.py:73-75), `sig` from `_hidden_sig` is interpolated directly into the prefix without checking it matches the content. In contrast, the suppress_output path verifies via `has_forged_marker()`. Low priority since session files are local, but worth noting the asymmetry. **4. All tool results get `_hidden_sig` unconditionally** `add_tool_result()` always adds `_hidden_sig`. When a user says "read file X", the tool result is marked `[HIDDEN]` even though the user explicitly requested that action. If intentional, the system prompt visibility section should clarify: "[HIDDEN] means the raw result wasn't sent verbatim, not that the user is unaware of the action." **5. Minor: `Tuple` import (pre-existing)** `visibility.py:7` uses `from typing import Tuple` — project convention is `tuple` lowercase (Python 3.11+). Not introduced by this PR but worth cleaning up while touching the file.
code-server added 1 commit 2026-03-09 15:24:02 +01:00
feat: sign intermediate messages so model knows what user didn't see
Build Nanobot OAuth / build (pull_request) Successful in 6m14s
Build Nanobot OAuth / cleanup (pull_request) Has been skipped
5569c99b8e
Intermediate assistant messages (with tool_calls) and tool result messages
are never sent to the user but remain in the model's context. This causes
the model to refer to content the user never saw.

Add _hidden_sig field at message creation time (context.py), then apply
[HIDDEN:sig] prefix at read time (session get_history) so the model sees
which messages were hidden. Storing the signature separately from content
preserves Anthropic prompt caching — the same prefixed string is produced
every turn.

Changes:
- visibility.py: add compute_signature(), refactor sign_content/verify to
  use it, fix Tuple -> tuple (PEP 585)
- context.py: add_assistant_message() and add_tool_result() store _hidden_sig
- session/manager.py: get_history() applies [HIDDEN:sig] prefix at read time
- tests/test_message_visibility.py: 14 tests covering compute_signature,
  _hidden_sig creation, get_history prefix, JSONL round-trip, idempotency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
code-server merged commit 35eb35cdc2 into main 2026-03-09 18:08:37 +01:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: wylab/nanobot#31