feat: research-manager session 2026-05-05, staging dirs initialized

2026-05-05 22:56:43 +02:00
parent ad3e1b526a
commit 8bfaf7d31c
11 changed files with 509 additions and 0 deletions
@@ -0,0 +1,39 @@
+observations:
+
+  - id: O01
+    timestamp: "2026-05-05T22:54"
+    provenance: ai-suggested
+    content: >
+      ARA's structured layer separation (logic / src / trace / evidence / staging), combined with
+      Seal L1/L2 validation, enables machine-auditable rigor that unstructured HISTORY.md + KNOWLEDGE.md
+      cannot provide. The nanobot ARA (145+ checks) and the 22-file traefik ARA both passed Seal L1
+      on the first run, which suggests the format is viable for operational agent projects, not only
+      academic papers.
+    context: >
+      Session 2026-05-05: ARA protocol adopted for WyLab, both nanobot and traefik-infrastructure
+      ARAs compiled and Seal L1 validated in the same session. Observation arose from the compiler
+      run results and the decision to adopt ARA system-wide.
+    potential_type: claim
+    bound_to: [N20, N24, N25]
+    promoted: false
+    promoted_to: null
+    crystallized_via: null
+    stale: false
+
+  - id: O02
+    timestamp: "2026-05-05T22:54"
+    provenance: user
+    content: >
+      SMB guest access (no credentials, //192.168.1.50/ara) is the viable method for nanobot to
+      mount Unraid network storage when SSH pubkey is not provisioned and NFS is not enabled.
+      Guest SMB does not require any credential management and works immediately once the share
+      is created in the Unraid UI.
+    context: >
+      Session 2026-05-05: SSH (.50, pubkey required) and NFS (not enabled) both failed as mount
+      options. SMB guest access succeeded and was used to mount the ara share.
+    potential_type: constraint
+    bound_to: [N22, N23]
+    promoted: false
+    promoted_to: null
+    crystallized_via: null
+    stale: false
@@ -182,3 +182,109 @@ tree:
            hypothesis: "Adding 1.1.1.1 as the runner's DNS server, or switching to host network mode, would resolve git.wylab.me from within CI/CD runner containers."
            failure_mode: "Adding 1.1.1.1 as DNS did not work (runner containers still couldn't resolve internal Gitea domain). Host network mode partially worked (1/6 jobs succeeded) but was not reproducible. Root cause was not DNS at all — it was .NET build cache corruption from the macOS ARM64 OrbStack runner sharing a cache with the x64 Linux runner. Architecture-incompatible cached binaries caused cryptic build failures that looked like DNS or network errors."
            lesson: "Mixed-architecture CI/CD runners must use separate, isolated build caches. Architecture-specific cache keys prevent cross-contamination. The DNS red herring wasted multiple days of debugging — always verify the failure mode before trying infrastructure fixes."
+
+  - id: N20
+    type: decision
+    provenance: user
+    timestamp: "2026-05-05T22:54"
+    title: "Adopt ARA (Agent-Native Research Artifact) format for all WyLab projects"
+    choice: >
+      Discovered the ARA protocol from Orchestra-Research and decided to adopt it as the standard
+      structured artifact format for all WyLab projects. ARA enforces progressive crystallization,
+      provenance tracking, and machine-readable layer separation (logic / src / trace / evidence /
+      staging), enabling rigor auditing and structured compaction.
+    alternatives:
+      - "Continue with unstructured HISTORY.md + KNOWLEDGE.md only (rejected — no provenance, no structured claims layer)"
+      - "Custom internal documentation format (rejected — ARA already exists and has compiler + rigor tooling)"
+    evidence: ["Discovery of Orchestra-Research ARA protocol", "Three ARA skills available: ara-compiler, ara-research-manager, ara-rigor-reviewer"]
+    status: resolved
+    children:
+
+      - id: N21
+        type: decision
+        provenance: user
+        timestamp: "2026-05-05T22:54"
+        title: "Create ARA repo at git.wylab.me/nanobot/ara and install three ARA skills"
+        choice: >
+          ARA repository initialized at git.wylab.me/nanobot/ara. Three ARA skills installed into
+          nanobot workspace: ara-compiler (Seal L1/L2 validation + compilation), ara-research-manager
+          (per-turn progressive crystallization epilogue), ara-rigor-reviewer (L2 structural review).
+          research-manager wired into KNOWLEDGE.md compaction protocol as mandatory pre-compaction step.
+        alternatives:
+          - "Store ARA artifacts locally only without a dedicated repo (rejected — no versioning or sharing)"
+        evidence: ["N20"]
+        status: resolved
+
+      - id: N22
+        type: dead_end
+        provenance: user
+        timestamp: "2026-05-05T22:54"
+        title: "Unraid LAN IP was .78 (wrong) — SSH pubkey required — NFS not enabled"
+        hypothesis: >
+          Unraid server reachable at 192.168.1.78; SSH accessible with password; NFS available for
+          mounting the ara share.
+        failure_mode: >
+          Unraid LAN IP is 192.168.1.50, not .78 (KNOWLEDGE.md was stale). SSH login requires
+          pubkey authentication — password auth not accepted. NFS share not enabled on Unraid.
+          All three assumptions were wrong simultaneously; prior KNOWLEDGE.md entry for .78 must
+          be corrected.
+        lesson: >
+          Always verify Unraid IP from a live source before scripting mounts. SSH pubkey must be
+          provisioned before any automated SSH-based tasks can run against Unraid. NFS requires
+          explicit enablement in Unraid UI; do not assume it is on. SMB guest access is the
+          available path for unauthenticated mounts.
+        status: resolved
+
+      - id: N23
+        type: decision
+        provenance: user
+        timestamp: "2026-05-05T22:54"
+        title: "Mount Unraid ara share via SMB guest access at //192.168.1.50/ara"
+        choice: >
+          Created ara SMB share on Unraid and mounted it at //192.168.1.50/ara using SMB guest
+          access (no credentials). This is the operative method for nanobot to read/write compiled
+          ARA artifacts to network storage after SSH and NFS were ruled out.
+        alternatives:
+          - "SSH-based file transfer (ruled out — pubkey not provisioned)"
+          - "NFS mount (ruled out — NFS not enabled on Unraid)"
+          - "Manual file copy (rejected — not automatable)"
+        evidence: ["N22"]
+        status: resolved
+
+      - id: N24
+        type: experiment
+        provenance: ai-executed
+        timestamp: "2026-05-05T22:54"
+        title: "Compile nanobot ARA — 30 files, 145+ Seal L1 checks pass"
+        result: >
+          ara-compiler ran against the nanobot ARA. Output: 30 files compiled, 145+ Seal L1
+          structural/provenance checks passed. No L1 failures. Artifact validated as structurally
+          conformant to ARA spec.
+        evidence: ["ara-compiler Seal L1 output, 2026-05-05"]
+        status: resolved
+
+      - id: N25
+        type: experiment
+        provenance: ai-executed
+        timestamp: "2026-05-05T22:54"
+        title: "Compile traefik-infrastructure ARA — 22 files, Seal L1 validated"
+        result: >
+          ara-compiler ran against the traefik-infrastructure ARA. Output: 22 files compiled,
+          Seal L1 validation passed. Second WyLab ARA successfully onboarded to the format.
+        evidence: ["ara-compiler Seal L1 output, 2026-05-05"]
+        status: resolved
+
+      - id: N26
+        type: decision
+        provenance: user
+        timestamp: "2026-05-05T22:54"
+        title: "Wire ara-research-manager into KNOWLEDGE.md compaction protocol"
+        choice: >
+          Added ara-research-manager as a mandatory step in KNOWLEDGE.md's compaction protocol.
+          Before any compaction run, the research manager epilogue must be executed to ensure all
+          staged observations and trace events are committed to the ARA. Prevents knowledge loss
+          at compaction boundaries.
+        alternatives:
+          - "Run research-manager ad hoc only when remembered (rejected — prone to gaps at compaction)"
+        evidence: ["N20", "N21"]
+        status: resolved
@@ -0,0 +1,13 @@
+entries:
+  - turn: "2026-05-05_001#1"
+    notes:
+      - "Routed N20 (ARA adoption) as decision/direct — user explicitly chose ARA over alternatives; clear journey fact."
+      - "Routed N22 as dead_end/direct rather than three separate dead_ends — all three failures (wrong IP, SSH pubkey, NFS) are causally linked and discovered in the same investigative thread; bundling avoids fragmentation."
+      - "Routed N23 (SMB guest mount) as decision/direct — user chose this after N22 eliminated alternatives; has clear evidence binding."
+      - "Routed N24/N25 as experiments/direct — compiler runs produced quantitative results (file counts, check counts); these are empirical facts, not interpretations."
+      - "Staged O01 as potential_type: claim (not direct) — 'ARA enables machine-auditable rigor' is an interpretive assertion about format capability, not a journey fact. Needs at least one session of use before it qualifies for any closure signal."
+      - "Staged O02 as potential_type: constraint (not direct) — SMB-as-workaround is a boundary condition about what works given absent SSH pubkey and NFS. User stated it as fact (provenance: user) but it hasn't yet been tested under load or across reboots."
+      - "Did NOT crystallize O01 or O02 this turn — no closure signal present. Verbal-affirmation would require explicit 'yes, that's confirmed' from user; topic-abandonment requires 5 turns idle; artifact-commitment requires a downstream entry citing them."
+      - "Noted KNOWLEDGE.md IP correction (.78→.50) as open thread — not logging a new node for this since it's a metadata correction, not a new research event. The dead_end N22 captures the lesson."
+      - "No prior staged observations existed (staging/observations.yaml was empty) — no maturity tracking needed this turn."
+      - "exploration_tree.yaml had no existing N20+ nodes — assigned N20–N26 sequentially. No ID conflicts."
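The three closure signals named in the notes above (verbal-affirmation, topic-abandonment after 5 idle turns, artifact-commitment) can be sketched as a small check. This is only an illustration: the field names `user_affirmed`, `idle_turns`, and `cited_by` are hypothetical and do not appear in staging/observations.yaml, and the order in which signals are tested is an assumption, not something the log specifies.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Threshold quoted in the reasoning log: topic-abandonment requires 5 turns idle.
IDLE_TURNS_FOR_ABANDONMENT = 5


@dataclass
class StagedObservation:
    # Hypothetical shape for illustration; not the real observations.yaml schema.
    id: str
    user_affirmed: bool = False          # explicit "yes, that's confirmed" from the user
    idle_turns: int = 0                  # turns since the topic was last touched
    cited_by: List[str] = field(default_factory=list)  # downstream entries citing it


def closure_signal(obs: StagedObservation) -> Optional[str]:
    """Return the first closure signal that applies, or None if the observation stays staged."""
    if obs.user_affirmed:
        return "verbal-affirmation"
    if obs.cited_by:
        return "artifact-commitment"
    if obs.idle_turns >= IDLE_TURNS_FOR_ABANDONMENT:
        return "topic-abandonment"
    return None
```

Under these assumptions, O01 and O02 as staged in this turn (no affirmation, no citations, zero idle turns) both yield `None`, which matches the decision not to crystallize them.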
@@ -0,0 +1,125 @@
+session:
+  id: "2026-05-05_001"
+  date: "2026-05-05"
+  started: "2026-05-05T22:54"
+  last_turn: "2026-05-05T22:54"
+  turn_count: 1
+  summary: "ARA protocol adopted for WyLab; nanobot + traefik-infrastructure ARAs compiled and Seal L1 validated; Unraid ara SMB share mounted; research-manager wired into compaction protocol; Unraid IP corrected from .78 to .50."
+
+events_logged:
+  - turn: 1
+    type: decision
+    id: "N20"
+    routing: direct
+    provenance: user
+    summary: "Adopt ARA format for all WyLab projects; discovered Orchestra-Research ARA protocol"
+
+  - turn: 1
+    type: decision
+    id: "N21"
+    routing: direct
+    provenance: user
+    summary: "ARA repo created at git.wylab.me/nanobot/ara; three ARA skills installed (ara-compiler, ara-research-manager, ara-rigor-reviewer)"
+
+  - turn: 1
+    type: dead_end
+    id: "N22"
+    routing: direct
+    provenance: user
+    summary: "Unraid IP was .78 (stale) — correct is .50; SSH requires pubkey; NFS not enabled — all three access assumptions wrong"
+
+  - turn: 1
+    type: decision
+    id: "N23"
+    routing: direct
+    provenance: user
+    summary: "Mount Unraid ara share via SMB guest access at //192.168.1.50/ara"
+
+  - turn: 1
+    type: experiment
+    id: "N24"
+    routing: direct
+    provenance: ai-executed
+    summary: "nanobot ARA compiled: 30 files, 145+ Seal L1 checks pass"
+
+  - turn: 1
+    type: experiment
+    id: "N25"
+    routing: direct
+    provenance: ai-executed
+    summary: "traefik-infrastructure ARA compiled: 22 files, Seal L1 validated"
+
+  - turn: 1
+    type: decision
+    id: "N26"
+    routing: direct
+    provenance: user
+    summary: "research-manager wired into KNOWLEDGE.md compaction protocol as mandatory pre-compaction step"
+
+  - turn: 1
+    type: observation
+    id: "O01"
+    routing: staged
+    provenance: ai-suggested
+    summary: "ARA structured layers + Seal L1 validation enables machine-auditable rigor not achievable with unstructured files (potential_type: claim)"
+
+  - turn: 1
+    type: observation
+    id: "O02"
+    routing: staged
+    provenance: user
+    summary: "SMB guest access is the viable Unraid mount method when SSH pubkey absent and NFS disabled (potential_type: constraint)"
+
+ai_actions:
+  - turn: 1
+    action: "Read SKILL.md, event-taxonomy.md, existing ara/ files for current state"
+    provenance: ai-executed
+    files_changed: []
+
+  - turn: 1
+    action: "Appended N20–N26 to trace/exploration_tree.yaml (7 nodes: 4 decisions, 1 dead_end, 2 experiments)"
+    provenance: ai-executed
+    files_changed: ["trace/exploration_tree.yaml"]
+
+  - turn: 1
+    action: "Wrote staging/observations.yaml with O01 (claim candidate) and O02 (constraint candidate)"
+    provenance: ai-executed
+    files_changed: ["staging/observations.yaml"]
+
+  - turn: 1
+    action: "Created trace/sessions/2026-05-05_001.yaml (this file)"
+    provenance: ai-executed
+    files_changed: ["trace/sessions/2026-05-05_001.yaml"]
+
+  - turn: 1
+    action: "Updated trace/sessions/session_index.yaml with 2026-05-05_001 entry"
+    provenance: ai-executed
+    files_changed: ["trace/sessions/session_index.yaml"]
+
+  - turn: 1
+    action: "Appended entry to trace/pm_reasoning_log.yaml"
+    provenance: ai-executed
+    files_changed: ["trace/pm_reasoning_log.yaml"]
+
+claims_touched: []
+
+key_context:
+  - turn: 1
+    excerpt: >
+      "Discovery of the ARA (Agent-Native Research Artifact) protocol from Orchestra-Research.
+      Decision to adopt ARA format for all WyLab projects. ARA repo created at git.wylab.me/nanobot/ara.
+      Three ARA skills installed. Unraid ara SMB share created and mounted at //192.168.1.50/ara.
+      nanobot ARA compiled (30 files, 145+ Seal L1 checks pass). traefik-infrastructure ARA compiled
+      (22 files, Seal L1 validated). research-manager wired into compaction protocol in KNOWLEDGE.md.
+      Dead ends: Unraid LAN at 192.168.1.50 (not .78 as previously in KNOWLEDGE.md), SSH requires
+      pubkey, NFS not enabled, SMB guest access works."
+
+open_threads:
+  - "Unraid SSH pubkey not yet provisioned — blocks automated SSH-based tasks against Unraid"
+  - "NFS not enabled on Unraid — SMB guest is current workaround; may want NFS for performance later"
+  - "ara-rigor-reviewer (L2) not yet run on either ARA — only L1 validated so far"
+  - "O01 (ARA rigor claim) and O02 (SMB constraint) staged but not yet crystallized — await closure signals"
+  - "KNOWLEDGE.md .78 IP entry should be corrected to .50 if not already done"
+
+ai_suggestions_pending:
+  - "O01: ARA structured layers enable machine-auditable rigor — staged as potential claim, not yet affirmed"
@@ -0,0 +1,8 @@
+sessions:
+  - id: "2026-05-05_001"
+    date: "2026-05-05"
+    summary: "ARA protocol adopted for WyLab; nanobot + traefik ARAs Seal L1 compiled; Unraid SMB ara share mounted; Unraid IP corrected .78→.50; research-manager wired into compaction protocol"
+    turn_count: 1
+    events_count: 9
+    claims_touched: []
+    open_threads: 5
@@ -0,0 +1,92 @@
+---
+title: "SS14 CI/CD Pipeline: Gitea Actions Build System for wylab-station-14 on Unraid"
+authors: ["Makar Novozhilov (wylab)"]
+year: 2025
+venue: "Internal Engineering Notes"
+doi: "Not applicable — internal project"
+ara_version: "1.0"
+domain: "CI/CD Infrastructure / Self-Hosted DevOps"
+keywords:
+  - gitea-actions
+  - ci-cd
+  - space-station-14
+  - docker
+  - unraid
+  - act-runner
+  - dotnet
+  - cache-corruption
+  - arm64
+  - x86-64
+claims_summary:
+  - "Mixed-architecture runners cause silent cache corruption: arm64 and x86-64 runners share incompatible .NET build cache entries, producing wrong artifacts without build errors."
+  - "Local file cache outperforms native Gitea remote cache: Gitea's built-in cache server times out under load (ETIMEDOUT on port 39913), while local file cache on the runner host is reliable."
+  - "Gitea runner DNS resolution fails in container-network mode: runner job containers cannot resolve internal hostnames (git.wylab.me); adding external DNS (1.1.1.1) does not help and host networking only partially works (1/6 jobs), causing pipeline non-triggers."
+  - "OOM kills dominate under parallel dotnet builds: concurrent job capacity must be capped at 2 for dotnet workloads on the macOS ARM64 runner (OrbStack, no swap) to avoid out-of-memory crashes."
+  - "Architecture-tagged cache keys or pinning builds to a single runner architecture should eliminate cross-arch cache corruption (hypothesis; no A/B test has been run)."
+abstract: |
+  wylab-station-14 is a fork of space-wizards/space-station-14 (Space Station 14 game server)
+  run as a Docker container on an Unraid homelab server (UM790 Pro, 32GB RAM). The CI/CD pipeline
+  is implemented via Gitea Actions on a self-hosted Gitea instance (git.wylab.me) using act-runner.
+  This ARA documents the engineering decisions, failure modes, and dead ends encountered while
+  building and stabilizing this pipeline across three runner configurations: an Unraid-hosted
+  container runner, an external VPS runner (45.137.68.83), and a macOS ARM64 runner via OrbStack.
+  The most critical finding is silent cache corruption from mixed-architecture runners — arm64
+  macOS and x86-64 Unraid runners sharing cache entries leads to architecturally incompatible
+  build artifacts without any explicit build failure. The solution is architecture-tagged cache
+  keys or strict runner pinning. Secondary findings cover DNS resolution strategies, OOM capacity
+  limits for dotnet workloads, and remote vs. local cache reliability.
+---
+
+# SS14 CI/CD Pipeline: Gitea Actions on Unraid
+
+## Overview
+
+wylab-station-14 is a self-hosted Space Station 14 (SS14) game server running as a Docker
+container on an Unraid homelab (UM790 Pro, 32GB RAM, 20+ Docker containers). The build pipeline
+uses Gitea Actions (git.wylab.me) with act-runner to build the server Docker image on every
+commit to the wylab/wylab-station-14 repository (fork of space-wizards/space-station-14).
+
+The project encountered a series of progressively subtler failures across three distinct runner
+configurations. The most dangerous failure mode is **cache architecture mismatch**: when both an
+arm64 macOS runner (OrbStack) and an x86-64 Unraid runner share a cache backend, .NET build
+artifacts are written and read across architectures. The build does not fail explicitly; it
+silently produces wrong binaries. The fix is architecture-tagged cache keys (e.g.,
+`cache-key: dotnet-{{ runner.arch }}-{{ hashFiles('**/*.csproj') }}`) or strict runner label pinning
+so only one architecture ever executes a given workflow.
+
+## Layer Index
+
+### Cognitive Layer (`/logic`)
+| File | Description |
+|------|-------------|
+| [problem.md](logic/problem.md) | Observations → gaps → key insight (DNS, OOM, cache) |
+| [claims.md](logic/claims.md) | 5 falsifiable claims (C01–C05) |
+| [concepts.md](logic/concepts.md) | 7 key concepts: act-runner, cache key, runner label pinning, etc. |
+| [experiments.md](logic/experiments.md) | 4 declarative verification plans (E01–E04) |
+| [solution/architecture.md](logic/solution/architecture.md) | Pipeline component graph |
+| [solution/algorithm.md](logic/solution/algorithm.md) | Build workflow logic + pseudocode |
+| [solution/constraints.md](logic/solution/constraints.md) | Boundary conditions and limitations |
+| [solution/heuristics.md](logic/solution/heuristics.md) | 5 operational heuristics (H01–H05) |
+| [related_work.md](logic/related_work.md) | Related tools and systems (Gitea, act-runner, OrbStack) |
+
+### Physical Layer (`/src`)
+| File | Description | Claims |
+|------|-------------|--------|
+| [configs/training.md](src/configs/training.md) | Runner config parameters (capacity, timeout, cache) | C01, C04 |
+| [configs/model.md](src/configs/model.md) | Workflow YAML configuration patterns | C01, C02, C03 |
+| [execution/cache_key_strategy.py](src/execution/cache_key_strategy.py) | Architecture-tagged cache key generation stub | C01 |
+| [environment.md](src/environment.md) | Build stack: .NET, Node.js, Docker, Gitea runner version | All |
+
+### Exploration Graph (`/trace`)
+| File | Description |
+|------|-------------|
+| [exploration_tree.yaml](trace/exploration_tree.yaml) | 11-node research DAG: 3 dead ends, 3 decisions |
+
+### Evidence (`/evidence`)
+| File | Description |
+|------|-------------|
+| [README.md](evidence/README.md) | Full index of 4 tables |
+| [tables/table1_runner_configurations.md](evidence/tables/table1_runner_configurations.md) | Runner configs tried, outcome per config |
+| [tables/table2_cache_failure_modes.md](evidence/tables/table2_cache_failure_modes.md) | Cache strategies and their failure modes |
+| [tables/table3_capacity_oom_progression.md](evidence/tables/table3_capacity_oom_progression.md) | OOM-driven capacity reduction sequence |
+| [tables/table4_dns_approaches.md](evidence/tables/table4_dns_approaches.md) | DNS resolution approaches and outcomes |
@@ -0,0 +1,51 @@
+# Claims
+
+## C01: Mixed-architecture runners cause silent cache corruption
+- **Statement**: When act-runner jobs run on both arm64 (macOS OrbStack) and x86-64 (Unraid) hosts and share a cache backend with architecture-agnostic keys, .NET build artifacts are written and read across architectures, producing wrong binaries without any explicit build error.
+- **Status**: supported
+- **Falsification criteria**: The claim is refuted if a pipeline using architecture-agnostic cache keys across arm64 and x86-64 runners consistently produces correct x86-64 binaries while cache hit rates stay above 80%.
+- **Proof**: [E01, E02]
+- **Evidence basis**: Session logs from 2025-12-18 and 2025-12-19 document intermittent unexplained failures on the Mac ARM64 runner that were traced to cache entries written by one architecture being consumed by the other. Explicit documentation: "mixed-architecture runners caused cache entries to be architecture-incompatible → builds fail silently with wrong artifacts."
+- **Interpretation**: This is the most dangerous failure mode because it produces green CI status with broken output. DNS failures (C03) and OOM crashes (C04) are at least visible.
+- **Dependencies**: None
+- **Tags**: cache-corruption, arm64, x86-64, act-runner, silent-failure, dotnet
+
+## C02: Local file cache outperforms native Gitea remote cache for act-runner
+- **Statement**: Gitea's native act-cache-server (remote cache protocol) is unreliable for this workload: connections time out (ETIMEDOUT on port 39913) causing cache misses, while local file cache on the runner host is stable and provides consistent cache hits.
+- **Status**: supported
+- **Falsification criteria**: A deployment using native Gitea act-cache-server with correct network configuration achieves <10s cache step latency consistently across 10+ builds. If so, the claim would need qualification.
+- **Proof**: [E03]
+- **Evidence basis**: Session log 2025-12-15 documents ETIMEDOUT on 45.137.68.83:39913 for the external runner's cache server connection. The .NET cache step took 5 minutes (vs 5 seconds for other steps) indicating a full cache miss on every run. Session 2025-12-19 documents successful switch to local file cache.
+- **Interpretation**: The ETIMEDOUT may be specific to the external VPS network topology (port forwarding, firewall). However, local file cache avoids network entirely and is the simpler and more reliable option for single-runner setups.
+- **Dependencies**: None
+- **Tags**: cache, act-cache-server, local-file-cache, performance, reliability
+
+## C03: Gitea runner job containers cannot resolve internal hostnames in container-network mode
+- **Statement**: act-runner job containers spawned in default Docker bridge network mode cannot resolve private hostnames (e.g., git.wylab.me) because they inherit Docker's bridge DNS, not the host's resolver. Adding external DNS (1.1.1.1) to the runner process does not fix this; only host networking mode partially resolves it.
+- **Status**: supported
+- **Falsification criteria**: The claim is refuted if a runner configuration in container-network (non-host) mode, with a custom DNS entry pointing to the Unraid host's internal resolver, resolves git.wylab.me in 6/6 job containers.
+- **Proof**: [E04]
+- **Evidence basis**: Session log 2025-12-14 documents: DNS resolution failures in runner containers, 1.1.1.1 DNS attempt failed, host networking mode yielded 1/6 jobs succeeding. Pattern matches Docker bridge DNS isolation: job containers get bridge-network DNS, not host DNS.
+- **Interpretation**: The proper fix is likely to configure act-runner's `container.network` or `container.options` settings in config.yml (for example, passing Docker's `--dns` flag through `container.options`) so that job containers use the Technitium DNS server (192.168.1.50) or the Docker bridge gateway (172.17.0.1), which can forward to Technitium.
+- **Dependencies**: None
+- **Tags**: dns, act-runner, container-networking, gitea, internal-hostname
+
+## C04: Concurrent dotnet build capacity must be capped at 2 for this hardware
+- **Statement**: Running more than 2 concurrent dotnet build jobs on the macOS ARM64 runner (OrbStack, no swap) causes out-of-memory crashes. The stable operating point is capacity=2.
+- **Status**: supported
+- **Falsification criteria**: The runner operates at capacity=3 or higher for 20+ consecutive builds without OOM crash or significant slowdown.
+- **Proof**: [E03]
+- **Evidence basis**: Session log 2025-12-19 explicitly documents the capacity reduction sequence: started at 6 → 4 → 3 → 2 concurrent jobs, driven by OOM with dotnet builds. OrbStack note: swap not available (macOS manages memory).
+- **Interpretation**: This limit is specific to the combination of OrbStack (no swap), dotnet's build server memory model, and SS14's codebase size. A Linux runner with swap enabled might sustain higher concurrency.
+- **Dependencies**: None
+- **Tags**: oom, capacity, dotnet, orbstack, concurrency
+
+## C05: Architecture-tagged cache keys or runner label pinning eliminates cross-arch cache corruption
+- **Statement**: Including the runner architecture in the cache key (e.g., `dotnet-{{ runner.arch }}-{{ hashFiles('**/*.csproj') }}`) or restricting all builds to a single architecture via runner labels ensures cache entries are never shared across incompatible architectures.
+- **Status**: hypothesis
+- **Falsification criteria**: A pipeline using architecture-tagged keys produces a cache collision (arm64 cache entry consumed by x86-64 job). This would require a bug in the cache key hashing — very unlikely but theoretically possible.
+- **Proof**: [E01, E02]
+- **Evidence basis**: The fix is derived from the failure mode in C01. No systematic A/B test was run comparing keyed vs. unkeyed caches — the fix was proposed based on root cause analysis and is standard practice in cross-platform CI.
+- **Interpretation**: Architecture-tagged keys are the minimal fix. Runner label pinning (e.g., `runs-on: unraid`) is the belt-and-suspenders approach: it also eliminates the OOM risk from the ARM64 runner and simplifies the build environment to match the deployment target.
+- **Dependencies**: C01
+- **Tags**: cache-key, runner-pinning, architecture, fix, prevention
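The key scheme described in C05 can be illustrated with a short Python sketch. It is not the contents of src/execution/cache_key_strategy.py, which this commit does not show. The function names, the `x64`/`arm64` labels, and the 16-character digest length are all assumptions chosen for illustration.

```python
import hashlib
import platform
from pathlib import Path
from typing import Iterable


def normalize_arch(machine: str) -> str:
    """Map common platform.machine() spellings onto two canonical labels."""
    aliases = {"x86_64": "x64", "amd64": "x64", "arm64": "arm64", "aarch64": "arm64"}
    return aliases.get(machine.lower(), machine.lower())


def cache_key(project_files: Iterable[Path], machine: str = "") -> str:
    """Build 'dotnet-<arch>-<digest>' from the CPU architecture and project file contents."""
    # Fall back to the architecture of the machine running this code.
    arch = normalize_arch(machine or platform.machine())
    digest = hashlib.sha256()
    for path in sorted(project_files):  # sorted so the key does not depend on input order
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return f"dotnet-{arch}-{digest.hexdigest()[:16]}"
```

Because the architecture label is part of the key, an arm64 runner and an x86-64 runner hashing the same `.csproj` files get different keys and therefore never read each other's entries.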
@@ -0,0 +1,75 @@
+# Problem Specification
+
+## Observations
+
+### O1: Pipeline non-triggering on commits
+- **Statement**: After initial setup, the Gitea Actions pipeline for wylab-station-14 did not trigger on commits to git.wylab.me. The runner was registered but jobs were never dispatched.
+- **Evidence**: Session log 2025-12-14; multiple failed trigger attempts documented.
+- **Implication**: Pipeline cannot be used at all until runners can communicate with the Gitea server.
+
+### O2: DNS resolution failure inside runner job containers
+- **Statement**: Runner job containers (spawned by act-runner for each workflow job) could not resolve the internal hostname `git.wylab.me`. Adding `1.1.1.1` as DNS to the runner process or its app container did not fix the issue. Host networking mode partially worked: 1 out of 6 jobs succeeded.
+- **Evidence**: Session log 2025-12-14; host-network experiment with 1/6 success rate.
+- **Implication**: act-runner spawns separate containers per job; DNS config must propagate to those job containers specifically, not just the runner process container.
+
+### O3: External VPS runner had persistent Node.js module cache corruption
+- **Statement**: The external runner on 45.137.68.83 repeatedly failed with "Cannot find module" errors in `/opt/gitea-runner/.cache/act/`. Clearing the cache directory and restarting the runner resolved it temporarily, but corruption recurred.
+- **Evidence**: Session log 2025-12-15; multiple manual SSH debugging sessions documented.
+- **Implication**: The act-runner cache directory on the VPS was susceptible to partial writes or interrupted downloads, leaving broken module stubs.
+
+### O4: Native Gitea remote cache times out under load
+- **Statement**: When using the native Gitea cache server (act-cache-server protocol), connections to 45.137.68.83:39913 timed out with `ETIMEDOUT`. The .NET cache step took 5 minutes where comparable steps took 5 seconds, indicating cache misses on every run.
+- **Evidence**: Session log 2025-12-15; ETIMEDOUT error on port 39913 documented.
+- **Implication**: Gitea's built-in act-cache-server is unreliable under this workload; local file cache is the viable alternative.
+
+### O5: Mac ARM64 runner OOM under concurrent dotnet builds
+- **Statement**: The macOS ARM64 runner (OrbStack) crashed under load when running concurrent dotnet builds. Capacity was progressively reduced: 6 → 4 → 3 → 2 concurrent jobs before stability was achieved. OrbStack does not expose swap (macOS manages memory pressure), amplifying OOM frequency.
+- **Evidence**: Session log 2025-12-19; capacity reduction sequence explicitly documented.
+- **Implication**: dotnet SDK memory footprint (compilation, MSBuild process, build server) is substantial; concurrent jobs multiply it linearly. 2 concurrent jobs is the stable limit for this workload on available hardware.
+
+### O6: Mixed-architecture runner cache corruption (silent failure)
+- **Statement**: When both the arm64 macOS runner and the x86-64 Unraid runner shared a cache backend (local file cache or remote), .NET build artifacts written by one architecture were read and used by the other. Builds did not fail explicitly — they produced architecturally incompatible binaries silently.
+- **Evidence**: Session log 2025-12-18 / 2025-12-19; documented as the root cause of intermittent unexplained failures; identified as the primary unresolved issue as of 2025-12-19.
+- **Implication**: Cache keys must encode architecture (e.g., `{{ runner.arch }}`), or builds must be pinned to a single architecture via runner labels.
+
+### O7: Second runner registration broke existing runner
+- **Statement**: When a second Gitea runner was registered on the same Mac host alongside the first, the existing runner broke. The second runner was deleted and changes reverted.
+- **Evidence**: Session log 2025-12-18; explicit documentation of "Deleted runner 2, reverted everything after frustration."
+- **Implication**: Running multiple act-runner instances on the same host requires careful port/socket isolation; naïve second registration causes conflicts.
+
+### O8: yaml-schema-validator action pull access denied
+- **Statement**: The `yaml-schema-validator` action used in a workflow step failed with "pull access denied" — the Docker image for this action could not be pulled inside the runner job container.
+- **Evidence**: Session log 2025-12-18; documented as one of the runner failure error logs.
+- **Implication**: Third-party actions requiring Docker image pulls inside job containers depend on correct Docker auth and network access within those containers.
+
+## Gaps
+
+### G1: No reliable DNS propagation path to runner job containers
+- **Statement**: There is no documented, working method to propagate internal DNS resolution (for git.wylab.me) into act-runner's job containers while keeping the runner in container-network mode.
+- **Caused by**: O2
+- **Existing attempts**: Adding 1.1.1.1 to runner DNS config, host networking mode, applying DNS at app container level.
+- **Why they fail**: 1.1.1.1 cannot resolve private internal hostnames. Host networking partially works but is a security boundary violation. DNS config must target the job-spawned containers, not the runner wrapper.
+
+### G2: Cache keys do not encode runner architecture
+- **Statement**: Default Gitea Actions cache key templates (e.g., based on OS + hash of lock files) do not include CPU architecture. On homogeneous runners this is safe; on heterogeneous setups it causes cross-architecture cache pollution.
+- **Caused by**: O6
+- **Existing attempts**: Local file cache as alternative to remote cache (reduces exposure but doesn't eliminate it if multiple runners share a mount).
+- **Why they fail**: Local file cache per-runner would fix it, but shared NFS/network mounts reintroduce the problem. The root fix requires architecture in the key.
+
+### G3: No OOM protection for concurrent dotnet builds
+- **Statement**: act-runner's concurrency setting is a single integer with no per-job resource guards. There is no mechanism to limit memory per job or to queue builds when memory pressure is high.
+- **Caused by**: O5
+- **Existing attempts**: Reducing concurrent job count (6 → 2).
+- **Why they fail**: Manual tuning is brittle; a single large build can still OOM even at concurrency=2 if the dotnet build server accumulates memory over time.
+
+## Key Insight
+- **Insight**: The most dangerous failure mode in multi-runner Gitea Actions setups is **silent artifact corruption from architecture mismatch in shared cache**. Unlike DNS failures (explicit error) or OOM (process crash), a cache hit returning wrong-arch binaries produces no error signal — the build "succeeds" but the output is unusable or subtly wrong. This failure mode is invisible to standard CI green/red status.
+- **Derived from**: O6, O2, O5 (contrast: DNS and OOM failures are explicit; cache corruption is silent)
+- **Enables**: Prioritize architecture-tagged cache keys and runner label pinning over all other fixes; treat cache key design as a correctness constraint, not a performance optimization.
+
+## Assumptions
+- A1: The Gitea instance (git.wylab.me) is reachable from all runner hosts but NOT from inside runner job containers using default Docker bridge networking.
+- A2: The wylab-station-14 repository requires .NET SDK for building the SS14 server (C# codebase, same as upstream space-wizards/space-station-14).
+- A3: The Unraid host runner is x86-64; the macOS runner (OrbStack) is ARM64.
+- A4: The target deployment environment (Docker container on Unraid) is x86-64, so arm64-compiled artifacts are always wrong for production.
+- A5: OrbStack on macOS does not expose swap memory; memory pressure causes OOM kills rather than graceful degradation.
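Assumption A1 and observation O2 both turn on whether a hostname resolves from inside a job container rather than on the runner host. A minimal probe for that, assuming Python is available in the job image, might look like the sketch below. The helper name `can_resolve` is hypothetical and the probe is not part of the documented pipeline.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver yields at least one address for hostname."""
    try:
        # getaddrinfo uses the resolver of the environment this code runs in,
        # so the result differs between the runner host and a job container.
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False
```

Running `can_resolve("git.wylab.me")` once on the runner host and once as a workflow step would show directly whether the two environments resolve the name differently, before any DNS or network-mode change is attempted.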