Heuristics
H01: Always encode runner architecture in .NET cache keys
- Rationale: dotnet build artifacts (NuGet packages, MSBuild output) are architecture-specific. A cache key based only on project file hashes will cause cross-architecture pollution when multiple runner architectures share a cache backend. The failure is silent: builds produce wrong-arch binaries with a green CI status. Encoding ${{ runner.arch }} in the key creates architecture-isolated cache namespaces at zero cost.
- Sensitivity: high — omitting this causes silent artifact corruption, the most dangerous failure mode in this pipeline
- Bounds: Required whenever ≥2 runner architectures exist in the runner pool. Redundant (but harmless) when all runners are pinned to one architecture via labels.
- Code ref: src/execution/cache_key_strategy.py
- Source: Session logs 2025-12-18 / 2025-12-19; stated as primary unresolved issue; documented explicitly in project dead ends
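The key construction described in H01 can be sketched as follows. This is a minimal illustration, not the contents of src/execution/cache_key_strategy.py: the function name build_cache_key, the "nuget" prefix, and the key layout are all assumptions made for the example.

```python
import hashlib


def build_cache_key(os_name, arch, project_files, prefix="nuget"):
    """Build a cache key that is isolated per OS and per architecture.

    project_files maps a project file path to its raw bytes.
    The function name, prefix, and key layout are hypothetical.
    """
    digest = hashlib.sha256()
    # Sort paths so the key does not depend on dict insertion order.
    for path in sorted(project_files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(project_files[path])
        digest.update(b"\0")
    # The architecture sits in the key itself, so an X64 runner can
    # never restore a cache entry written by an ARM64 runner.
    return f"{prefix}-{os_name}-{arch}-{digest.hexdigest()[:16]}"


files = {"Content.Server/Content.Server.csproj": b"<Project />"}
key_x64 = build_cache_key("Linux", "X64", files)
key_arm64 = build_cache_key("Linux", "ARM64", files)
```

With identical project files, key_x64 and key_arm64 share the same hash suffix but differ in the architecture segment, which is the isolation H01 asks for.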
H02: Set shutdown_timeout: 30m in runner config to prevent zombie containers
- Rationale: act-runner jobs that are cancelled, time out, or fail mid-step leave Docker containers running if no shutdown timeout is configured. These zombie containers accumulate over time, consuming disk space, memory, and Docker namespace resources. A 30-minute timeout ensures containers are killed after the maximum expected job duration.
- Sensitivity: medium — impact grows over time; initially invisible, becomes critical after days of accumulated zombies
- Bounds: 30 minutes is a reasonable upper bound for dotnet build + docker build operations in this codebase. Too low (e.g., 5m) will kill legitimate long builds. Too high (e.g., 24h) provides no protection.
- Code ref: src/configs/training.md
- Source: Session logs 2025-12-15 and 2025-12-19; applied to both external runner and Mac runner
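The bounds in H02 can be expressed as a small sanity check. This is a sketch under stated assumptions: the helper names, the slack_factor threshold for "too high", and the restriction to single-unit durations such as 30m are all hypothetical and are not part of act-runner.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a single-unit duration such as '30m' or '24h' to seconds.

    Compound forms such as '1h30m' are deliberately not handled here.
    """
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def classify_shutdown_timeout(timeout, longest_job="30m", slack_factor=4):
    """Classify a timeout relative to the longest expected job.

    slack_factor is an assumed threshold: anything beyond
    slack_factor * longest_job is treated as giving no real protection.
    """
    timeout_s = parse_duration(timeout)
    longest_s = parse_duration(longest_job)
    if timeout_s < longest_s:
        return "too_low"
    if timeout_s > slack_factor * longest_s:
        return "too_high"
    return "ok"
```

Under these assumptions, 5m is classified as too low, 30m as ok, and 24h as too high, matching the examples given in the bounds.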
H03: Cap concurrent job capacity at 2 for dotnet builds on memory-constrained runners
- Rationale: dotnet's MSBuild build server and NuGet restore are memory-intensive. On a runner host without swap (OrbStack on macOS), concurrent jobs multiply memory usage without the safety valve of swap, triggering OOM kills. The empirically safe limit for this workload on the available hardware is 2.
- Sensitivity: high — exceeding this causes OOM crashes that interrupt all running builds
- Bounds: Capacity=2 is the highest setting that proved stable for the SS14 codebase on OrbStack. May safely increase on Linux runners with swap. Never exceed 2 on OrbStack without swap.
- Code ref: src/configs/training.md
- Source: Session log 2025-12-19; capacity reduction sequence 6 → 4 → 3 → 2 explicitly documented
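The capacity rule in H03 amounts to a clamp that applies only when the host has no swap. A minimal sketch, where the function name effective_capacity and its parameters are hypothetical and the default cap of 2 is the empirical value reported above:

```python
def effective_capacity(requested, has_swap, no_swap_cap=2):
    """Return the job capacity to configure for a runner host.

    Without swap, the requested value is clamped to no_swap_cap.
    With swap, the requested value is passed through unchanged; this
    sketch does not model how far it can safely be raised.
    """
    if requested < 1:
        raise ValueError("capacity must be at least 1")
    if has_swap:
        return requested
    return min(requested, no_swap_cap)
```

For example, requesting 6 on a host without swap yields 2, the end point of the 6 → 4 → 3 → 2 reduction sequence.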
H04: Use local file cache instead of act-cache-server for reliability
- Rationale: Gitea's built-in act-cache-server relies on TCP connectivity from job containers to the runner host on port 39913. This fails with ETIMEDOUT when network configuration is incorrect, causing every build to be a full cold build (5 minutes for .NET packages instead of 5 seconds). Local file cache bypasses the HTTP protocol entirely by mounting a host directory into the job container — no network required, no port to configure.
- Sensitivity: medium — the cache server works correctly when network config is right; local file cache is just simpler and more reliable
- Bounds: Local file cache works well for single-runner setups. For multi-runner setups sharing a cache, the local files must be on a shared mount — which reintroduces the need for architecture-tagged keys (H01). For distributed teams with many runners, a properly configured remote cache may be preferable.
- Code ref: src/configs/training.md
- Source: Session log 2025-12-15 (ETIMEDOUT failure); session 2025-12-19 (switched to local file cache)
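To tell whether the ETIMEDOUT failure described in H04 applies to a given setup, a TCP probe run from inside a job container can confirm whether the cache server port is reachable. This is a diagnostic sketch only: the function name is hypothetical, the runner host address must be supplied by the caller, and 39913 is the port quoted above.

```python
import socket


def cache_server_reachable(host, port=39913, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    A timeout or a refused connection both count as unreachable;
    both surface as OSError subclasses in Python 3.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A False result from inside the job container points to the network configuration problem that motivates the local file cache. A True result shows only that the port accepts connections, not that the cache protocol itself works.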
H05: Register only one runner per host; isolate runners by runner name and port
- Rationale: Running two act-runner instances on the same host naïvely (without port/socket isolation) causes runner conflicts and breaks the existing runner. Each act-runner instance needs distinct config.yml paths, work directories, and HTTP ports to coexist. Simpler: use a single runner per host and tune its concurrency instead of adding a second instance.
- Sensitivity: high — second naive registration immediately breaks the first runner
- Bounds: If multiple runners on one host are truly needed, they must have distinct: (1) runner name, (2) config.yml path, (3) work directory, (4) listener port. One runner per host is simpler and sufficient for typical homelab workloads.
- Code ref: src/configs/training.md
- Source: Session log 2025-12-18; "Deleted runner 2, reverted everything after frustration" — explicit documentation of this failure
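The four isolation requirements listed in the H05 bounds can be checked mechanically before a second runner is registered. A minimal sketch, assuming each runner instance is described by a dict; the field names, paths, and port numbers below are hypothetical and do not come from act-runner's own config schema.

```python
from collections import defaultdict

# Fields that must be unique per runner instance on one host (H05).
ISOLATED_FIELDS = ("name", "config_path", "work_dir", "port")


def find_conflicts(instances):
    """Return {field: {value: [instance indexes]}} for shared values.

    An empty dict means the instances are isolated on every field.
    """
    conflicts = {}
    for field in ISOLATED_FIELDS:
        seen = defaultdict(list)
        for index, instance in enumerate(instances):
            seen[instance[field]].append(index)
        shared = {value: idxs for value, idxs in seen.items() if len(idxs) > 1}
        if shared:
            conflicts[field] = shared
    return conflicts


runners = [
    {"name": "mac-1", "config_path": "/opt/r1/config.yml",
     "work_dir": "/opt/r1/work", "port": 8088},
    {"name": "mac-2", "config_path": "/opt/r2/config.yml",
     "work_dir": "/opt/r2/work", "port": 8088},  # port reused by mistake
]
conflicts = find_conflicts(runners)
```

Here conflicts reports the shared port, which is the kind of collision that breaks the first runner when a second one is added naively.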