feat: ss14-cicd ARA — Seal L1 pass, mixed-arch cache dead end documented

# Evidence Index

All evidence is sourced from session logs in HISTORY.md (Makar Novozhilov's engineering notes,
December 2025). No formal benchmarks were run; the tables below are extracted from operational log data.

## Tables

| File | Source | Claims | Description |
|------|--------|--------|-------------|
| [tables/table1_runner_configurations.md](tables/table1_runner_configurations.md) | HISTORY.md: 2025-12-14 through 2025-12-19 | C01, C02, C03, C04 | All three runner configurations attempted, with outcomes and failure modes |
| [tables/table2_cache_failure_modes.md](tables/table2_cache_failure_modes.md) | HISTORY.md: 2025-12-15, 2025-12-18, 2025-12-19 | C01, C02 | Cache strategies tested and their failure modes |
| [tables/table3_capacity_oom_progression.md](tables/table3_capacity_oom_progression.md) | HISTORY.md: 2025-12-19 | C04 | OOM-driven concurrent job capacity reduction sequence |
| [tables/table4_dns_approaches.md](tables/table4_dns_approaches.md) | HISTORY.md: 2025-12-14 | C03 | DNS resolution approaches tested and their outcomes |

## Figures

No quantitative figures are available: this is an engineering log project, not a benchmarked experiment.
Performance observations (e.g., "5 minutes vs 5 seconds") are captured in the tables above.

# Table 1: Runner Configurations Attempted

**Source**: Session logs in HISTORY.md: 2025-12-14, 2025-12-15, 2025-12-18, 2025-12-19

**Caption**: All three act-runner configurations attempted for the wylab-station-14 CI/CD pipeline, with their architecture, outcome, and primary failure mode.

**Extraction type**: raw_table

| Configuration | Host | Architecture | Period | Primary Failure Mode | Outcome |
|--------------|------|-------------|--------|---------------------|---------|
| Unraid container runner | 192.168.1.50 (Unraid) | x86-64 (amd64) | 2025-12-14 | DNS resolution failure: job containers cannot resolve git.wylab.me in bridge network mode; with host networking, only 1 of 6 jobs succeeded | Reverted / partially abandoned |
| External VPS runner | 45.137.68.83 (Contabo, Düsseldorf) | x86-64 (amd64) | 2025-12-15 | Node.js module errors in .cache/act/; native Gitea cache ETIMEDOUT on port 39913; .NET cache step took ~5 min vs ~5 s for other steps | Abandoned: crashing under load |
| macOS ARM64 runner (OrbStack) | Developer MacBook, Apple Silicon | ARM64 (arm64) | 2025-12-18 to 2025-12-19 | OOM crashes with concurrent dotnet builds; mixed-arch cache corruption (arm64 entries consumed by x86-64 jobs, silently producing wrong-arch artifacts); runner kept crashing | Active but unstable as of 2025-12-19; proposed fix: architecture-tagged cache keys or runner label pinning |

# Table 2: Cache Strategies and Failure Modes

**Source**: Session logs in HISTORY.md: 2025-12-15, 2025-12-18, 2025-12-19

**Caption**: Cache strategies tested for the wylab-station-14 .NET build pipeline, with their configuration, observed behavior, and disposition.

**Extraction type**: raw_table

| Cache Strategy | Protocol | Configured On | Observed Behavior | Failure Mode | Disposition |
|---------------|----------|---------------|------------------|-------------|------------|
| Native Gitea act-cache-server (remote) | HTTP on port 39913 | External VPS runner (45.137.68.83) | .NET cache step: ~5 minutes (vs ~5 seconds for other steps); ETIMEDOUT connecting to 45.137.68.83:39913 from inside job containers | Full cache miss on every build; job containers cannot reach port 39913 through the Docker bridge | Abandoned |
| Local file cache (volume mount) | Filesystem (no HTTP) | macOS OrbStack runner | Stable cache hits; no timeout errors | No direct failure, but sharing it between arm64 and amd64 runners (e.g., via an NFS mount) would reproduce the C01 cache corruption | Adopted for single-runner use; requires architecture-tagged keys for multi-runner use |
| Architecture-agnostic cache key | N/A (key design flaw) | Both runners sharing cache | arm64 cache entries with key `dotnet-{hash}` returned as hits for amd64 jobs (same project hash, same key) | Silent wrong-arch artifact production: builds pass CI but produce arm64 binaries for an x86-64 deployment target | Root cause of C01; fix: include `runner.arch` in the key |
| Architecture-tagged cache key | N/A (proposed fix) | Proposed for all runners | Not yet tested as of 2025-12-19 | No known failure mode: the key includes the architecture, so a cross-arch collision cannot occur | Proposed as C05 fix; see H01 |

# Table 3: OOM-Driven Concurrent Job Capacity Reduction

**Source**: Session log in HISTORY.md: 2025-12-19

**Caption**: Sequential reduction of act-runner concurrent job capacity on the macOS ARM64 OrbStack runner due to out-of-memory crashes from concurrent dotnet builds. OrbStack does not expose swap memory to its Linux VM (macOS manages memory pressure at the hypervisor level).

**Extraction type**: raw_table

| Step | Capacity Setting | Observed Outcome | Action Taken |
|------|-----------------|-----------------|-------------|
| Initial | 6 concurrent jobs | OOM crash under dotnet build load | Reduced capacity |
| Reduction 1 | 4 concurrent jobs | Still OOM crashing | Reduced capacity |
| Reduction 2 | 3 concurrent jobs | Still OOM crashing | Reduced capacity |
| Reduction 3 | 2 concurrent jobs | Stable: no OOM crashes observed | Kept at 2 |

**Notes**:
- OrbStack constraint: no swap is exposed to the Linux VM, so OOM kills are abrupt, with no graceful degradation
- Dotnet memory pressure: the MSBuild build server, NuGet restore, and compilation all consume significant memory; total use grows roughly linearly with the number of concurrent jobs
- This constraint is specific to OrbStack on macOS; Linux runners with swap enabled may support higher concurrency
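
Before stepping capacity down by trial and error, memory alone gives a rough upper bound. The Python sketch below rests on two assumptions. First, peak memory per concurrent dotnet job is roughly constant. Second, those peaks add up linearly, as the note above describes. The VM size, reserve, and per-job figures in the usage line are hypothetical, because HISTORY.md records no per-job memory measurements.

```python
import math


def max_safe_capacity(vm_memory_gib: float, reserved_gib: float,
                      per_job_peak_gib: float) -> int:
    """Largest concurrent job count whose summed peak memory fits in the VM.

    Assumes each job's peak memory is roughly constant and that peaks add up
    linearly; with no swap (OrbStack), exceeding the budget means an OOM kill.
    """
    if per_job_peak_gib <= 0:
        raise ValueError("per_job_peak_gib must be positive")
    usable = vm_memory_gib - reserved_gib  # memory left after OS, Docker daemon, runner
    if usable <= 0:
        return 0
    return math.floor(usable / per_job_peak_gib)


# Hypothetical numbers: a 16 GiB VM, 4 GiB reserved, ~5 GiB peak per dotnet job.
print(max_safe_capacity(16, 4, 5))  # 2
```

The result would go into the runner's `capacity` setting in config.yml. Table 3's stable value of 2 is the empirical counterpart to this estimate.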

# Table 4: DNS Resolution Approaches and Outcomes

**Source**: Session log in HISTORY.md: 2025-12-14

**Caption**: DNS resolution approaches tested to let runner job containers resolve the internal hostname git.wylab.me, which is served only by Technitium DNS at 192.168.1.50.

**Extraction type**: raw_table

| Approach | Configuration | Target | Result | Notes |
|----------|--------------|--------|--------|-------|
| Default bridge network | No custom DNS | Runner process container | Failed: git.wylab.me unresolvable in all job containers | Docker bridge DNS does not forward to Technitium |
| External DNS (1.1.1.1) | Added 1.1.1.1 to runner DNS config | Runner process container | Failed: 1.1.1.1 cannot resolve a private internal hostname | Public DNS has no record for git.wylab.me |
| Host networking mode | `container.network: host` in runner config | Job containers | Partial: 1 of 6 jobs succeeded; inconsistent | Cause of the inconsistency not determined; changes reverted |
| Apply DNS to app container (alternative) | DNS applied to the Gitea app container (not the runner) | Wrong target | Failed: the Gitea app does not need the DNS fix; the runner job containers do | Misidentification of which component needs the fix |
| Docker bridge gateway (172.17.0.1) | `container.dns: ["172.17.0.1"]` in runner config (proposed) | Job containers | Not tested as of 2025-12-14 | The gateway address is the Unraid host itself; expected to work if Technitium answers DNS on that interface (Docker runs no DNS forwarder there on the default bridge) |
| Technitium direct (192.168.1.50) | container.dns: ["192.168.1.50"] in runner config (proposed) | Job containers | Not tested | Only works if 192.168.1.50 is reachable from Docker bridge subnet |
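
To see which resolver a job container actually queries, a workflow step can read that container's `/etc/resolv.conf`. The sketch below assumes Python is available in the job image. It parses the file's text and checks the listed nameservers against a set of "internal" resolver addresses, passed in as a parameter. Treating 172.17.0.1 as internal is an assumption: it holds only if Technitium answers on the Unraid host's bridge interface.

A related caveat about the proposed config key: act-runner's example config.yml exposes `container.options` for extra `docker run` flags, and as far as that example shows it has no `container.dns` key. If the proposed key is rejected, `options: --dns 172.17.0.1` would be the equivalent; this is untested here.

```python
def nameservers(resolv_conf: str) -> list[str]:
    """Return the nameserver addresses from resolv.conf text, in file order."""
    servers = []
    for raw in resolv_conf.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def uses_internal_resolver(resolv_conf: str, internal: set[str]) -> bool:
    """True if any configured nameserver is one expected to know git.wylab.me."""
    return any(ns in internal for ns in nameservers(resolv_conf))


# Hypothetical internal set: Technitium directly, plus the bridge gateway (see caveat above).
INTERNAL = {"192.168.1.50", "172.17.0.1"}
job_conf = "# generated by Docker\nnameserver 1.1.1.1\noptions ndots:0\n"
print(nameservers(job_conf), uses_internal_resolver(job_conf, INTERNAL))  # ['1.1.1.1'] False
```

Inside a job step, `open("/etc/resolv.conf").read()` supplies the text to check.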

# Concepts

## act-runner
- **Notation**: `act-runner` (binary), `config.yml` (configuration file)
- **Definition**: The Gitea Actions runner daemon. It polls the Gitea server for pending workflow jobs, spawns a container per job (or, in host mode, runs steps directly on the runner machine), and streams logs back. It is the Gitea equivalent of the GitHub Actions self-hosted runner. Each runner registers with a token and is assigned labels (e.g., `ubuntu-latest`, `self-hosted`). The runner binary manages the job container lifecycle, including cache volume mounts.
- **Boundary conditions**: Applies only to Gitea Actions workflows; Gitea has no earlier built-in CI format, and external CI systems used alongside Gitea are out of scope. Requires Docker on the host when the container executor is used. By default, job containers get the runner's Docker daemon socket mounted. Does not apply to GitHub Actions, which uses `actions/runner` and a different protocol.
- **Related concepts**: Runner label pinning, Runner job container, Cache key, Gitea Actions

## Runner label pinning
- **Notation**: `runs-on: <label>` in workflow YAML; `labels: [<label>]` in runner config.yml
- **Definition**: A mechanism that restricts workflow jobs to runners with specific labels. In Gitea Actions YAML, `runs-on: unraid` matches only runners whose config.yml declares `labels: [unraid, ...]`. This creates exclusive routing: a job that specifies `runs-on: unraid` will never execute on the macOS ARM64 runner, which carries a different label set. It is used to guarantee the architecture, OS, and resources available to a given job.
- **Boundary conditions**: Effective only when runner labels are unique per architecture/host; if two runners of different architectures share a label, pinning fails. Does not prevent OOM, since it only controls scheduling.
- **Related concepts**: act-runner, Cache key, Architecture-tagged cache key
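
The routing rule can be checked offline before touching any config.yml. The Python sketch below applies the subset match used by `select_runner` in the Algorithm section. The runner names are hypothetical, and the label sets are the ones proposed in that section. The sketch also assumes act-runner's label entry format `name:scheme://image` (e.g. `ubuntu-latest:docker://node:16-bullseye`), in which only the part before the first colon is matched against `runs-on`.

```python
from dataclasses import dataclass
from typing import List, Union


def label_names(entries: List[str]) -> frozenset:
    """Strip the assumed ':scheme://image' suffix from act-runner label entries."""
    return frozenset(entry.split(":", 1)[0] for entry in entries)


@dataclass(frozen=True)
class Runner:
    name: str
    labels: frozenset


def eligible_runners(runs_on: Union[str, List[str]], runners: List[Runner]) -> List[str]:
    """Names of runners carrying every requested label (empty list = job stays queued)."""
    wanted = {runs_on} if isinstance(runs_on, str) else set(runs_on)
    return [r.name for r in runners if wanted <= r.labels]


# Hypothetical runners using the label sets from the Runner Routing Condition.
runners = [
    Runner("unraid-runner", label_names(
        ["unraid:docker://node:16-bullseye", "self-hosted", "linux", "amd64"])),
    Runner("mac-runner", label_names(["mac", "self-hosted", "darwin", "arm64"])),
]
print(eligible_runners("unraid", runners))       # ['unraid-runner']
print(eligible_runners("self-hosted", runners))  # ['unraid-runner', 'mac-runner']
```

The second call shows why a label shared across architectures defeats pinning: either runner may pick the job up.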

## Architecture-tagged cache key
- **Notation**: `key: dotnet-${{ runner.arch }}-${{ hashFiles('**/*.csproj') }}`
- **Definition**: A cache key that encodes the runner's CPU architecture alongside the content hash of the dependency files. This ensures that cache entries written by an arm64 runner are never read by an amd64 runner, and vice versa. In Gitea Actions, the `runner.arch` context variable can be embedded in the key expression. GitHub documents its values as `X86`, `X64`, `ARM`, and `ARM64` rather than Docker's `amd64`/`arm64`, and act-based runners are expected to follow the same convention. `runner.os` is not a substitute: job containers on both runners would report `Linux`, because OrbStack runs them inside a Linux VM. Without architecture encoding, a cache written on arm64 may be returned as a hit for an amd64 job if all other key components match.
- **Boundary conditions**: Requires Gitea Actions to expose `runner.arch` in the expression context; if it is not available, the architecture must be injected as a job-level environment variable. Adds nothing if the job is already pinned by label to a single-architecture runner (redundant but not harmful in that case).
- **Related concepts**: Cache key, Runner label pinning, act-runner, Silent cache corruption
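
The key construction can be reproduced outside the runner to confirm that two architectures never share a cache entry. The Python sketch below assumes that `hashFiles()` works as GitHub documents it: a SHA-256 over the per-file SHA-256 digests. It is an approximation and is not guaranteed to match the runner's exact hash. The architecture strings follow GitHub's `runner.arch` values, and the sample file content is made up.

```python
import hashlib


def project_hash(files: dict[str, bytes]) -> str:
    """Approximate hashFiles(): SHA-256 over each file's SHA-256 digest, in path order."""
    outer = hashlib.sha256()
    for path in sorted(files):
        outer.update(hashlib.sha256(files[path]).digest())
    return outer.hexdigest()


def agnostic_key(arch: str, h: str) -> str:
    return f"dotnet-{h}"         # ignores arch: the C01 flaw


def tagged_key(arch: str, h: str) -> str:
    return f"dotnet-{arch}-{h}"  # architecture-tagged variant


csproj = {"Content.Server/Content.Server.csproj": b'<Project Sdk="Microsoft.NET.Sdk" />'}
h = project_hash(csproj)
print(agnostic_key("X64", h) == agnostic_key("ARM64", h))  # True: the two runners collide
print(tagged_key("X64", h) == tagged_key("ARM64", h))      # False: entries stay separate
```

On a real checkout, `{str(p): p.read_bytes() for p in pathlib.Path(".").rglob("*.csproj")}` would build the `files` mapping.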

## act-cache-server (native Gitea cache)
- **Notation**: `ACTIONS_CACHE_URL=http://<host>:39913/` (environment variable injected into job containers; 39913 is the port observed in this deployment)
- **Definition**: The cache server built into act-runner, which implements the GitHub Actions cache API protocol. When it is enabled, act-runner starts an HTTP server on the runner host (on a configurable port; 39913 here). It injects that server's URL into job containers so that the `actions/cache` step can store and restore cache artifacts over HTTP. The cache itself is stored on the runner host filesystem. This is distinct from the local file cache, which bypasses the HTTP protocol entirely.
- **Boundary conditions**: Requires network reachability from job containers to the cache port on the runner host. In bridge network mode, the host address in `ACTIONS_CACHE_URL` must be reachable from the bridge subnet (typically via the Docker bridge gateway IP). Fails with ETIMEDOUT if the job container cannot reach the host port. Sensitive to firewall rules and Docker network configuration. Not applicable when the local file cache strategy is used instead.
- **Related concepts**: act-runner, Local file cache, Cache key, Runner job container
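
The ETIMEDOUT failure mode can be reproduced from inside a job step by opening a plain TCP connection to the host and port in `ACTIONS_CACHE_URL`. The sketch below is minimal and assumes Python is available in the job image. The 3-second timeout is an arbitrary choice, not an act-runner setting.

```python
import socket
from urllib.parse import urlparse


def cache_server_reachable(cache_url: str, timeout_s: float = 3.0) -> bool:
    """True if a TCP connection to the cache server's host:port succeeds.

    A timeout or refusal here matches the ETIMEDOUT seen from job containers
    on the VPS runner; success says nothing about whether the HTTP API works.
    """
    parsed = urlparse(cache_url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout_s):
            return True
    except OSError:  # covers timeouts, refused connections, and DNS failures
        return False
```

In a job step, `cache_server_reachable(os.environ["ACTIONS_CACHE_URL"])` distinguishes a network-path problem from a cache-protocol problem before any build time is spent.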

## Runner job container
- **Notation**: spawned by act-runner per job; uses the Docker `create` + `start` lifecycle
- **Definition**: A Docker container spawned by act-runner for each individual workflow job. The container image is specified by the workflow (e.g., `container: ubuntu:22.04`) or defaults to the image mapped to the runner label. The job container is ephemeral: it is created at job start and destroyed at job end. It has its own network namespace (by default on the Docker bridge), its own filesystem (with volumes for workspace and cache), and its own DNS resolver configuration. This is why DNS configuration applied to the act-runner process itself does not propagate: the process and the job container live in separate network namespaces.
- **Boundary conditions**: DNS, network access, and Docker socket access are all configured per container, not inherited from the runner process. `container.network: host` in the runner config attaches job containers to the host network namespace. This eliminates DNS isolation but also removes the network security boundary.
- **Related concepts**: act-runner, DNS resolution failure, act-cache-server

## OrbStack
- **Notation**: OrbStack (macOS application); replaces Docker Desktop and Colima
- **Definition**: A macOS virtualization tool that provides a Linux VM for running Docker containers on Apple Silicon (ARM64) and Intel Macs. It is used as a replacement for Docker Desktop and Colima, offering lower overhead and faster startup. When act-runner runs on macOS, all job containers execute inside the OrbStack Linux VM, which is ARM64 on Apple Silicon. OrbStack does not expose swap memory to the Linux VM; macOS manages memory pressure at the hypervisor level. As a result, the VM's OOM killer fires without the grace period that Linux swap provides.
- **Boundary conditions**: Only applicable on macOS. On Apple Silicon, containers run natively as arm64 and amd64 images run only under emulation, so native builds produce arm64 artifacts unless a target platform is set explicitly. No swap exposure means OOM kills are abrupt. Not relevant on Linux runners (Unraid, VPS).
- **Related concepts**: Runner job container, OOM capacity limit, Architecture-tagged cache key

## Gitea Actions
- **Notation**: `.gitea/workflows/*.yaml` (workflow files), Gitea web UI for runs
- **Definition**: Gitea's built-in CI/CD system, compatible with GitHub Actions workflow syntax. Workflow files are placed in `.gitea/workflows/`; Gitea falls back to `.github/workflows/` when that directory is absent. Jobs are dispatched to registered act-runners based on label matching. Gitea Actions supports secrets, artifacts, cache (via act-cache-server or third-party action implementations), and most standard GitHub Actions context variables. Key differences from GitHub Actions:
  - there are no hosted runners (all runners are self-hosted);
  - the cache server is act-cache-server rather than GitHub's cache CDN;
  - some third-party actions may fail because of Docker image pull restrictions inside job containers.
- **Boundary conditions**: Requires at least one registered act-runner to execute jobs; jobs remain queued indefinitely if no matching runner is available. Third-party actions that pull Docker images require Docker registry access from inside job containers.
- **Related concepts**: act-runner, Runner label pinning, Gitea instance

# Experiments

## E01: Architecture-tagged cache key A/B comparison
- **Verifies**: C01, C05
- **Setup**:
  - Model: wylab-station-14 CI workflow (Gitea Actions, `.gitea/workflows/build.yaml`)
  - Hardware: Two runners, arm64 macOS (OrbStack) and x86-64 Unraid host
  - Dataset: Multiple commits to wylab-station-14 triggering builds
  - System: Gitea at git.wylab.me, shared cache backend (local file cache on a shared NFS mount, or act-cache-server)
- **Procedure**:
  1. Configure the workflow with an architecture-agnostic cache key (baseline): `key: dotnet-${{ hashFiles('**/*.csproj') }}`
  2. Trigger 5 builds alternating between the arm64 and x86-64 runners
  3. Inspect each produced Docker image and record its target architecture with `docker inspect --format '{{.Architecture}}' <image>`
  4. Switch to the architecture-tagged key: `key: dotnet-${{ runner.arch }}-${{ hashFiles('**/*.csproj') }}`
  5. Repeat 5 builds alternating runners
  6. Inspect the produced Docker images again
- **Metrics**: Artifact architecture correctness (amd64 vs arm64 output); cache hit rate per architecture; build success rate
- **Expected outcome**:
  - Baseline (agnostic key): some builds produce wrong-architecture artifacts despite a green CI status
  - Tagged key: no build restores a cache entry written by the other architecture, and the per-architecture cache hit rate is maintained. Builds on the x86-64 runner produce amd64 artifacts. Builds on the arm64 runner still target arm64 unless the image build itself is pinned to amd64 (E02 pins the job instead).
- **Baselines**: Agnostic-key multi-arch setup (documented failure state from the 2025-12-18 sessions)
- **Dependencies**: none
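
Tallying E01's metrics by hand across ten builds is error-prone. The Python sketch below aggregates per-build observations: the runner architecture, the `{{.Architecture}}` value from `docker inspect`, and whether the cache step hit. The record field names are hypothetical labels for what the procedure collects. The sample records are invented for illustration and are not results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildRecord:
    runner_arch: str  # architecture of the runner that executed the job
    image_arch: str   # output of docker inspect --format '{{.Architecture}}'
    cache_hit: bool   # whether the cache step reported a hit


def summarize(records: list[BuildRecord], target_arch: str = "amd64") -> dict[str, dict]:
    """Per-runner-architecture counts for the E01 metrics."""
    summary: dict[str, dict] = {}
    for arch in sorted({r.runner_arch for r in records}):
        runs = [r for r in records if r.runner_arch == arch]
        summary[arch] = {
            "builds": len(runs),
            "target_arch_artifacts": sum(r.image_arch == target_arch for r in runs),
            # image architecture differs from the runner's own: the C01 signature
            "cross_arch_artifacts": sum(r.image_arch != r.runner_arch for r in runs),
            "cache_hit_rate": sum(r.cache_hit for r in runs) / len(runs),
        }
    return summary


sample = [
    BuildRecord("amd64", "amd64", True),
    BuildRecord("amd64", "arm64", True),   # illustrative polluted build
    BuildRecord("arm64", "arm64", False),
]
print(summarize(sample))
```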

## E02: Runner label pinning vs. architecture-tagged keys (effectiveness comparison)
- **Verifies**: C01, C05
- **Setup**:
  - Model: wylab-station-14 CI workflow
  - Hardware: Same two-runner setup (arm64 Mac + x86-64 Unraid)
  - Dataset: 10 consecutive commits
  - System: Gitea Actions with label routing
- **Procedure**:
  1. Add `runs-on: unraid` to all build jobs in the workflow YAML
  2. Ensure the Unraid runner's config.yml has `labels: [unraid, self-hosted, linux]`
  3. Trigger 10 builds and verify that all of them execute on the Unraid runner only
  4. Inspect the Docker image architecture for all 10 builds
  5. Compare cache hit rate and build time against the E01 tagged-key baseline
- **Metrics**: Fraction of jobs routed to the Unraid runner; artifact architecture correctness; build time distribution
- **Expected outcome**:
  - All 10 builds execute on the x86-64 Unraid runner
  - All 10 artifacts are amd64
  - Build times are consistent (no cross-arch cache misses)
  - Cache hit rate on the Unraid runner is higher than in the mixed setup
- **Baselines**: Mixed-runner setup without label pinning (documented failure state)
- **Dependencies**: E01

## E03: Local file cache vs. act-cache-server reliability comparison
- **Verifies**: C02, C04
- **Setup**:
  - Model: wylab-station-14 CI workflow with a .NET cache step
  - Hardware: Single runner (either Unraid x86-64 or external VPS 45.137.68.83)
  - Dataset: 10 consecutive builds
  - System: Gitea Actions; two cache configurations tested sequentially
- **Procedure**:
  1. Configure the runner to use the native act-cache-server (cache server enabled on port 39913 in the runner config)
  2. Run 10 builds; record the cache step duration per build
  3. Reconfigure the runner to use a local file cache (volume-mounted directory, no HTTP protocol)
  4. Run 10 builds; record the cache step duration per build
  5. Compare the cache step latency distribution, cache hit/miss counts, and error rate
- **Metrics**: Cache step duration (seconds); cache hit rate; ETIMEDOUT error rate; total build duration
- **Expected outcome**:
  - act-cache-server: some fraction of builds show a ~5-minute cache step (ETIMEDOUT → miss); inconsistent
  - Local file cache: cache step consistently fast; zero timeout errors; higher hit rate
- **Baselines**: Uncached builds (no cache step) as the worst-case reference for build duration
- **Dependencies**: none
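
The E03 comparison reduces to a few statistics per cache configuration. In the Python sketch below, the 240-second cut-off for a "timeout-like" step is an assumption, chosen to sit between the ~5 s and ~5 min behaviors in Table 2. The durations in the example are illustrative, not measurements.

```python
import statistics


def summarize_cache_steps(durations_s: list[float], timeout_like_s: float = 240.0) -> dict:
    """Median, worst case, and share of cache steps slow enough to suggest ETIMEDOUT."""
    if not durations_s:
        raise ValueError("no cache step durations recorded")
    slow = sum(1 for d in durations_s if d >= timeout_like_s)
    return {
        "runs": len(durations_s),
        "median_s": statistics.median(durations_s),
        "max_s": max(durations_s),
        "timeout_like": slow,
        "timeout_like_rate": slow / len(durations_s),
    }


act_cache_server = [5, 300, 6, 302, 5]  # illustrative only
local_file_cache = [4, 5, 5, 4, 6]      # illustrative only
for name, runs in (("act-cache-server", act_cache_server),
                   ("local file cache", local_file_cache)):
    print(name, summarize_cache_steps(runs))
```

The median alone hides the failure mode, since most act-cache-server runs can still be fast. The timeout-like rate is the number that separates the two configurations.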

## E04: DNS configuration strategies for runner job containers
- **Verifies**: C03
- **Setup**:
  - Model: wylab-station-14 CI workflow with a `git clone git.wylab.me/...` step
  - Hardware: Unraid host running act-runner in a Docker container (bridge network)
  - Dataset: Single test commit triggering a build
  - System: act-runner config.yml, Docker daemon.json on Unraid
- **Procedure**:
  1. Baseline: default bridge network, no custom DNS; record the failure (git.wylab.me unresolvable)
  2. Test A: Add `dns: ["1.1.1.1"]` to the runner process container; record the result
  3. Test B: Use `container.network: host` in the runner config.yml; record the result (1 of 6 jobs succeeded on 2025-12-14)
  4. Test C: Configure `container.dns: ["172.17.0.1"]` in the runner config.yml (the Docker bridge gateway, i.e. the Unraid host, where Technitium is expected to answer); record the result
  5. Test D: Configure `container.dns: ["192.168.1.50"]` directly (Technitium IP; only works if reachable from the bridge); record the result
- **Metrics**: DNS resolution success rate (0 to 6 job containers resolving git.wylab.me); build trigger rate; pipeline jobs succeeding end-to-end
- **Expected outcome**:
  - Test A (1.1.1.1): fails, because a public resolver cannot resolve a private internal hostname
  - Test B (host network): partially works; host DNS is available, but the security boundary is removed
  - Test C (bridge gateway 172.17.0.1): expected to work if Technitium listens on the host's bridge interface, since Technitium resolves the internal names
  - Test D (192.168.1.50 direct): may work if Technitium is reachable from the bridge subnet
- **Baselines**: Default bridge network (O2 documented failure state); host networking (O2 1/6 partial success)
- **Dependencies**: none
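
E04's metric counts the job containers that can resolve git.wylab.me. One way to collect it is a probe run as the first step of each job, assuming Python exists in the job image. The probe asks the container's own resolver (whatever `/etc/resolv.conf` points to) and records success or failure. The aggregation helper and its inputs are hypothetical.

```python
import socket


def resolves(hostname: str) -> bool:
    """True if the container's configured resolver returns at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def success_rate(results: list[bool]) -> float:
    """Fraction of job containers that resolved the name (0.0 when nothing ran)."""
    return sum(results) / len(results) if results else 0.0


# One boolean per job container; e.g. the host-networking run on 2025-12-14 was 1 of 6.
print(success_rate([True, False, False, False, False, False]))
```

Used as `sys.exit(0 if resolves("git.wylab.me") else 1)` in a job step, the probe also makes DNS failures fail fast instead of surfacing later as a clone error.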

# Related Work

## RW01: Gitea / Gitea Actions
- **DOI**: https://gitea.com / https://docs.gitea.com/usage/actions/overview
- **Type**: imports
- **Delta**:
  - What changed: Gitea Actions is the CI/CD system used directly; this pipeline is built on its workflow syntax and act-runner infrastructure
  - Why: Self-hosted alternative to GitHub Actions; compatible workflow YAML syntax; integrated with the self-hosted Gitea git server
- **Claims affected**: C02, C03
- **Adopted elements**: Workflow YAML syntax (`.gitea/workflows/*.yaml`), runner registration protocol, label-based job routing, act-cache-server cache protocol

## RW02: act-runner (Gitea's official runner)
- **DOI**: https://gitea.com/gitea/act_runner
- **Type**: imports
- **Delta**:
  - What changed: act-runner is the execution engine for all workflow jobs in this pipeline
  - Why: Official runner for Gitea Actions; supports a Docker container executor; manages the job container lifecycle
- **Claims affected**: C01, C02, C03, C04, C05
- **Adopted elements**: Container executor, config.yml schema (capacity, shutdown_timeout, cache), label registration

## RW03: nektos/act (upstream of act-runner)
- **DOI**: https://github.com/nektos/act
- **Type**: bounds
- **Delta**:
  - What changed: act-runner is built on Gitea's fork of nektos/act; the container spawning model, cache server protocol, and workflow execution engine are inherited from act
  - Why: Understanding act's architecture explains act-runner's behavior (separate network namespace per job container, Docker socket mounting, cache server protocol)
- **Claims affected**: C03 (Docker bridge DNS isolation comes from act's container model)
- **Adopted elements**: Container lifecycle model, act-cache-server protocol (port 39913 in this deployment), job container network isolation

## RW04: space-wizards/space-station-14
- **DOI**: https://github.com/space-wizards/space-station-14
- **Type**: imports
- **Delta**:
  - What changed: wylab-station-14 is a fork of space-wizards/space-station-14; the build process, .NET SDK requirements, and project structure are inherited from upstream
  - Why: SS14 is a C# game server codebase requiring the .NET SDK; the complexity and memory footprint of its dotnet build drive the OOM constraints (C04)
- **Claims affected**: C04 (OOM due to dotnet build complexity)
- **Adopted elements**: .NET project structure, Dockerfile pattern for server containerization

## RW05: OrbStack
- **DOI**: https://orbstack.dev
- **Type**: baseline
- **Delta**:
  - What changed: OrbStack replaced Colima as the macOS Docker provider for the ARM64 runner
  - Why: Lower overhead and faster startup than Docker Desktop and Colima, but it introduces the no-swap constraint
- **Claims affected**: C04 (no-swap behavior amplifies OOM risk)
- **Adopted elements**: Linux VM for Docker containers on macOS; Docker socket integration

## RW06: Technitium DNS Server
- **DOI**: https://technitium.com/dns/
- **Type**: bounds
- **Delta**:
  - What changed: Technitium runs on Unraid (192.168.1.50) and is the authoritative resolver for internal hostnames, including git.wylab.me
  - Why: Without Technitium, git.wylab.me is unresolvable; runner job containers need access to Technitium to complete git operations
- **Claims affected**: C03
- **Adopted elements**: Internal DNS authority for `*.wylab.me`; resolves to the Gitea container IP

## RW07: Traefik Reverse Proxy
- **DOI**: https://traefik.io
- **Type**: bounds
- **Delta**:
  - What changed: Traefik handles TLS termination and HTTP routing for git.wylab.me on Unraid
  - Why: Gitea is exposed externally and internally through Traefik; runner job containers need to route through Traefik to reach Gitea
- **Claims affected**: C03
- **Adopted elements**: TLS termination, routing for git.wylab.me → Gitea container

## Additional Infrastructure References

- **Docker** (docker.com): Container runtime used by act-runner for job containers and by Unraid for all hosted services, including Gitea and the SS14 server. Docker bridge networking is the root cause of the DNS isolation issues (C03).
- **GitHub Actions** (github.com/features/actions): The spiritual predecessor and compatibility target for Gitea Actions. Workflow YAML syntax is largely compatible. Key difference: GitHub provides hosted runners with reliable DNS, whereas self-hosted Gitea Actions requires manual DNS configuration.
- **Microsoft .NET SDK** (dotnet.microsoft.com): Build toolchain for Space Station 14 (a C# codebase). The memory model of the .NET build server is the root cause of the OOM issues (C04). The NuGet package cache is the primary cache target (C01, C02).
- **Contabo VPS** (45.137.68.83, Düsseldorf): External runner host attempted during the 2025-12-15 debugging. Abandoned because of Node.js module errors and cache ETIMEDOUT. Later reused as the SS14 CDN mirror server.

# Algorithm

## Mathematical Formulation

### Cache Key Correctness Condition

Let `K` be the set of cache keys, `A ∈ {arm64, amd64}` be the runner architecture, and
`H = hash(*.csproj)` be the content hash of project dependency files.

A cache key `k ∈ K` is **architecture-safe** if and only if:
```
k = f(A, H, ...)   where f is injective in A
```

That is: for any two keys k₁ = f(A₁, H, ...) and k₂ = f(A₂, H, ...) with A₁ ≠ A₂,
we have k₁ ≠ k₂. This guarantees no cache collision across architectures.

An architecture-agnostic key `k = f(H)` (not injective in A) violates this condition:
two runners with different architectures but the same H produce the same key, causing
cross-architecture cache pollution.
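
For a finite set of architectures, the condition can be checked mechanically. Fix one `H`, evaluate the key function for every pair of architectures, and require distinct outputs. In the Python sketch below, `agnostic` and `tagged` are hypothetical stand-ins for `f(H)` and `f(A, H)`. Checking a single `H` is a spot check, not a proof over all hashes.

```python
from itertools import combinations
from typing import Callable, Sequence

KeyFn = Callable[[str, str], str]


def is_architecture_safe(key_fn: KeyFn, h: str,
                         arches: Sequence[str] = ("arm64", "amd64")) -> bool:
    """True if key_fn(A, h) differs for every pair of distinct architectures A."""
    return all(key_fn(a1, h) != key_fn(a2, h) for a1, a2 in combinations(arches, 2))


def agnostic(arch: str, h: str) -> str:
    return f"dotnet-{h}"         # f(H): ignores the architecture


def tagged(arch: str, h: str) -> str:
    return f"dotnet-{arch}-{h}"  # f(A, H): injective in A for a fixed H


print(is_architecture_safe(agnostic, "3f2a"), is_architecture_safe(tagged, "3f2a"))  # False True
```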

### Runner Routing Condition

Let `L(r)` be the label set of runner `r`, and let `J` be a job requiring label `l`.
Runner `r` is eligible for `J` iff `l ∈ L(r)`.

For strict architecture pinning, assign unique architecture labels:
```
L(unraid_runner) = {unraid, self-hosted, linux, amd64}
L(mac_runner)    = {mac, self-hosted, darwin, arm64}
```

Job YAML:
```yaml
runs-on: unraid  # Only dispatched to unraid_runner
```

This guarantees `A = amd64` for all builds of this job, which makes architecture-tagged
cache keys redundant (but still good practice as defense-in-depth).

## Build Workflow Pseudocode

```
PROCEDURE build_ss14_image(commit_sha):
    # Step 1: Checkout (git clone cannot take a commit SHA directly, so fetch it)
    git init && git remote add origin https://git.wylab.me/wylab/wylab-station-14
    git fetch --depth=1 origin commit_sha
    git checkout FETCH_HEAD

    # Step 2: Restore cache (architecture-safe)
    arch = runner.arch                          # e.g. "X64" or "ARM64" (GitHub's value set)
    project_hash = sha256(glob("**/*.csproj"))
    cache_key = f"dotnet-{arch}-{project_hash}"
    restore cache(key=cache_key, path="~/.nuget/packages")

    # Step 3: Build .NET server
    dotnet restore
    dotnet build --configuration Release --no-restore
    dotnet publish --configuration Release --no-build --output ./publish/

    # Step 4: Save cache
    save cache(key=cache_key, path="~/.nuget/packages")

    # Step 5: Build Docker image
    docker build -t ss14-server:commit_sha -f Dockerfile ./publish/
    docker tag ss14-server:commit_sha registry/wylab/ss14-server:latest

    # Step 6: Push image
    docker push registry/wylab/ss14-server:latest

    # Step 7: Report status
    report status to Gitea (success/failure)

PROCEDURE select_runner(job):
    eligible = {r for r in runners if job.runs_on ⊆ labels(r)}
    if |eligible| == 0:
        queue job indefinitely                  # DANGER: silent queue, no error
    else:
        hand job to the first eligible runner that polls with free capacity
```

## Gitea Actions Workflow YAML (Recommended Pattern)

```yaml
name: Build SS14 Server

on:
  push:
    branches: [main, master]

jobs:
  build:
    runs-on: unraid  # CRITICAL: pin to the x86-64 runner
    container:
      # Match the SDK tag to the version pinned in the repository's global.json
      image: mcr.microsoft.com/dotnet/sdk:7.0
      options: --dns 172.17.0.1  # bridge gateway (Unraid host); assumes Technitium answers there
    # NOTE: JavaScript actions (checkout, cache) run with Node.js inside the job container,
    # and the docker steps need a Docker CLI; the dotnet SDK image ships neither by default.

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Cache .NET packages
        uses: actions/cache@v3
        with:
          path: ~/.nuget/packages
          # Architecture encoded in the key: eliminates cross-arch pollution
          key: dotnet-${{ runner.arch }}-${{ hashFiles('**/*.csproj') }}
          restore-keys: |
            dotnet-${{ runner.arch }}-

      - name: Build
        run: dotnet build --configuration Release

      - name: Publish
        run: dotnet publish --configuration Release --no-build --output ./publish

      - name: Build Docker image
        # Tag with the registry path so that the push step finds the image
        run: docker build -t registry/wylab/ss14-server:${{ github.sha }} -t registry/wylab/ss14-server:latest ./publish

      - name: Push Docker image
        run: docker push registry/wylab/ss14-server:latest
```

## Complexity Analysis

| Operation | Complexity | Notes |
|-----------|------------|-------|
| Cache restore (hit) | O(n), n = package size | Bounded by disk I/O; ~5 s locally |
| Cache restore (miss/ETIMEDOUT) | O(timeout) + O(n) re-download | ~5 min observed |
| dotnet build (cold) | O(source size) | Dominates build time; OOM risk |
| dotnet build (warm) | O(changed files) | Incremental; much faster |
| docker build | O(layer count) | Layer caching reduces this to O(changed layers) |
| Runner job dispatch | O(1) | Poll-based; latency up to one poll interval |

## Key Invariants

1. **Architecture invariant**: All artifacts written to the production registry must have `Architecture: amd64`. Any arm64 artifact reaching the registry is a pipeline bug.
2. **Cache isolation invariant**: No cache entry written by an arm64 runner may be consumed by an amd64 runner (and vice versa). Enforced by architecture-tagged keys.
3. **DNS invariant**: All steps that contact git.wylab.me must run after DNS is confirmed resolvable. Failing fast on DNS errors prevents silent misrouting.
4. **Capacity invariant**: Concurrent dotnet job count ≤ 2 on the current hardware configuration to avoid OOM.
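
Invariant 1 can be enforced as a gate just before the push step. The Python sketch below reads the JSON array printed by `docker image inspect <image>...`; each entry in that array carries `Id`, `RepoTags`, and `Architecture`. The function lists every image whose architecture is not the required one. The sample JSON is trimmed and made up.

```python
import json


def architecture_violations(inspect_json: str, required: str = "amd64") -> list:
    """Image references whose Architecture differs from `required` (empty = gate passes)."""
    bad = []
    for image in json.loads(inspect_json):
        # Prefer the first tag; untagged images fall back to their Id.
        ref = (image.get("RepoTags") or [image.get("Id", "<unknown>")])[0]
        if image.get("Architecture") != required:
            bad.append(ref)
    return sorted(bad)


sample = """[
  {"Id": "sha256:aaaa", "RepoTags": ["registry/wylab/ss14-server:latest"], "Architecture": "arm64", "Os": "linux"},
  {"Id": "sha256:bbbb", "RepoTags": [], "Architecture": "amd64", "Os": "linux"}
]"""
print(architecture_violations(sample))  # ['registry/wylab/ss14-server:latest']
```

In CI, the output of `docker image inspect registry/wylab/ss14-server:latest` would be passed in. The step would exit non-zero whenever the returned list is non-empty, so no arm64 image ever reaches the registry.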
@@ -0,0 +1,99 @@
# Architecture

## System Overview

The ss14-cicd pipeline is a self-hosted Gitea Actions system that builds a Docker image of the
wylab-station-14 Space Station 14 game server and pushes it to a registry; deploying it to the
Unraid Docker container is currently a manual step. The system spans three physical hosts and
multiple software layers.

## Components

### Component 1: Gitea Server
- **Purpose**: Hosts the wylab-station-14 git repository and dispatches CI/CD workflow jobs to registered runners.
- **Host**: Unraid server (Docker container), accessible at git.wylab.me via Traefik reverse proxy
- **Inputs**: Git push events (commits, tags), pull request events; webhook triggers
- **Outputs**: Workflow job dispatch requests sent to registered act-runners; build status updates; artifact storage (if configured)
- **Key design choices**: Self-hosted on the same Unraid host as the target deployment. The internal hostname (git.wylab.me) is only resolvable via Technitium DNS (192.168.1.50). TLS is terminated at Traefik.

### Component 2: act-runner (Gitea Actions Runner)
- **Purpose**: Receives job dispatch from the Gitea server, spawns per-job containers, executes workflow steps, and streams logs back.
- **Host options tried**: (a) Docker container on Unraid, (b) external VPS (45.137.68.83), (c) macOS via OrbStack
- **Inputs**: Job YAML from Gitea, runner registration token, Docker socket
- **Outputs**: Job execution logs, exit codes, artifact uploads, cache writes
- **Key design choices**: Container executor (spawns Docker containers per job), so the runner must have Docker socket access. Cache mode: local file cache preferred over act-cache-server. `shutdown_timeout: 30m` to prevent zombie containers.

### Component 3: Runner Job Container
- **Purpose**: Isolated execution environment for each workflow job. Runs build steps (dotnet build, docker build, etc.).
- **Host**: Inside act-runner's Docker environment (OrbStack VM on Mac, or Unraid Docker daemon)
- **Inputs**: Workflow YAML step definitions, cached volumes (dotnet packages, npm modules), Docker socket (for docker build steps), git clone of wylab-station-14
- **Outputs**: Built .NET assemblies, Docker image (pushed to registry), test results
- **Key design choices**: Ephemeral; destroyed after each job. The network namespace is the Docker bridge by default, which causes DNS failure for git.wylab.me. Must access the Docker socket to build the server image.

### Component 4: Cache Layer
- **Purpose**: Store and restore .NET dependency packages between builds to reduce build time.
- **Implementation**: Local file cache (volume-mounted directory on runner host). Previously attempted: act-cache-server (ETIMEDOUT failures).
- **Inputs**: Cache key (`dotnet-${{ runner.arch }}-${{ hashFiles('**/*.csproj') }}`), cache directory path
- **Outputs**: Cache hit (restore packages) or miss (download fresh)
- **Key design choices**: Architecture must be encoded in the key. Local file cache preferred. Separate cache directories per runner architecture.

### Component 5: Gitea Container Registry / Docker Registry
- **Purpose**: Store the built Docker image for deployment.
- **Inputs**: Docker image layers from the docker build step in the job container
- **Outputs**: Tagged image available for pull by Unraid Docker
- **Key design choices**: May use Gitea's built-in container registry or an external registry (Docker Hub, GHCR). The exact registry is not specified in the source logs.

### Component 6: Unraid Docker (Deployment Target)
- **Purpose**: Runs the live wylab-station-14 game server container.
- **Host**: Unraid server (UM790 Pro, 32GB RAM, 20+ containers)
- **Inputs**: Docker image from registry, server config files
- **Outputs**: Running SS14 game server accessible to players
- **Key design choices**: x86-64 architecture. Must receive an x86-64 (`linux/amd64`) Docker image; arm64 images fail at container start (exec format error) or, if QEMU emulation is installed, run slowly under emulation.

## Data Flow

```
Developer commits → git.wylab.me (Gitea)
  → Gitea dispatches job to act-runner
  → act-runner spawns runner job container (Docker)
  → git clone wylab-station-14 (needs DNS resolution for git.wylab.me)
  → restore cache (dotnet packages, arch-tagged key)
  → dotnet build (compiles C# SS14 server code)
  → docker build (wraps server in container image)
  → docker push (sends image to registry)
  → act-runner reports status to Gitea
  → Gitea shows green/red build status
  → [Manual or auto] Unraid pulls new image → restarts SS14 container
```

## Topology Diagram (Text)

```
[ Developer MacBook ]
        |
        | git push
        v
[ Gitea: git.wylab.me ] ←── Traefik (TLS) ←── Unraid Docker (host: 192.168.1.50)
        |
        | job dispatch (HTTP poll)
        v
[ act-runner ] ← registered on one of:
                 (a) Unraid host (x86-64, preferred)
                 (b) External VPS 45.137.68.83 (x86-64, abandoned)
                 (c) macOS OrbStack (ARM64, problematic)
        |
        | docker create / start
        v
[ Runner Job Container ] (ephemeral)
        ├── git clone git.wylab.me/wylab/wylab-station-14
        ├── restore cache → /root/.nuget/packages (arch-tagged key)
        ├── dotnet build → SS14 server assemblies
        └── docker build → image: ss14-server:latest
        |
        | docker push
        v
[ Container Registry ]
        |
        | docker pull (manual/auto)
        v
[ Unraid: wylab-station-14 container ] → live game server
```
@@ -0,0 +1,55 @@
# Constraints

## Boundary Conditions

### BC1: Single deployment architecture (amd64)
The Unraid production server is x86-64 (amd64). Docker images built for arm64 will either
fail to start (wrong ELF format) or run with performance overhead under QEMU emulation.
All production builds MUST produce amd64 images. Any runner or workflow configuration that
could produce arm64 artifacts without explicit multi-arch intent is a bug.

### BC2: Internal DNS required for Gitea access
The Gitea server at git.wylab.me is only resolvable via Technitium DNS (192.168.1.50).
Public DNS (1.1.1.1, 8.8.8.8) cannot resolve this hostname. Runner job containers must
have access to an internal DNS resolver (via host networking, the Docker bridge gateway
172.17.0.1, or the Technitium IP directly) to perform git operations against git.wylab.me.
Builds that run without this DNS fix will fail at the checkout step.
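
A fail-fast DNS probe, run as the first step of a job, turns the checkout failure described in BC2 into an explicit error. This is a minimal sketch assuming Python is available in the job container; `socket.getaddrinfo` goes through the container's resolver configuration (`/etc/resolv.conf`, `/etc/hosts`), which is typically what git uses as well. The helper names are illustrative.

```python
import socket
import sys


def resolves(hostname: str, port: int = 443) -> bool:
    """Return True if `hostname` resolves through the container's configured resolver."""
    try:
        socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    return True


def require_dns(hostnames: tuple[str, ...] = ("git.wylab.me",)) -> None:
    """Exit non-zero before checkout if any required internal hostname is unresolvable."""
    missing = [name for name in hostnames if not resolves(name)]
    if missing:
        print(f"DNS check failed for {missing}; see BC2 (container DNS)", file=sys.stderr)
        raise SystemExit(1)
```

Calling `require_dns()` at the top of the job makes a misconfigured resolver fail in seconds with a named cause, instead of surfacing later as a generic checkout error.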

### BC3: OOM limit at concurrent jobs ≥ 3 on OrbStack
The macOS ARM64 runner using OrbStack crashes under concurrent dotnet builds when capacity
exceeds 2. This constraint is empirically derived from session logs (6 → 4 → 3 → 2).
OrbStack's no-swap behavior amplifies OOM risk. The limit was measured only on OrbStack;
Linux runners with swap enabled may tolerate a higher capacity.

### BC4: Zombie containers without shutdown_timeout
Without `shutdown_timeout: 30m` in the runner's config.yml, containers from cancelled or timed-out
jobs accumulate on the host. These zombie containers consume disk space, network namespaces,
and other Docker daemon resources. This constraint applies to all runner hosts.

### BC5: Third-party actions require Docker pull access
Actions using Docker images (e.g., yaml-schema-validator) require the Docker daemon used by
the runner to have registry access. In environments where the registry is unreachable (firewall,
no internet, auth failure), these actions fail with "pull access denied." The
yaml-schema-validator action failure is a documented instance.

### BC6: act-cache-server port 39913 must be reachable from job containers
Native Gitea cache (act-cache-server) requires TCP connectivity from job containers to
the runner host on port 39913. In bridge network mode, job containers reach the runner host at
the Docker bridge gateway IP (typically 172.17.0.1). Firewall rules or network configuration that
block this port cause all cache steps to time out (ETIMEDOUT), making every build a full cold build.
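
Whether port 39913 is reachable can be probed from inside a job container before blaming the cache itself. A minimal sketch, assuming the runner host is reached at the bridge gateway 172.17.0.1 as described in BC6; `port_reachable` is a hypothetical helper, and the 3-second timeout is an arbitrary choice.

```python
import socket


def port_reachable(host: str = "172.17.0.1", port: int = 39913, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port is accepted within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeouts (socket.timeout subclasses OSError) and
        # unreachable networks all mean the cache steps would fail.
        return False
```

A `False` result from a job container points at firewall or network configuration rather than at the cache server, without waiting for a cache step to time out.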

## Assumptions

- The Unraid host remains operational and connected to the LAN for all builds.
- git.wylab.me TLS certificate is valid; Traefik handles renewal.
- The wylab-station-14 repository is a standard C# .NET project buildable with `dotnet build`.
- The target Docker image for the SS14 server is self-contained (all dependencies bundled in the image).
- No GPU or special hardware is required for the build process.

## Known Limitations

- **No deployment automation**: The pipeline builds and pushes the Docker image but does not automatically restart the Unraid SS14 container. Deployment is manual (Unraid Docker UI or shell).
- **No test step documented**: Source logs do not document a unit test step in the workflow. SS14 may have tests, but they are not part of the captured pipeline configuration.
- **Single point of failure**: If the Unraid runner is the only registered runner (after abandoning the external VPS and the Mac runner), any Unraid maintenance window blocks all builds.
- **No artifact versioning strategy**: The exact tagging scheme for Docker images (e.g., `:latest` vs. `:commit-sha` vs. `:semver`) is not specified in source logs.
- **Runner registration tokens expire**: Gitea runner tokens are single-use; re-registering a runner requires a new token from the Gitea admin UI.
@@ -0,0 +1,36 @@
# Heuristics

## H01: Always encode runner architecture in .NET cache keys
- **Rationale**: dotnet build caches can hold architecture-specific content (MSBuild output, and NuGet packages that ship RID-specific native assets). A cache key based only on project file hashes will cause cross-architecture pollution when multiple runner architectures share a cache backend. The failure is silent: builds produce wrong-arch binaries with a green CI status. Encoding `${{ runner.arch }}` in the key creates architecture-isolated cache namespaces at zero cost.
- **Sensitivity**: high; omitting this causes silent artifact corruption, the most dangerous failure mode in this pipeline
- **Bounds**: Required whenever ≥2 runner architectures exist in the runner pool. Redundant (but harmless) when all runners are pinned to one architecture via labels.
- **Code ref**: [src/execution/cache_key_strategy.py](../../src/execution/cache_key_strategy.py)
- **Source**: Session logs 2025-12-18 / 2025-12-19; stated as the primary unresolved issue; documented explicitly in project dead ends

## H02: Set shutdown_timeout: 30m in runner config to prevent zombie containers
- **Rationale**: act-runner jobs that are cancelled, time out, or fail mid-step leave Docker containers running if no shutdown timeout is configured. These zombie containers accumulate over time, consuming disk space, memory, and Docker namespace resources. A 30-minute timeout ensures containers are killed after the maximum expected job duration.
- **Sensitivity**: medium; the impact grows over time: initially invisible, it becomes critical after days of accumulated zombies
- **Bounds**: 30 minutes is a reasonable upper bound for dotnet build + docker build operations in this codebase. Too low (e.g., 5m) will kill legitimate long builds. Too high (e.g., 24h) provides no protection.
- **Code ref**: [src/configs/training.md](../../src/configs/training.md)
- **Source**: Session logs 2025-12-15 and 2025-12-19; applied to both the external runner and the Mac runner

## H03: Cap concurrent job capacity at 2 for dotnet builds on memory-constrained runners
- **Rationale**: dotnet's MSBuild build server and NuGet restore are memory-intensive. On a runner host without swap (OrbStack on macOS), concurrent jobs multiply memory usage without the safety valve of swap, triggering OOM kills. The empirically safe limit for this workload on the available hardware is 2.
- **Sensitivity**: high; exceeding this causes OOM crashes that interrupt all running builds
- **Bounds**: Capacity=2 is the highest value that was stable for the SS14 codebase on OrbStack. It may be raised on Linux runners with swap. Never exceed 2 on OrbStack without swap.
- **Code ref**: [src/configs/training.md](../../src/configs/training.md)
- **Source**: Session log 2025-12-19; capacity reduction sequence 6 → 4 → 3 → 2 explicitly documented

## H04: Use local file cache instead of act-cache-server for reliability
- **Rationale**: Gitea's built-in act-cache-server relies on TCP connectivity from job containers to the runner host on port 39913. This fails with ETIMEDOUT when the network configuration is incorrect, causing every build to be a full cold build (5 minutes for .NET packages instead of 5 seconds). Local file cache bypasses the HTTP protocol entirely by mounting a host directory into the job container: no network required, no port to configure.
- **Sensitivity**: medium; the cache server works correctly when the network config is right, and local file cache is simply more reliable
- **Bounds**: Local file cache works well for single-runner setups. For multi-runner setups sharing a cache, the local files must be on a shared mount, which reintroduces the need for architecture-tagged keys (H01). For distributed teams with many runners, a properly configured remote cache may be preferable.
- **Code ref**: [src/configs/training.md](../../src/configs/training.md)
- **Source**: Session log 2025-12-15 (ETIMEDOUT failure); session 2025-12-19 (switched to local file cache)

## H05: Register only one runner per host; if more are needed, isolate them by name, config, work directory, and port
- **Rationale**: Running two act-runner instances on the same host naïvely (without port/socket isolation) causes runner conflicts and breaks the existing runner. Each act-runner instance needs a distinct config.yml path, work directory, and HTTP port to coexist. Simpler: use a single runner per host and tune its concurrency instead of adding a second instance.
- **Sensitivity**: high; a second naive registration immediately breaks the first runner
- **Bounds**: If multiple runners on one host are truly needed, they must have distinct: (1) runner name, (2) config.yml path, (3) work directory, (4) listener port. One runner per host is simpler and sufficient for typical homelab workloads.
- **Code ref**: [src/configs/training.md](../../src/configs/training.md)
- **Source**: Session log 2025-12-18; "Deleted runner 2, reverted everything after frustration" (explicit documentation of this failure)
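
If a second runner on one host is unavoidable, the four isolation requirements in H05's bounds can be checked before registering it. A sketch with a hypothetical schema: each runner instance is described by a dict with `name`, `config`, `workdir` and `port` keys, which are illustrative names and not act-runner settings.

```python
from collections import Counter

# Settings that must differ between act-runner instances sharing a host (H05).
ISOLATED_FIELDS = ("name", "config", "workdir", "port")


def find_runner_collisions(instances: list[dict[str, str]]) -> dict[str, list[str]]:
    """Map each isolated field to the values shared by two or more instances."""
    collisions: dict[str, list[str]] = {}
    for field in ISOLATED_FIELDS:
        counts = Counter(str(inst[field]) for inst in instances)
        shared = sorted(value for value, n in counts.items() if n > 1)
        if shared:
            collisions[field] = shared
    return collisions
```

An empty result means the instances are isolated on all four axes; a non-empty one lists exactly what must change before the second registration.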
@@ -0,0 +1,46 @@
# Workflow and Build Configuration

These parameters describe the Gitea Actions workflow YAML configuration for the wylab-station-14
build pipeline.

## runs-on (runner label selector)
- **Value**: `unraid` (recommended, pinned to x86-64)
- **Rationale**: Pins all build jobs to the Unraid x86-64 runner, preventing arm64 artifact generation and eliminating cross-architecture cache corruption (C01, C05).
- **Search range**: Tested: `ubuntu-latest` (no matching runner), `self-hosted` (matches all runners; too broad), `unraid` (specific, correct)
- **Sensitivity**: high; a wrong label causes jobs to queue indefinitely or route to the wrong architecture
- **Source**: Derived from C05 fix

## cache key template (dotnet)
- **Value**: `dotnet-${{ runner.arch }}-${{ hashFiles('**/*.csproj') }}`
- **Rationale**: Architecture-safe cache key. `runner.arch` ensures arm64 and amd64 cache entries are isolated. `hashFiles('**/*.csproj')` ensures cache invalidation when dependencies change.
- **Search range**: Tested (failure): `dotnet-${{ hashFiles('**/*.csproj') }}` (no arch; causes the C01 failure)
- **Sensitivity**: high; omitting `runner.arch` causes silent cache corruption
- **Source**: Derived from C01 root cause analysis; H01

## dotnet build configuration
- **Value**: `--configuration Release`
- **Rationale**: Release builds are required for production server deployment; Debug builds are compiled without optimizations and may have different performance characteristics.
- **Search range**: Debug (dev only), Release (production)
- **Sensitivity**: low; functionally correct either way, and Release is standard for production
- **Source**: Standard .NET practice; implied by production deployment context

## container DNS option
- **Value**: `--dns 172.17.0.1` (Docker bridge gateway)
- **Rationale**: Allows runner job containers to resolve internal hostnames (git.wylab.me) via Technitium DNS on the Unraid host, without requiring host network mode. This only works if the host answers DNS queries on the bridge interface (for example, with Technitium's port 53 published on all host interfaces).
- **Search range**: Tested: no custom DNS (failed), 1.1.1.1 (failed for private hostnames), host network (1/6 success, inconsistent), 172.17.0.1 (recommended, untested at time of writing)
- **Sensitivity**: high; wrong DNS causes checkout step failure
- **Source**: Session log 2025-12-14 (failure modes); fix derived from the Docker networking model

## dotnet SDK version
- **Value**: Matching upstream space-wizards/space-station-14 requirements (exact version not captured)
- **Rationale**: SS14 is a C# project requiring the .NET SDK; the version must match the project's global.json or TargetFramework declarations.
- **Search range**: .NET 7.x, 8.x, 9.x depending on the upstream SS14 fork version; wylab-station-14 may pin a specific version
- **Sensitivity**: high; a wrong SDK version causes build failures
- **Source**: Inferred from SS14 codebase characteristics; exact version not specified in session logs

## docker build target architecture
- **Value**: `linux/amd64`
- **Rationale**: The Unraid production server is x86-64. The image must be built for `linux/amd64` to run correctly.
- **Search range**: linux/amd64 (correct), linux/arm64 (wrong for Unraid; the default output of an arm64 runner)
- **Sensitivity**: high; the wrong architecture makes the container fail at startup or run under emulation
- **Source**: Constraint BC1; Unraid hardware spec (UM790 Pro, x86-64)
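
The link between a runner's architecture and its default `docker build` platform can be made explicit. A sketch under stated assumptions: it takes the normalized names returned by `get_runner_architecture()` in `cache_key_strategy.py` (amd64, arm64, arm, 386) and assumes a runner's default build platform is its native `linux/<arch>`; the `linux/arm/v7` variant and the function name are illustrative.

```python
# Assumed native Docker platform for each normalized runner architecture.
NATIVE_PLATFORM = {
    "amd64": "linux/amd64",
    "arm64": "linux/arm64",
    "arm": "linux/arm/v7",
    "386": "linux/386",
}

TARGET_PLATFORM = "linux/amd64"  # Unraid production server (BC1)


def needs_explicit_platform(runner_arch: str, target: str = TARGET_PLATFORM) -> bool:
    """True if this runner's default build would not produce `target`.

    Such runners must pass `--platform <target>` to docker build (relying on
    emulation or cross-building) or be excluded from the job via runs-on.
    """
    native = NATIVE_PLATFORM.get(runner_arch)
    if native is None:
        raise ValueError(f"unknown runner architecture: {runner_arch!r}")
    return native != target
```

For this pipeline the outcome mirrors the `runs-on: unraid` decision: only an amd64 runner produces the target image without extra flags.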
@@ -0,0 +1,47 @@
# Runner Configuration Parameters

These are the act-runner config.yml parameters discovered and tuned during the ss14-cicd pipeline
stabilization. "Training" here refers to the CI/CD runner configuration, analogous to training
hyperparameters in ML contexts.

## capacity (concurrent job limit)
- **Value**: 2
- **Rationale**: Empirically derived from the OOM crash sequence on the macOS ARM64 OrbStack runner. dotnet builds are memory-intensive, and OrbStack has no swap. At capacity=3, OOM kills occurred; at capacity=2, operation was stable.
- **Search range**: Tested: 6, 4, 3, 2. Only 2 was stable for this workload.
- **Sensitivity**: high; exceeding 2 causes OOM on OrbStack; may be increased on Linux runners with swap
- **Source**: Session log 2025-12-19

## shutdown_timeout
- **Value**: 30m
- **Rationale**: Prevents zombie containers from accumulating when jobs are cancelled or time out mid-step. Without this, Docker containers from failed jobs remain running indefinitely.
- **Search range**: Not explicitly searched; 30m selected as the upper bound for expected build duration
- **Sensitivity**: medium; too low kills legitimate builds, too high allows zombie accumulation
- **Source**: Session logs 2025-12-15 (added to external runner), 2025-12-19 (applied to Mac runner)

## cache.enabled (local file cache)
- **Value**: true (local file cache)
- **Rationale**: The native act-cache-server (remote) timed out with ETIMEDOUT on port 39913. Local file cache bypasses the HTTP protocol entirely, providing reliable cache hits.
- **Search range**: Tested: act-cache-server (remote, ETIMEDOUT), local file cache (stable)
- **Sensitivity**: medium; the remote cache could work with correct network config, but local file cache is simpler
- **Source**: Session logs 2025-12-15 (ETIMEDOUT failure), 2025-12-19 (switched to local)

## container.network
- **Value**: host (partially tested, not finalized)
- **Rationale**: Bridge network mode causes DNS resolution failure for git.wylab.me. Host network mode gave 1/6 job success during testing (inconsistent). Recommended fix: keep bridge networking and pass `--dns 172.17.0.1` to job containers through the runner's `container.options` instead (untested; confirm the runner honors this flag there).
- **Search range**: Tested: bridge (default, failed), host (1/6 success)
- **Sensitivity**: high; a wrong setting causes pipeline non-triggering or DNS failures
- **Source**: Session log 2025-12-14

## labels
- **Value**: `[self-hosted, linux, unraid]` (recommended for Unraid runner)
- **Rationale**: Unique architecture-specific labels enable runner label pinning in workflow YAML (`runs-on: unraid`). This prevents jobs from being dispatched to arm64 runners.
- **Search range**: Not explicitly explored; any unique label works
- **Sensitivity**: high; labels must be unique per architecture to enable effective pinning
- **Source**: Derived from C05 (architecture pinning solution)
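
The parameters above can be linted before a config change is rolled out. This is a sketch rather than a tool from the project: it assumes `config.yml` has already been parsed into a dict (for example with PyYAML) and that the keys sit where act-runner's generated config places them (`runner.capacity`, `runner.shutdown_timeout`, `runner.labels`, `cache.enabled`); verify this against the runner's own file. The duration parser handles only `h`, `m` and `s` units, and the function names are hypothetical.

```python
import re
from typing import Any

_DURATION_PART = re.compile(r"(\d+)([hms])")
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> int:
    """Convert a Go-style duration such as '30m' or '1h30m' into seconds."""
    text = text.strip()
    parts = _DURATION_PART.findall(text)
    # Reject anything that is not purely <number><unit> pairs, e.g. '30 minutes'.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def lint_runner_config(cfg: dict[str, Any], *, orbstack: bool, pin_label: str = "unraid") -> list[str]:
    """Return human-readable problems found in a parsed act-runner config dict."""
    problems: list[str] = []
    runner = cfg.get("runner") or {}

    capacity = int(runner.get("capacity", 1))
    if orbstack and capacity > 2:
        problems.append(f"capacity={capacity} > 2 on OrbStack risks OOM (BC3, H03)")

    try:
        timeout = parse_duration(str(runner.get("shutdown_timeout", "0s")))
    except ValueError:
        timeout = 0
    if timeout == 0:
        problems.append("shutdown_timeout missing, zero or unparsable (BC4, H02)")

    if not (cfg.get("cache") or {}).get("enabled", False):
        problems.append("cache.enabled is not true; every build will be cold (H04)")

    # A label may carry a suffix such as 'unraid:docker://<image>'; compare names only.
    names = {str(label).split(":", 1)[0] for label in runner.get("labels") or []}
    if pin_label not in names:
        problems.append(f"label {pin_label!r} missing; runs-on: {pin_label} will not match")

    return problems
```

Passing `orbstack=False` for the Unraid runner relaxes only the capacity rule, matching the note that the limit of 2 was measured on OrbStack.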

## runner registration token
- **Value**: `<redacted>` (Mac runner, 2025-12-18; historical, likely rotated)
- **Rationale**: Single-use registration token from the Gitea admin UI. Required once per runner instance.
- **Search range**: N/A (generated by Gitea)
- **Sensitivity**: low for tuning, since the token is only used at registration time and not during builds; still treat it as a secret, because anyone holding a valid token can register a runner and receive jobs
- **Source**: Session log 2025-12-18

@@ -0,0 +1,52 @@
# Environment

## Build Stack

- **Runtime**: .NET SDK (C#; exact version matches upstream space-wizards/space-station-14; likely .NET 7.x or 8.x at the time of this project)
- **Framework**: Not applicable (build pipeline, not ML framework)
- **Hardware**:
  - Deployment target: Unraid server, UM790 Pro, 32GB RAM, x86-64 (amd64)
  - Runner option A (preferred): Unraid host act-runner, x86-64, direct Docker access
  - Runner option B (abandoned): External VPS 45.137.68.83, x86-64, Contabo Düsseldorf
  - Runner option C (problematic): macOS Apple Silicon, ARM64, OrbStack VM, no swap

## Key Dependencies

| Tool | Version | Notes |
|------|---------|-------|
| act-runner | Latest at 2025-12 | Gitea's official runner; Container executor mode |
| Docker | Host version | Required by act-runner for job container spawning |
| .NET SDK | Matches SS14 fork | C# build toolchain; high memory footprint |
| Node.js | Not recorded | Required for some workflow actions; npm also required |
| pip/Python | Not recorded | Required for some workflow steps; not the primary build tool |
| OrbStack | Latest at 2025-12 | macOS Docker provider (ARM64 runner); replaced Colima |
| Gitea | Not recorded | Self-hosted at git.wylab.me; workflow dispatcher acting as CI server |
| Traefik | Not recorded | Self-hosted on Unraid; TLS termination and reverse proxy for git.wylab.me |
| Technitium DNS | Not recorded | Runs at 192.168.1.50; internal DNS resolver for *.wylab.me hostnames |

## Configuration Files

| File | Path | Purpose |
|------|------|---------|
| act-runner config | `config.yml` on runner host | capacity, shutdown_timeout, labels, cache mode |
| Gitea workflow | `.gitea/workflows/build.yaml` | Job definition: runs-on, cache key, build steps |
| Docker daemon config | `/etc/docker/daemon.json` on Unraid | DNS settings: `{"dns": ["172.17.0.1"]}` |

## Network Topology

| Host | IP | Role |
|------|----|------|
| Unraid server | 192.168.1.50 | Gitea container, Docker daemon, deployment target |
| Technitium DNS | 192.168.1.50 | Internal DNS resolver (same host, different container) |
| Docker bridge gateway | 172.17.0.1 | DNS forwarding point for job containers |
| External VPS (abandoned) | 45.137.68.83 | Former runner host; later used as CDN mirror |
| Developer MacBook | LAN | Git push origin; former ARM64 runner host |

## Random Seeds
- Not applicable (deterministic build pipeline)

## Known Environment Issues

1. **OrbStack no-swap**: macOS manages memory; the OrbStack VM has no swap partition, so OOM kills are abrupt.
2. **Docker DNS default**: New Unraid Docker installations may default to `{"dns": ["8.8.8.8"]}`, which cannot resolve internal hostnames. Override with `172.17.0.1` in `daemon.json`.
3. **act-runner cache/act/ directory**: Node.js modules cached in `/opt/gitea-runner/.cache/act/` on the VPS runner became corrupted. Manual deletion was required to recover.
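
The `daemon.json` override from issue 2 can be applied without clobbering other daemon settings. A minimal sketch, assuming the file holds a single JSON object; the function name is illustrative. The new `dns` value takes effect only for containers created after the Docker daemon is restarted.

```python
import json


def with_internal_dns(daemon_json: str, resolvers: tuple[str, ...] = ("172.17.0.1",)) -> str:
    """Return daemon.json text whose `dns` list is replaced by `resolvers`.

    All other keys are preserved; an empty file is treated as an empty object.
    """
    config = json.loads(daemon_json) if daemon_json.strip() else {}
    if not isinstance(config, dict):
        raise ValueError("daemon.json must contain a JSON object")
    config["dns"] = list(resolvers)
    return json.dumps(config, indent=2, sort_keys=True) + "\n"
```

Replacing rather than appending follows the "override" wording above: leaving 8.8.8.8 in the list would let some lookups bypass the internal resolver.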
@@ -0,0 +1,188 @@
"""
cache_key_strategy.py

Architecture-safe cache key generation for Gitea Actions / act-runner builds.

This module implements the core logic for generating .NET build cache keys that are
guaranteed to be architecture-specific, preventing cross-architecture cache pollution
(the silent failure mode described in C01).

The key insight: cache keys must be injective in runner architecture. Two runners
with different architectures but the same project file hash MUST produce different keys.

Supports both:
- Architecture-tagged keys: embed runner.arch in the key (recommended)
- Runner label pinning: restrict jobs to one architecture (belt-and-suspenders)
"""

import hashlib
import platform
from pathlib import Path
from typing import Iterable, Optional


def get_runner_architecture() -> str:
    """
    Return the current machine's CPU architecture, normalized to Go/Docker-style names.

    Returns:
        str: One of "amd64", "arm64", "arm", "386", or the lowercased
        platform.machine() value if it is not in the mapping.

    Note:
        In Gitea Actions YAML, the runner's architecture is available as `${{ runner.arch }}`.
        GitHub Actions documents that context as X86, X64, ARM or ARM64, and act-based
        runners may report the same style, so a key computed here is not guaranteed to be
        identical to one rendered in YAML. Use this function for local scripts or pre-step
        checks, and build any given key with one method consistently.
    """
    machine = platform.machine().lower()
    arch_map = {
        "x86_64": "amd64",
        "amd64": "amd64",
        "aarch64": "arm64",
        "arm64": "arm64",
        "armv7l": "arm",
        "i386": "386",
        "i686": "386",
    }
    return arch_map.get(machine, machine)


def hash_project_files(
    root_dir: str,
    patterns: Iterable[str] = ("**/*.csproj", "**/global.json", "**/packages.lock.json"),
) -> str:
    """
    Compute a deterministic hash of all project dependency files matching the given patterns.

    Analogous to the Gitea Actions expression `${{ hashFiles('**/*.csproj') }}`, but not
    byte-identical: this version also covers global.json and packages.lock.json by default,
    mixes relative paths into the digest, and truncates the result.

    Args:
        root_dir: Root directory to search for project files (typically repo root).
        patterns: Glob patterns for files to include in hash.
            Default: *.csproj + global.json + packages.lock.json

    Returns:
        str: First 16 hex chars of a SHA-256 digest over all matching files' relative
        paths and contents, sorted by path for determinism.
    """
    root = Path(root_dir)
    matched_files: list[Path] = []

    for pattern in patterns:
        matched_files.extend(root.glob(pattern))

    # Sort for determinism: the same set of files must always produce the same hash
    matched_files = sorted(set(matched_files))

    hasher = hashlib.sha256()
    for filepath in matched_files:
        if filepath.is_file():
            # Include the relative path to detect file renames; NUL separators keep the
            # boundary between path and content unambiguous
            hasher.update(filepath.relative_to(root).as_posix().encode())
            hasher.update(b"\0")
            hasher.update(filepath.read_bytes())
            hasher.update(b"\0")

    return hasher.hexdigest()[:16]  # 16 hex chars = 64 bits, sufficient for cache keys


def make_dotnet_cache_key(
    root_dir: str,
    arch: Optional[str] = None,
    prefix: str = "dotnet",
) -> str:
    """
    Generate an architecture-safe .NET NuGet package cache key.

    This is the CORE FIX for C01 (silent cross-architecture cache corruption).
    The returned key is injective in architecture: different archs always produce different keys.

    Args:
        root_dir: Repository root directory for project file discovery.
        arch: Runner architecture. If None, auto-detected from the current system.
            In Gitea Actions YAML, use: ${{ runner.arch }}-${{ hashFiles('**/*.csproj') }}
        prefix: Cache key prefix (default: "dotnet").

    Returns:
        str: Cache key in format "{prefix}-{arch}-{project_hash}",
        e.g. "dotnet-amd64-" followed by 16 hex characters.

    Raises:
        ValueError: If arch is empty or None after auto-detection.

    Example:
        >>> key = make_dotnet_cache_key("/repo", arch="amd64")
        >>> key.startswith("dotnet-amd64-")
        True
        >>> # In Gitea Actions YAML (preferred, uses the native expression):
        >>> # key: dotnet-${{ runner.arch }}-${{ hashFiles('**/*.csproj') }}
    """
    resolved_arch = arch or get_runner_architecture()
    if not resolved_arch:
        raise ValueError("Could not determine runner architecture for cache key generation")

    project_hash = hash_project_files(root_dir)
    return f"{prefix}-{resolved_arch}-{project_hash}"


# Architecture names that may appear as a "-"-separated token in a rendered cache key:
# Go/Docker style (amd64, arm64, ...), uname style (x86_64, aarch64), and the
# GitHub Actions runner.arch style (X64, ARM64, ...).
KNOWN_ARCH_TOKENS = frozenset({
    "amd64", "arm64", "arm", "386", "x86_64", "aarch64",
    "X64", "X86", "ARM", "ARM64",
})


def validate_cache_key_architecture_safety(key: str) -> dict:
    """
    Validate that a cache key string includes an architecture component.

    A "safe" cache key either references the `runner.arch` expression (unrendered
    template) or contains a known architecture name as a whole "-"-separated token.
    Token matching avoids false positives such as "386" occurring inside a hex hash.
    This is a static check: it cannot tell whether a template will render to a
    non-empty architecture at run time.

    Args:
        key: Cache key string or template to validate.

    Returns:
        dict with keys:
        - safe (bool): True if architecture is present in the key
        - detected_arch (str | None): The architecture string found, if any
        - warning (str | None): Human-readable warning if not safe

    Example:
        >>> validate_cache_key_architecture_safety("dotnet-amd64-abc123")["safe"]
        True
        >>> validate_cache_key_architecture_safety("dotnet-abc123")["safe"]
        False
    """
    if "runner.arch" in key:
        return {"safe": True, "detected_arch": "runner.arch", "warning": None}

    for token in key.split("-"):
        if token in KNOWN_ARCH_TOKENS:
            return {"safe": True, "detected_arch": token, "warning": None}

    return {
        "safe": False,
        "detected_arch": None,
        "warning": (
            f"Cache key '{key}' does not contain an architecture component. "
            "This will cause cross-architecture cache pollution when both arm64 and amd64 "
            "runners share a cache backend. Add '${{ runner.arch }}-' to the key. "
            "See: C01, H01 in ss14-cicd ARA."
        ),
    }


# --- Gitea Actions YAML snippets (for reference, not executed) ---

RECOMMENDED_CACHE_STEP = """
# Recommended cache step for wylab-station-14 build workflow
# Paste into .gitea/workflows/build.yaml

- name: Cache .NET NuGet packages
  uses: actions/cache@v3
  with:
    path: ~/.nuget/packages
    # Architecture-safe key: arm64 and amd64 never share cache entries (C01 fix)
    key: dotnet-${{ runner.arch }}-${{ hashFiles('**/*.csproj', '**/global.json') }}
    restore-keys: |
      dotnet-${{ runner.arch }}-
"""

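# Illustrative sketch (hypothetical helper): a simplified emulation of the restore
# order documented for actions/cache, i.e. an exact match on the primary key first,
# then, for each restore-key prefix in turn, the most recently saved entry whose key
# starts with that prefix. We assume, without having verified it, that the
# Gitea/act cache backend follows the same rules. It shows why the restore-key
# prefix also carries the architecture: a bare "dotnet-" prefix can fall back to
# an entry saved by the other architecture.
def _demo_restore_lookup(entries, primary_key, restore_keys):
    """entries: iterable of (cache_key, saved_at) pairs; returns the restored key or None."""
    saved = dict(entries)
    if primary_key in saved:
        return primary_key
    for prefix in restore_keys:
        candidates = [k for k in saved if k.startswith(prefix)]
        if candidates:
            # Newest matching entry wins, as with actions/cache partial matches.
            return max(candidates, key=lambda k: saved[k])
    return None


# Usage (keys and timestamps are made up):
#   entries = [("dotnet-amd64-aaa", 1), ("dotnet-arm64-bbb", 2)]
#   _demo_restore_lookup(entries, "dotnet-amd64-zzz", ["dotnet-amd64-"])  -> "dotnet-amd64-aaa"
#   _demo_restore_lookup(entries, "dotnet-amd64-zzz", ["dotnet-"])        -> "dotnet-arm64-bbb"
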
|
||||
RECOMMENDED_RUNNER_LABEL_PIN = """
|
||||
# Recommended runner label pin for wylab-station-14 build workflow
|
||||
# Ensures ALL builds run on x86-64 Unraid runner (C05 fix)
|
||||
|
||||
jobs:
|
||||
build:
|
||||
runs-on: unraid # matches runner with labels: [unraid, self-hosted, linux]
|
||||
"""
|
||||
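
# Illustrative sketch (hypothetical helper): lists which registered runners could
# pick up a job, assuming a job matches a runner only when every `runs-on` label is
# among the runner's labels (the rule GitHub documents for self-hosted runners; we
# assume Gitea behaves the same for this case). Only the Unraid labels come from
# the snippet above; the Mac runner name and labels are invented for the example.
def _demo_eligible_runners(runs_on, runners):
    """runs_on: a label or list of labels; runners: mapping of runner name -> labels."""
    wanted = {runs_on} if isinstance(runs_on, str) else set(runs_on)
    return sorted(name for name, labels in runners.items() if wanted <= set(labels))


# Usage:
#   runners = {"unraid": ["unraid", "self-hosted", "linux"],
#              "mac-orbstack": ["macos", "arm64", "self-hosted"]}
#   _demo_eligible_runners("unraid", runners)       -> ["unraid"]
#   _demo_eligible_runners("self-hosted", runners)  -> ["mac-orbstack", "unraid"]
# A shared generic label lets either architecture take the build, which is the
# mixed-architecture situation the unraid pin avoids.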
@@ -0,0 +1,157 @@
# Exploration Tree — ss14-cicd
# Research DAG: nested tree with cross-edges (also_depends_on) forming a DAG.
# Source: Session logs 2025-12-14 through 2025-12-19 (HISTORY.md)
# Node types: question | experiment | dead_end | decision | pivot

tree:
  - id: N01
    type: question
    support_level: explicit
    source_refs: ["HISTORY.md: 2025-12-14"]
    title: "How to build wylab-station-14 Docker image automatically on commit?"
    description: >
      The wylab-station-14 Space Station 14 game server needs automated builds
      via Gitea Actions on git.wylab.me. Commits to the repo should trigger a
      pipeline that builds the Docker image and makes it available for deployment
      on the Unraid server.
    children:

      - id: N02
        type: experiment
        support_level: explicit
        source_refs: ["HISTORY.md: 2025-12-14"]
        title: "Set up act-runner on Unraid (container mode, bridge network)"
        result: >
          Pipeline did not trigger on commits. The runner registered successfully, but
          job containers could not resolve git.wylab.me: DNS resolution failed inside
          the runner's job containers. Host networking only partially worked:
          1 of 6 jobs succeeded, inconsistently.
        evidence: ["C03", "HISTORY.md: 2025-12-14"]
        children:

          - id: N03
            type: dead_end
            support_level: explicit
            source_refs: ["HISTORY.md: 2025-12-14"]
            title: "Adding 1.1.1.1 DNS to runner container"
            hypothesis: "External DNS resolver will allow git.wylab.me resolution from job containers"
            failure_mode: >
              1.1.1.1 cannot resolve private internal hostnames. git.wylab.me is only
              resolvable via Technitium DNS at 192.168.1.50; public DNS has no record for it.
            lesson: >
              DNS config must target job containers specifically (not the runner process container),
              AND must point to an internal resolver that knows git.wylab.me.

          - id: N04
            type: dead_end
            support_level: explicit
            source_refs: ["HISTORY.md: 2025-12-14"]
            title: "Host networking mode on Unraid runner"
            hypothesis: "Host network mode gives job containers access to host DNS resolver"
            failure_mode: >
              Inconsistent: only 1 of 6 job containers successfully resolved hostnames.
              The change was reverted after multiple attempts; the exact root cause of
              the inconsistency was not determined.
            lesson: >
              Host networking is not a stable fix. The correct fix is to configure
              container.dns in runner config.yml to point at the Docker bridge gateway
              (172.17.0.1), which forwards to Technitium.

          - id: N05
            type: decision
            support_level: explicit
            source_refs: ["HISTORY.md: 2025-12-15"]
            title: "Move to external VPS runner to bypass Unraid networking constraints"
            choice: "Deploy act-runner on external VPS (45.137.68.83) with direct internet access"
            alternatives:
              - "Continue debugging Unraid container runner DNS"
              - "Use host networking on Unraid runner (unstable)"
              - "Switch to a different CI system (Drone CI, Jenkins)"
            evidence: "1/6 success rate on Unraid was deemed insufficient; external VPS has direct DNS access"
            children:

              - id: N06
                type: experiment
                support_level: explicit
                source_refs: ["HISTORY.md: 2025-12-15"]
                title: "External VPS runner (45.137.68.83) with native Gitea cache"
                result: >
                  Runner started but had persistent Node.js module errors ("Cannot find module")
                  in /opt/gitea-runner/.cache/act/, requiring SSH debugging sessions.
                  The native Gitea cache server (act-cache-server) timed out with ETIMEDOUT on
                  45.137.68.83:39913. The .NET cache step took 5 minutes versus 5 seconds for
                  other steps, indicating a full cache miss on every build.
                evidence: ["C02", "C03", "HISTORY.md: 2025-12-15"]
                children:

                  - id: N07
                    type: dead_end
                    support_level: explicit
                    source_refs: ["HISTORY.md: 2025-12-15"]
                    title: "Native Gitea act-cache-server on external VPS"
                    hypothesis: "Native Gitea cache server will provide fast cache hits for .NET packages"
                    failure_mode: >
                      ETIMEDOUT connecting to 45.137.68.83:39913 from inside job containers.
                      Every build was a full cold build (5 min cache step vs 5 sec).
                      Likely cause: firewall or Docker bridge network blocking port 39913.
                    lesson: >
                      Native act-cache-server requires port 39913 to be reachable from job containers.
                      A local file cache (volume mount) is more reliable because it bypasses the
                      HTTP cache protocol entirely.

                  - id: N08
                    type: decision
                    support_level: explicit
                    source_refs: ["HISTORY.md: 2025-12-18"]
                    title: "Add macOS ARM64 runner (OrbStack) as additional runner"
                    choice: "Register Mac ARM64 OrbStack runner to supplement or replace external VPS runner"
                    alternatives:
                      - "Continue debugging external VPS runner"
                      - "Return to Unraid container runner with DNS fix"
                      - "Replace Gitea Actions with alternative CI (Drone, Jenkins, GitHub Actions)"
                    evidence: "External VPS runner crashing under load; developer Mac available with Docker via OrbStack"
                    children:

                      - id: N09
                        type: experiment
                        support_level: explicit
                        source_refs: ["HISTORY.md: 2025-12-18", "HISTORY.md: 2025-12-19"]
                        title: "macOS ARM64 OrbStack runner — capacity and cache tuning"
                        result: >
                          Runner worked initially. OOM crashes under concurrent dotnet builds forced
                          capacity reduction: 6 → 4 → 3 → 2 concurrent jobs. Local file cache
                          configured (stable). shutdown_timeout: 30m added to prevent zombie containers.
                          However, mixing architectures with the x86-64 Unraid runner caused cache
                          corruption: arm64 cache entries were consumed by x86-64 jobs, silently
                          producing wrong-arch artifacts. The runner also kept crashing under load.
                          Unresolved as of 2025-12-19.
                        evidence: ["C01", "C04", "HISTORY.md: 2025-12-19"]
                        children:

                          - id: N10
                            type: dead_end
                            support_level: explicit
                            source_refs: ["HISTORY.md: 2025-12-18"]
                            title: "Running two act-runner instances on same Mac host"
                            hypothesis: "Second runner instance increases parallel capacity"
                            failure_mode: >
                              Registering a second runner immediately broke the existing first runner
                              due to port/socket conflicts between instances. Runner 2 had to be
                              deleted and the change reverted.
                            lesson: >
                              Multiple act-runner instances on the same host require distinct ports,
                              work directories, and config paths. Simpler: one runner per host,
                              tune capacity instead.

                          - id: N11
                            type: decision
                            support_level: inferred
                            title: "Architecture-tagged cache keys as fix for silent cache corruption"
                            choice: >
                              Encode runner.arch in cache keys: 'dotnet-${{ runner.arch }}-${{ hashFiles(...) }}'
                              OR pin all builds to Unraid runner via runs-on label
                            alternatives:
                              - "Continue with architecture-agnostic keys (known failure mode)"
                              - "Abandon Mac runner entirely, return to Unraid-only runner"
                              - "Use separate cache backends per runner (NFS mount isolation)"
                            evidence: >
                              Root cause analysis: arm64 and amd64 runners with identical
                              project file hashes produce identical cache keys but incompatible artifacts.
                              With the architecture in the key, entries built on different architectures
                              can never share a key, so no cross-architecture collision is possible.