→ in YAML; live test finding from WYL-72 round 1
Judge-Gemini (on github-copilot/gemini-3.1-pro-preview) emitted reports with \` (backslash-backtick) inside double-quoted YAML strings, imitating markdown escaping (e.g. evidence: "see \`foo.py\`"). \` is not a valid YAML escape sequence, so yaml.safe_load rejected the entire report. Judge-GPT did not make this mistake, so the consensus degenerated from 3-to-2 parseable reports to 1-to-2 (and therefore produced a spurious single-judge convergence).

The fix is a targeted cleanup in _extract_yaml: replace \` with a literal ` before parsing. No other interpretation of \` exists in YAML, so this does not mask semantics.

Tests: 71 passed (added 2 for the new cleanup path).

Separately surfaced (not yet addressed in this commit):
- Judge-Claude on github-copilot/claude-sonnet-4.6 and github-copilot/claude-opus-4.5 returned model_not_supported. Per GitHub Copilot Student plan docs (2026-03-13 update), Claude Opus/Sonnet are no longer student-selectable. Tower-Copilot-Claude now runs claude-haiku-4.5 (which is student-accessible).
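The cleanup described above can be sketched as follows. This is an illustration, not the code in _extract_yaml: the function name unescape_backticks is hypothetical, and the odd-backslash check is an extra precaution not mentioned in the commit (it leaves an already-escaped backslash followed by a backtick untouched, since that sequence is valid YAML).

```python
import re

# A backtick preceded by an odd number of backslashes, i.e. a real "\`" escape.
# An even run such as "\\`" is an escaped backslash plus a plain backtick and is valid.
_ESCAPED_BACKTICK = re.compile(r"(?<!\\)((?:\\\\)*)\\`")


def unescape_backticks(yaml_text: str) -> str:
    """Replace the invalid escape \\` with a literal backtick before YAML parsing."""
    # A function replacement keeps any preceding backslash pairs verbatim.
    return _ESCAPED_BACKTICK.sub(lambda m: m.group(1) + "`", yaml_text)
```

The cleaned text would then be handed to yaml.safe_load as before; the sketch itself only uses the standard library.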
coordinator
Tool-MAD middle-management layer for multica.
Why this exists
Multica's agents produce work and self-set issue status to in_review. Nothing in multica ever looks at that state — it's a dead end. Reviewers, when mentioned manually, tend to catch surface issues (length, structure) and miss semantic ones (scope drift, fabricated experiments, answering the wrong question).
This daemon is the missing middle-management layer. When an issue transitions to in_review, it:
- Convenes a debate round: posts one comment on the issue mentioning a fixed set of debaters, each given a role-specific evidence-gathering prompt (e.g. Senior Developer greps the committed code for LLM API calls; Code Reviewer diffs the described method against the source; Project Manager Senior checks scope satisfaction against the original description).
- Waits for every debater to reply (up to a timeout).
- Posts a second comment mentioning the judge (Reality Checker), with the assembled debater transcript + the original issue body, and a hardcoded decision rule.
- Parses the judge's structured verdict (VERDICT: ACCEPT or VERDICT: REJECT\n- R<n>: <failure>) and acts:
  - ACCEPT → PUT status done, post acceptance summary
  - REJECT → PUT status in_progress, post rejection listing every failure, re-trigger the original assignee
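The last step, parsing the verdict and choosing the next status, might look roughly like this. It is a minimal sketch under assumptions: the names Verdict, parse_verdict and next_status are hypothetical, and the real parser (tracked as WYL-46) may accept a looser format.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

_VERDICT_RE = re.compile(r"^VERDICT:\s*(ACCEPT|REJECT)\s*$", re.MULTILINE)
_FAILURE_RE = re.compile(r"^-\s*R\d+:\s*(.+?)\s*$", re.MULTILINE)


@dataclass
class Verdict:
    decision: str                      # "ACCEPT" or "REJECT"
    failures: List[str] = field(default_factory=list)


def parse_verdict(comment: str) -> Optional[Verdict]:
    """Return the judge's verdict, or None if no VERDICT line is present."""
    match = _VERDICT_RE.search(comment)
    if match is None:
        return None
    decision = match.group(1)
    failures: List[str] = []
    if decision == "REJECT":
        # Only "- R<n>: <failure>" lines after the verdict line count.
        failures = _FAILURE_RE.findall(comment[match.end():])
    return Verdict(decision, failures)


def next_status(verdict: Verdict) -> str:
    """Map a verdict to the issue status the coordinator would PUT."""
    return "done" if verdict.decision == "ACCEPT" else "in_progress"
```

Returning None for an unparseable comment lets the caller decide whether to retry the judge or leave the issue in in_review, rather than guessing a verdict.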
The pattern is Tool-MAD (Multi-Agent Debate with heterogeneous tool augmentation). Reference: arxiv 2601.04742.
Why debaters catch what a single judge misses
Each debater runs on a different agent (different runtime, different tool access, different role prompt) and is forced to ground its argument in a specific tool's output — not in its own recollection of the text. Example: if the question is "does this paper describe real LLM agent experiments", Senior Developer's grep for anthropic|openai|ollama|model= in the committed code either returns hits or it doesn't — that's a hard fact, not an opinion. The judge reads 4 grounded arguments and applies the decision rule "any debater reporting evidence of scope drift ⇒ REJECT."
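As an illustration of the "hard fact" idea, the grep and the decision rule can be sketched in a few lines. The function names are hypothetical, file contents are passed in as a dict to keep the example self-contained, and returning ACCEPT when no debater flags scope drift is an assumption (the rule above only states the REJECT case).

```python
import re
from typing import Dict, List, Tuple

# Same pattern as the Senior Developer's grep in the example above.
LLM_CALL_RE = re.compile(r"anthropic|openai|ollama|model=")


def grep_llm_calls(files: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Return (path, line number, line) for every line matching the pattern."""
    hits = []
    for path, text in sorted(files.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if LLM_CALL_RE.search(line):
                hits.append((path, lineno, line.strip()))
    return hits


def judge_decision(scope_drift_flags: List[bool]) -> str:
    """Any debater reporting evidence of scope drift means REJECT."""
    return "REJECT" if any(scope_drift_flags) else "ACCEPT"
```

An empty hit list for a paper that claims LLM agent experiments is the kind of evidence a debater would report as scope drift.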
A naive LLM-as-a-judge reviewer reads the paper and scores it on surface dimensions. An agent-as-judge driven by debaters catches the underlying substitution. See WYL-41 for the live failure case that motivated this build.
Run
pip install -e .
mkdir -p ~/.coordinator
cat > ~/.coordinator/env <<'EOF'
COORDINATOR_SERVER_URL=http://localhost:8089
COORDINATOR_WORKSPACE_ID=<wid>
COORDINATOR_TOKEN=<coordinator-member-pat>
EOF
chmod 600 ~/.coordinator/env
coordinator
Logs go to stderr and to ~/.coordinator/coordinator.log. State file is ~/.coordinator/seen.json.
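For reference, an env file in the KEY=VALUE form shown above could be read along these lines. This is only a sketch: how the coordinator actually loads its settings is not documented here, and parse_env and REQUIRED are hypothetical names.

```python
from typing import Dict

# The three settings written to ~/.coordinator/env in the Run section.
REQUIRED = (
    "COORDINATOR_SERVER_URL",
    "COORDINATOR_WORKSPACE_ID",
    "COORDINATOR_TOKEN",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    missing = [name for name in REQUIRED if name not in values]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    return values
```

Failing fast on a missing setting gives a clearer error at startup than a failed request against the server later.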
Status
- WYL-42 skeleton + watcher (this commit)
- WYL-43 dedicated admin PAT
- WYL-44 hook watcher to real round trigger
- WYL-45 debate round orchestration
- WYL-46 verdict parser + action executor
- WYL-47 dry-run against WYL-41