Live round on WYL-72 exposed that my debate-round comment produced
sycophancy, not debate. Judge-Gemini moved from a 4.0 ACCEPT stance to
"I agree my initial score was too lenient... I have adjusted my scores
to 3s and 2s" after reading Judge-GPT's 2.92 report — without defending
its original scores or challenging GPT's evidence. Classic social-pressure
convergence.
Root cause: my debate comment said "You may hold your position if you
have new evidence; you may move if you find the other reasoning more
grounded. Do not split the difference to compromise." That phrasing
is weaker than CEK's intent, and it dropped every structural
anti-sycophancy instruction CEK spelled out in judge-with-debate/SKILL.md:
Missing: "Identify disagreements (where your scores differ by >1 point)"
Missing: "Defend your position with evidence from the specification"
Missing: "Challenge the other judge's position with counter-evidence"
Missing: "Only revise if you find their evidence compelling"
Missing: "Defend your original scores if you still believe them"
Also: I asked judges to post a REVISED report (implicitly retracting
their prior position). CEK asks them to APPEND a debate round section
to their prior report, keeping both visible so the revision is a change
ON TOP OF the original rather than a replacement.
Fixed by porting CEK's instruction block verbatim into _build_debate_round_comment.
Added a regression test that fails if any future edit removes these exact
clauses.
Tests: 72 passed (+1 regression test).
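A minimal sketch of what such a regression test could look like. The five
clauses are the ones quoted above; build_debate_round_comment here is a
hypothetical stand-in for _build_debate_round_comment, whose real signature
and surrounding text are not shown in this message.

```python
# Clauses quoted in the commit message; the real CEK block may contain more.
REQUIRED_CLAUSES = (
    "Identify disagreements (where your scores differ by >1 point)",
    "Defend your position with evidence from the specification",
    "Challenge the other judge's position with counter-evidence",
    "Only revise if you find their evidence compelling",
    "Defend your original scores if you still believe them",
)


def build_debate_round_comment(other_reports: str) -> str:
    """Hypothetical builder: instruction block followed by the other reports."""
    instructions = "\n".join(f"- {clause}" for clause in REQUIRED_CLAUSES)
    return (
        "Debate round instructions:\n"
        f"{instructions}\n\n"
        f"Other judges' reports:\n{other_reports}"
    )


def missing_clauses(comment: str) -> list[str]:
    """Return every required clause that is absent from the comment text."""
    return [clause for clause in REQUIRED_CLAUSES if clause not in comment]


def test_debate_comment_keeps_required_clauses():
    comment = build_debate_round_comment("(reports go here)")
    assert missing_clauses(comment) == []
```

Matching on exact substrings is deliberate in this sketch: a paraphrase that
softens a clause would fail the test just like a deletion would.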
Judge-Gemini (on github-copilot/gemini-3.1-pro-preview) emitted reports
with \` (backslash-backtick) inside double-quoted YAML strings,
imitating markdown escaping (e.g. evidence: "see \`foo.py\`"). \`
is not a valid YAML escape sequence, so yaml.safe_load rejected the entire
report. Judge-GPT did not make this mistake, so the consensus degenerated
from 3-to-2 parseable reports to 1-to-2 (and therefore produced a spurious
single-judge convergence).
The fix is a targeted cleanup in _extract_yaml: replace \` with a literal
` before parsing. \` has no valid interpretation inside a double-quoted
YAML string, so this does not mask semantics there.
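The cleanup step can be sketched as a plain string replacement. The helper
name strip_backtick_escapes is an assumption for illustration; the real code
lives inside _extract_yaml and passes the result on to yaml.safe_load.

```python
def strip_backtick_escapes(text: str) -> str:
    """Replace backslash-backtick with a bare backtick before YAML parsing.

    Note: this rewrites every occurrence, including any inside plain or
    single-quoted scalars, where the backslash would otherwise be literal.
    """
    return text.replace("\\`", "`")


# A report line as the judge emitted it: markdown-style escaping inside
# a double-quoted YAML string.
raw_report = 'evidence: "see \\`foo.py\\`"'
cleaned = strip_backtick_escapes(raw_report)
print(cleaned)
```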
Tests: 71 passed (added 2 for the new cleanup path).
Separately surfaced (not yet addressed in this commit):
- Judge-Claude on github-copilot/claude-sonnet-4.6 and
github-copilot/claude-opus-4.5 returned model_not_supported. Per GitHub
Copilot Student plan docs (2026-03-13 update), Claude Opus/Sonnet are no
longer student-selectable. Tower-Copilot-Claude now runs
claude-haiku-4.5 (which is student-accessible).
Multica's comment REST API returns content with HTML entities escaped
(`&quot;` for `"`, `&gt;` for `>`, etc.). Agent replies are plain UTF-8,
so we unescape before extracting and parsing.
Caught by the first live test against WYL-72: Meta-Judge produced a
perfectly valid YAML evaluation specification with CK-001..CK-010 checklist
items, but my parser reported it as malformed because `&quot;` is not a
valid YAML token. The model was fine; the plumbing was wrong.
- src/coordinator/orchestrator.py: _extract_yaml now calls html.unescape first
- tests/test_orchestrator.py: +2 tests covering entity decoding
Tests: 69 passed.
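For illustration, the decoding step amounts to one standard-library call
before any YAML handling. decode_comment_body is a hypothetical helper name;
only html.unescape is taken from the change itself.

```python
import html


def decode_comment_body(content: str) -> str:
    """Undo HTML-entity escaping so the YAML parser sees the original text."""
    return html.unescape(content)


# What the REST API might return for a reply containing quotes and '>'.
escaped = "title: &quot;CK-001&quot;\nrule: score &gt; 3\n"
decoded = decode_comment_body(escaped)
print(decoded)
```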
The prior pipeline (4 hand-written debater prompts + 1 judge with my prompt
template) kept missing scope drift because every prompt was mine and the
reviewers were all on the same model tier with correlated priors.
This commit replaces the whole review step with CEK's judge-with-debate
pattern translated to multica-native execution:
pending → awaiting_rubric (meta-judge writes YAML spec from issue alone)
→ awaiting_judges (3 judges on 3 copilot models score independently)
→ consensus check (overall within 0.5, criteria within 1.0)
→ accept or reject OR awaiting_debate rounds up to 3
→ error on malformed YAML or cap hit
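A sketch of the consensus check under one reading of the thresholds: the
spread (max minus min) of overall scores must be at most 0.5, and the spread
of each shared criterion at most 1.0. The report shape (an "overall" float
plus a "criteria" mapping) and the function names are assumptions, not the
coordinator's actual data model.

```python
def spread(values) -> float:
    """Difference between the largest and smallest value."""
    values = list(values)
    return max(values) - min(values)


def has_consensus(reports, overall_tol=0.5, criterion_tol=1.0) -> bool:
    """True when judges agree within both tolerances.

    A single report passes trivially here, so callers that need real
    agreement should check how many reports were parseable first.
    """
    if not reports:
        return False
    if spread(r["overall"] for r in reports) > overall_tol:
        return False
    # Only compare criteria that every judge actually scored.
    shared = set.intersection(*(set(r["criteria"]) for r in reports))
    return all(
        spread(r["criteria"][name] for r in reports) <= criterion_tol
        for name in shared
    )


agreeing = [
    {"overall": 3.5, "criteria": {"correctness": 4, "scope": 3}},
    {"overall": 3.2, "criteria": {"correctness": 3, "scope": 3}},
    {"overall": 3.6, "criteria": {"correctness": 4, "scope": 2}},
]
split = [
    {"overall": 4.0, "criteria": {"correctness": 4}},
    {"overall": 2.92, "criteria": {"correctness": 3}},
]
print(has_consensus(agreeing), has_consensus(split))
```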
Per higher-management direction, we do not accommodate models that cannot
produce valid YAML: a malformed rubric or all-unparseable judge reports fail
the round immediately (no retries, no fallback to hand-written prompts).
The anchor retrigger on REJECT (WYL-51 behaviour) is preserved verbatim.
Agent prompts for meta-judge and the 3 judges come from the CEK agents
themselves (Meta-Judge / Judge-GPT / Judge-Claude / Judge-Gemini) whose
`instructions` field is the CEK meta-judge.md / judge.md files uploaded
byte-for-byte. No prompts are authored in this coordinator's source.
Adds pyyaml dependency.
- src/coordinator/orchestrator.py: rewritten for the new phase machine
- src/coordinator/queue.py: Round extended with rubric_yaml, judge_report_comment_ids, debate_round
- tests/test_orchestrator.py: 40 tests for new pipeline (helpers, parsers, consensus math, phase handlers, race fix, retrigger)
- tests/test_integration.py: removed (tested old debater pipeline)
- pyproject.toml: adds pyyaml
Tests: 67 passed in 0.20s (40 orchestrator + 15 queue + 7 watcher + 5 other).
- multica_client: add _req() helper logging HTTP method, URL, status code,
and response time for every API call; warns with response body on 4xx/5xx
- queue: log load/save errors with file paths; warn on JSON parse failures;
log record count on load
- state: same as queue — log load/save errors and entry count on load
- orchestrator: add exc_info=True to all ERROR-level exception logs for full
tracebacks; upgrade debater/judge near-timeout messages from DEBUG to
WARNING at 80% of the timeout threshold
- __main__: log file paths (queue, seen, log) at startup; add exc_info=True
to poll error; emit periodic heartbeat with active round count every 100 cycles
Each of the 4 debater prompts now includes a second paragraph instructing
the debater to independently judge whether the committed work addresses what
was actually asked, or substituted an easier question. If a scope swap is
detected, debaters are instructed to state "SCOPE SWAP DETECTED" and
recommend REJECT on that basis alone.
Adds _SCOPE_SWAP_CLAUSE constant and unit test asserting "SCOPE SWAP"
appears in every DEBATER_PROMPTS entry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add _NO_MENTION_RULE to all DEBATER_PROMPTS and JUDGE_PROMPT_TEMPLATE
explicitly instructing agents not to @-mention other agents in their
replies. Mentions trigger Multica's mention-trigger mechanism, causing
cascade tasks from debaters responding to each other's comments.
Plain-name references (e.g. 'Code Reviewer' not '@Code Reviewer') are
still allowed for cross-reference in evidence text.
Replace the round-1 summariser approach with the full anchor pattern:
A. _post_rejection_retrigger(round_, client, issue, verdict_comment_content, logger)
- Renamed from _notify_assignee_on_reject
- Non-agent/no-assignee path: post a non-mentioning coordinator note and return
- Agent path: build full anchor comment via _build_retrigger_comment
B. Anchor comment structure (verbatim, no summarising):
1. [@AssigneeName](mention://agent/<id>)
2. Verdict: REJECT (round <id>)
3. ## ANCHOR — Original requirements + full issue description in blockquote
4. ## Why this was rejected + full judge verdict in blockquote
5. ## Instructions for rework + REWORK_INSTRUCTIONS constant (verbatim)
6. Trailing audit line
C. Round.retrigger_comment_id + DebateQueue.set_retrigger_comment_id
D. 8 required tests (D1–D8): mention, verbatim description, no-drift constant,
member skip, no-assignee skip, accept no-op, id persistence, race regression
E. test_retrigger_on_reject_end_to_end integration test
Removed: _extract_rejection_reasons, _build_rejection_followup (summarisers)
Added: REWORK_INSTRUCTIONS constant, MulticaClient.get_agent_name
Unit: 63 passed. Integration: 1 passed.
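The anchor structure in section B can be sketched roughly as follows. The
six parts and the mention link format come from the list above; the function
signature, the REWORK_INSTRUCTIONS text, and the wording of the audit line
are placeholders, since none of them appear in this message.

```python
# Placeholder text; the real constant is defined in the coordinator source.
REWORK_INSTRUCTIONS = "Rework the issue against the ANCHOR section above."


def blockquote(text: str) -> str:
    """Prefix every line with '> ' so the text is quoted verbatim."""
    return "\n".join(f"> {line}" if line else ">" for line in text.splitlines())


def build_retrigger_comment(assignee_name, agent_id, round_id,
                            issue_description, verdict_comment) -> str:
    """Assemble the anchor comment without summarising either input."""
    sections = [
        f"[@{assignee_name}](mention://agent/{agent_id})",
        f"Verdict: REJECT (round {round_id})",
        # \u2014 is the dash character used in the heading.
        "## ANCHOR \u2014 Original requirements\n\n" + blockquote(issue_description),
        "## Why this was rejected\n\n" + blockquote(verdict_comment),
        "## Instructions for rework\n\n" + REWORK_INSTRUCTIONS,
        f"(coordinator audit: round {round_id})",  # hypothetical audit line
    ]
    return "\n\n".join(sections)


comment = build_retrigger_comment(
    "Dev", "a1", "r7", "line one\nline two", "Scope drift in step 2."
)
print(comment)
```

Quoting both inputs in full, rather than extracting reasons, keeps the
assignee working from the original requirements instead of a paraphrase.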
After _advance_awaiting_judge kicks an issue back to in_progress on a
REJECT, post a short followup comment that @-mentions the agent assignee
with the verdict, top 2-3 failure reasons, and a retry prompt.
Corner cases handled:
- assignee_type != 'agent' (member or unset) → skip silently
- ACCEPT branch → no notification
- notification failure → logged, round still completes (non-blocking)
New helpers: _extract_rejection_reasons, _build_rejection_followup,
_notify_assignee_on_reject.
+12 tests (5 for _extract_rejection_reasons, 7 for the notify path).
Total: 66 passed.
New module: src/coordinator/orchestrator.py
- DEBATER_NAMES, JUDGE_NAME, DEBATER_PROMPTS, JUDGE_PROMPT_TEMPLATE hardcoded for v1
- Per-debater prompts tell each debater exactly which tool output to ground evidence in
- orchestrate_pending() is the main entry point called from watch_loop
- _start_round(): pending→running, posts debater mention comment, phase→awaiting_debaters
- _advance_awaiting_debaters(): polls for replies, handles timeout with partial evidence,
posts judge comment, phase→awaiting_judge
- _advance_awaiting_judge(): polls for verdict; RACE FIX — update_issue_status() called
BEFORE queue.update_status("done") so poll_once can never double-enqueue
- Detection: primary=author_id match, fallback=[{name} response]: content marker (enables tests)
- Restart-safe: phase field persisted on every mutation; in-flight rounds resume correctly
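The ordering behind the race fix can be illustrated with two recording
fakes. finish_round and both classes are hypothetical; only the method names
update_issue_status and update_status, and the order in which they are
called, come from the description above.

```python
class RecordingClient:
    """Fake API client that records calls instead of sending requests."""

    def __init__(self, log):
        self.log = log

    def update_issue_status(self, issue_id, status):
        self.log.append(("issue", issue_id, status))


class RecordingQueue:
    """Fake queue that records status updates."""

    def __init__(self, log):
        self.log = log

    def update_status(self, round_id, status):
        self.log.append(("round", round_id, status))


def finish_round(client, queue, round_id, issue_id, new_issue_status):
    # Move the issue out of in_review FIRST, so a concurrent poll cannot
    # see "issue still in review, no active round" and enqueue it again.
    client.update_issue_status(issue_id, new_issue_status)
    # Only then release the round.
    queue.update_status(round_id, "done")


calls = []
finish_round(RecordingClient(calls), RecordingQueue(calls),
             "r1", "WYL-45", "in_progress")
print(calls)
```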
Extended src/coordinator/queue.py:
- Round gains phase, phase_entered_at, coordinator_comment_id, judge_comment_id fields
- DebateQueue.update_phase() and running() added
- All new fields default-empty so existing queue.json files load cleanly
Extended src/coordinator/multica_client.py:
- update_issue_status() convenience wrapper
- create_issue() for integration / smoke tests
Updated src/coordinator/__main__.py:
- _orchestrate_pending stub replaced with real import from orchestrator
Tests:
- tests/test_orchestrator.py: 32 new unit tests covering phase transitions, timeouts,
race fix ordering, restart resume, full lifecycle
- tests/test_integration.py: @pytest.mark.integration test against real API
- smoke_test.py: standalone end-to-end script; ran against real API, verdict OK
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
R1: Add unit tests (22 total, all passing)
- tests/test_queue.py: enqueue/persist, is_issue_pending_or_running for all
statuses, update_status mutate+persist, save/load roundtrip, corrupt file
handling, pending() filtering
- tests/test_watcher.py: FakeMulticaClient drives poll_once across first
observation, same updated_at dedupe, updated_at change while pending/running
dedupe, re-enqueue after done, multi-issue, mix of new and seen
Refactor: extract poll_once(client, state, queue, logger) from watch_loop so
tests can call it directly without mocking time.sleep.
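A simplified sketch of the dedupe behaviour those watcher tests exercise.
The real poll_once takes (client, state, queue, logger); this version takes
a plain list of issues and a dict, and TinyQueue is a stand-in for
DebateQueue. Whether the seen timestamp is updated while a round is still
active is an assumption made here.

```python
class TinyQueue:
    """Stand-in queue mapping issue_id to the status of its latest round."""

    def __init__(self):
        self.rounds = {}

    def is_issue_pending_or_running(self, issue_id):
        return self.rounds.get(issue_id) in ("pending", "running")

    def enqueue(self, issue_id):
        self.rounds[issue_id] = "pending"


def poll_once(issues, seen, queue):
    """Enqueue issues whose updated_at changed and that have no active round."""
    enqueued = []
    for issue in issues:
        issue_id, updated_at = issue["id"], issue["updated_at"]
        if seen.get(issue_id) == updated_at:
            continue  # same updated_at: already handled
        seen[issue_id] = updated_at
        if queue.is_issue_pending_or_running(issue_id):
            continue  # a round is already in flight for this issue
        queue.enqueue(issue_id)
        enqueued.append(issue_id)
    return enqueued


queue, seen = TinyQueue(), {}
first = poll_once([{"id": "WYL-1", "updated_at": "t1"}], seen, queue)
repeat = poll_once([{"id": "WYL-1", "updated_at": "t1"}], seen, queue)
while_pending = poll_once([{"id": "WYL-1", "updated_at": "t2"}], seen, queue)
queue.rounds["WYL-1"] = "done"
after_done = poll_once([{"id": "WYL-1", "updated_at": "t3"}], seen, queue)
print(first, repeat, while_pending, after_done)
```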
R2: Document known race near is_issue_pending_or_running — comment added
noting that orchestrator marking round done before updating issue status can
fire a second round; WYL-45 must resolve atomically.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Minimum viable structure for the Tool-MAD coordinator:
- coordinator.config: env-loaded Config dataclass, writes state to ~/.coordinator/
- coordinator.multica_client: thin requests wrapper for issues/comments/agents
- coordinator.state: flat-json SeenState tracking issue_id -> last_seen_updated_at
- coordinator.__main__: watch_loop() that polls in_review and logs candidates
- README.md: why this exists + how to run
v0 only detects in_review transitions; convening debate rounds is WYL-45.
Dependencies: stdlib + requests (nothing else until a working v1 ships).
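As an illustration of the flat-json state idea, a stdlib-only sketch. The
class name SeenState and the issue_id -> last_seen_updated_at mapping come
from the list above; the attribute and method names, and the sample
timestamp, are assumptions.

```python
import json
import tempfile
from pathlib import Path


class SeenState:
    """Maps issue_id to the last updated_at value the watcher has seen."""

    def __init__(self, path):
        self.path = Path(path)
        self.seen = {}

    def load(self):
        # A missing file simply means nothing has been seen yet.
        if self.path.exists():
            self.seen = json.loads(self.path.read_text(encoding="utf-8"))

    def save(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(
            json.dumps(self.seen, indent=2, sort_keys=True), encoding="utf-8"
        )


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "seen.json"
    state = SeenState(path)
    state.seen["WYL-45"] = "2026-01-01T00:00:00Z"
    state.save()

    restored = SeenState(path)
    restored.load()

print(restored.seen)
```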