The prior pipeline (4 hand-written debater prompts + 1 judge with my prompt
template) kept missing scope drift because every prompt was mine and the
reviewers were all on the same model tier with correlated priors.
This commit replaces the whole review step with CEK's judge-with-debate
pattern translated to multica-native execution:
pending → awaiting_rubric (meta-judge writes YAML spec from issue alone)
→ awaiting_judges (3 judges on 3 copilot models score independently)
→ consensus check (overall within 0.5, criteria within 1.0)
→ accept or reject OR awaiting_debate rounds up to 3
→ error on malformed YAML or cap hit
Per higher-management direction, we do not deal with a model that cannot
produce YAML: malformed rubric or all-unparseable judge reports fail the
round immediately (no retries, no fallback to hand-written prompts).
The anchor retrigger on REJECT (WYL-51 behaviour) is preserved verbatim.
Agent prompts for meta-judge and the 3 judges come from the CEK agents
themselves (Meta-Judge / Judge-GPT / Judge-Claude / Judge-Gemini) whose
`instructions` field is the CEK meta-judge.md / judge.md files uploaded
byte-for-byte. No prompts are authored in this coordinator's source.
Adds pyyaml dependency.
- src/coordinator/orchestrator.py: rewritten for the new phase machine
- src/coordinator/queue.py: Round extended with rubric_yaml, judge_report_comment_ids, debate_round
- tests/test_orchestrator.py: 40 tests for new pipeline (helpers, parsers, consensus math, phase handlers, race fix, retrigger)
- tests/test_integration.py: removed (tested old debater pipeline)
- pyproject.toml: adds pyyaml
Tests: 67 passed in 0.20s (40 orchestrator + 15 queue + 7 watcher + 5 other).
New module: src/coordinator/orchestrator.py
- DEBATER_NAMES, JUDGE_NAME, DEBATER_PROMPTS, JUDGE_PROMPT_TEMPLATE hardcoded for v1
- Per-debater prompts tell each debater exactly which tool output to ground evidence in
- orchestrate_pending() is the main entry point called from watch_loop
- _start_round(): pending→running, posts debater mention comment, phase→awaiting_debaters
- _advance_awaiting_debaters(): polls for replies, handles timeout with partial evidence,
posts judge comment, phase→awaiting_judge
- _advance_awaiting_judge(): polls for verdict; RACE FIX — update_issue_status() called
BEFORE queue.update_status("done") so poll_once can never double-enqueue
- Detection: primary=author_id match, fallback=[{name} response]: content marker (enables tests)
- Restart-safe: phase field persisted on every mutation; in-flight rounds resume correctly
Extended src/coordinator/queue.py:
- Round gains phase, phase_entered_at, coordinator_comment_id, judge_comment_id fields
- DebateQueue.update_phase() and running() added
- All new fields default-empty so existing queue.json files load cleanly
Extended src/coordinator/multica_client.py:
- update_issue_status() convenience wrapper
- create_issue() for integration / smoke tests
Updated src/coordinator/__main__.py:
- _orchestrate_pending stub replaced with real import from orchestrator
Tests:
- tests/test_orchestrator.py: 32 new unit tests covering phase transitions, timeouts,
race fix ordering, restart resume, full lifecycle
- tests/test_integration.py: @pytest.mark.integration test against real API
- smoke_test.py: standalone end-to-end script; ran against real API, verdict OK
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Minimum viable structure for the Tool-MAD coordinator:
- coordinator.config: env-loaded Config dataclass, writes state to ~/.coordinator/
- coordinator.multica_client: thin requests wrapper for issues/comments/agents
- coordinator.state: flat-json SeenState tracking issue_id -> last_seen_updated_at
- coordinator.__main__: watch_loop() that polls in_review and logs candidates
- README.md: why this exists + how to run
v0 only detects in_review transitions; convening debate rounds is WYL-45.
Dependencies: stdlib + requests (nothing else until a working v1 ships).