coordinator

Author	SHA1	Message	Date
m-platform-admin	00ff80fbbb	Port CEK's debate-round instructions verbatim; prevent sycophantic convergence Live round on WYL-72 exposed that my debate-round comment produced sycophancy, not debate. Judge-Gemini moved from a 4.0 ACCEPT stance to "I agree my initial score was too lenient... I have adjusted my scores to 3s and 2s" after reading Judge-GPT's 2.92 report — without defending its original scores or challenging GPT's evidence. Classic social-pressure convergence. Root cause: my debate comment said "You may hold your position if you have new evidence; you may move if you find the other reasoning more grounded. Do not split the difference to compromise." That phrasing is both weaker than CEK's intent AND it dropped every structural anti-sycophancy instruction CEK spelled out in judge-with-debate/SKILL.md: Missing: "Identify disagreements (where your scores differ by >1 point)" Missing: "Defend your position with evidence from the specification" Missing: "Challenge the other judge's position with counter-evidence" Missing: "Only revise if you find their evidence compelling" Missing: "Defend your original scores if you still believe them" Also: I asked judges to post a REVISED report (implicitly retracting their prior position). CEK asks them to APPEND a debate round section to their prior report, keeping both visible so the revision is a change ON TOP OF the original rather than a replacement. Fixed by porting CEK's instruction block verbatim into _build_debate_round_comment. Added a regression test that fails if any future edit removes these exact clauses. Tests: 72 passed (+1 regression test).	2026-04-18 22:39:30 +02:00
m-platform-admin	1a77ddcb99	Repair \` → ` in YAML; live test finding from WYL-72 round 1 Judge-Gemini (on github-copilot/gemini-3.1-pro-preview) emitted reports with \` (backslash-backtick) inside double-quoted YAML strings, imitating markdown escaping (e.g. evidence: "see \`foo.py\`"). \` is not a valid YAML escape sequence, so yaml.safe_load rejected the entire report. Judge-GPT did not make this mistake, so the consensus degenerated from 3-to-2 parseable reports to 1-to-2 (and therefore produced a spurious single-judge convergence). The fix is a targeted cleanup in _extract_yaml: replace \` with literal ` before parsing. No other interpretation of \` exists in YAML, so this does not mask semantics. Tests: 71 passed (added 2 for the new cleanup path). Separately surfaced (not yet addressed in this commit): - Judge-Claude on github-copilot/claude-sonnet-4.6 and github-copilot/claude-opus-4.5 returned model_not_supported. Per GitHub Copilot Student plan docs (2026-03-13 update), Claude Opus/Sonnet are no longer student-selectable. Tower-Copilot-Claude now runs claude-haiku-4.5 (which is student-accessible).	2026-04-18 22:26:50 +02:00
m-platform-admin	d1039d01de	Fix YAML parser: HTML-unescape content before parse Multica's comment REST API returns content with HTML entities escaped (`&quot;` for `"`, `&gt;` for `>`, etc.). Agent replies are plain UTF-8, so we unescape before extracting and parsing. Caught by the first live test against WYL-72: Meta-Judge produced a perfectly valid YAML evaluation specification with CK-001..CK-010 checklist items, but my parser reported it as malformed because `&quot;` is not a valid YAML token. The model was fine; the plumbing was wrong. - src/coordinator/orchestrator.py: _extract_yaml now calls html.unescape first - tests/test_orchestrator.py: +2 tests covering entity decoding Tests: 69 passed.	2026-04-18 22:13:31 +02:00
m-platform-admin	f88255096e	Replace hand-written debater pipeline with CEK judge-with-debate The prior pipeline (4 hand-written debater prompts + 1 judge with my prompt template) kept missing scope drift because every prompt was mine and the reviewers were all on the same model tier with correlated priors. This commit replaces the whole review step with CEK's judge-with-debate pattern translated to multica-native execution: pending → awaiting_rubric (meta-judge writes YAML spec from issue alone) → awaiting_judges (3 judges on 3 copilot models score independently) → consensus check (overall within 0.5, criteria within 1.0) → accept or reject OR awaiting_debate rounds up to 3 → error on malformed YAML or cap hit Per higher-management direction, we do not deal with a model that cannot produce YAML: malformed rubric or all-unparseable judge reports fail the round immediately (no retries, no fallback to hand-written prompts). The anchor retrigger on REJECT (WYL-51 behaviour) is preserved verbatim. Agent prompts for meta-judge and the 3 judges come from the CEK agents themselves (Meta-Judge / Judge-GPT / Judge-Claude / Judge-Gemini) whose `instructions` field is the CEK meta-judge.md / judge.md files uploaded byte-for-byte. No prompts are authored in this coordinator's source. Adds pyyaml dependency. - src/coordinator/orchestrator.py: rewritten for the new phase machine - src/coordinator/queue.py: Round extended with rubric_yaml, judge_report_comment_ids, debate_round - tests/test_orchestrator.py: 40 tests for new pipeline (helpers, parsers, consensus math, phase handlers, race fix, retrigger) - tests/test_integration.py: removed (tested old debater pipeline) - pyproject.toml: adds pyyaml Tests: 67 passed in 0.20s (40 orchestrator + 15 queue + 7 watcher + 5 other).	2026-04-18 22:01:18 +02:00
m-senior-developer	840b3c388c	Improve logging across all coordinator modules - multica_client: add _req() helper logging HTTP method, URL, status code, and response time for every API call; warns with response body on 4xx/5xx - queue: log load/save errors with file paths; warn on JSON parse failures; log record count on load - state: same as queue — log load/save errors and entry count on load - orchestrator: add exc_info=True to all ERROR-level exception logs for full tracebacks; upgrade debater/judge near-timeout messages from DEBUG to WARNING at 80% of the timeout threshold - __main__: log file paths (queue, seen, log) at startup; add exc_info=True to poll error; emit periodic heartbeat with active round count every 100 cycles	2026-04-18 18:34:44 +00:00
m-senior-developer	5c57ec137b	Add scope-swap detection clause to all DEBATER_PROMPTS (WYL-60) Each of the 4 debater prompts now includes a second paragraph instructing the debater to independently judge whether the committed work addresses what was actually asked, or substituted an easier question. If a scope swap is detected, debaters are instructed to state "SCOPE SWAP DETECTED" and recommend REJECT on that basis alone. Adds _SCOPE_SWAP_CLAUSE constant and unit test asserting "SCOPE SWAP" appears in every DEBATER_PROMPTS entry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 23:31:41 +00:00
m-senior-developer	d3db6cfcd7	Merge pull request 'WYL-52: suppress debater mention-chain cascade' (#1) from agent/senior-developer/wyl-52 into main	2026-04-16 00:55:23 +02:00
m-senior-developer	191cb2e01a	WYL-52: suppress debater mention-chain cascade Add _NO_MENTION_RULE to all DEBATER_PROMPTS and JUDGE_PROMPT_TEMPLATE explicitly instructing agents not to @-mention other agents in their replies. Mentions trigger Multica's mention-trigger mechanism, causing cascade tasks from debaters responding to each other's comments. Plain-name references (e.g. 'Code Reviewer' not '@Code Reviewer') are still allowed for cross-reference in evidence text.	2026-04-15 22:48:06 +00:00
m-senior-developer	11d1d675dd	WYL-51 round 2: requirement anchoring retrigger comment Replace the round-1 summariser approach with the full anchor pattern: A. _post_rejection_retrigger(round_, client, issue, verdict_comment_content, logger) - Renamed from _notify_assignee_on_reject - Non-agent/no-assignee path: post a non-mentioning coordinator note and return - Agent path: build full anchor comment via _build_retrigger_comment B. Anchor comment structure (verbatim, no summarising): 1. [@AssigneeName](mention://agent/<id>) 2. Verdict: REJECT (round <id>) 3. ## ANCHOR — Original requirements + full issue description in blockquote 4. ## Why this was rejected + full judge verdict in blockquote 5. ## Instructions for rework + REWORK_INSTRUCTIONS constant (verbatim) 6. Trailing audit line C. Round.retrigger_comment_id + DebateQueue.set_retrigger_comment_id D. 8 required tests (D1–D8): mention, verbatim description, no-drift constant, member skip, no-assignee skip, accept no-op, id persistence, race regression E. test_retrigger_on_reject_end_to_end integration test Removed: _extract_rejection_reasons, _build_rejection_followup (summarisers) Added: REWORK_INSTRUCTIONS constant, MulticaClient.get_agent_name Unit: 63 passed. Integration: 1 passed.	2026-04-15 22:24:39 +00:00
m-senior-developer	cb6156a4a1	WYL-51: re-trigger assignee on REJECT verdict After _advance_awaiting_judge kicks an issue back to in_progress on a REJECT, post a short followup comment that @-mentions the agent assignee with the verdict, top 2-3 failure reasons, and a retry prompt. Corner cases handled: - assignee_type != 'agent' (member or unset) → skip silently - ACCEPT branch → no notification - notification failure → logged, round still completes (non-blocking) New helpers: _extract_rejection_reasons, _build_rejection_followup, _notify_assignee_on_reject. +12 tests (5 for _extract_rejection_reasons, 7 for the notify path). Total: 66 passed.	2026-04-15 22:13:34 +00:00
m-senior-developer	0e44846032	Implement debate round orchestration (WYL-45) New module: src/coordinator/orchestrator.py - DEBATER_NAMES, JUDGE_NAME, DEBATER_PROMPTS, JUDGE_PROMPT_TEMPLATE hardcoded for v1 - Per-debater prompts tell each debater exactly which tool output to ground evidence in - orchestrate_pending() is the main entry point called from watch_loop - _start_round(): pending→running, posts debater mention comment, phase→awaiting_debaters - _advance_awaiting_debaters(): polls for replies, handles timeout with partial evidence, posts judge comment, phase→awaiting_judge - _advance_awaiting_judge(): polls for verdict; RACE FIX — update_issue_status() called BEFORE queue.update_status("done") so poll_once can never double-enqueue - Detection: primary=author_id match, fallback=[{name} response]: content marker (enables tests) - Restart-safe: phase field persisted on every mutation; in-flight rounds resume correctly Extended src/coordinator/queue.py: - Round gains phase, phase_entered_at, coordinator_comment_id, judge_comment_id fields - DebateQueue.update_phase() and running() added - All new fields default-empty so existing queue.json files load cleanly Extended src/coordinator/multica_client.py: - update_issue_status() convenience wrapper - create_issue() for integration / smoke tests Updated src/coordinator/__main__.py: - _orchestrate_pending stub replaced with real import from orchestrator Tests: - tests/test_orchestrator.py: 32 new unit tests covering phase transitions, timeouts, race fix ordering, restart resume, full lifecycle - tests/test_integration.py: @pytest.mark.integration test against real API - smoke_test.py: standalone end-to-end script; ran against real API, verdict OK Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 21:43:17 +00:00
m-senior-developer	8c9a174ddc	WYL-44 follow-up: add tests and extract poll_once for testability R1: Add unit tests (22 total, all passing) - tests/test_queue.py: enqueue/persist, is_issue_pending_or_running for all statuses, update_status mutate+persist, save/load roundtrip, corrupt file handling, pending() filtering - tests/test_watcher.py: FakeMulticaClient drives poll_once across first observation, same updated_at dedupe, updated_at change while pending/running dedupe, re-enqueue after done, multi-issue, mix of new and seen Refactor: extract poll_once(client, state, queue, logger) from watch_loop so tests can call it directly without mocking time.sleep. R2: Document known race near is_issue_pending_or_running — comment added noting that orchestrator marking round done before updating issue status can fire a second round; WYL-45 must resolve atomically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 21:20:23 +00:00
m-senior-developer	6ebd36bccc	WYL-44: implement in_review watcher with persistent debate queue - Add queue.py: DebateQueue backed by ~/.coordinator/queue.json (Round dataclass, enqueue/update_status/pending/is_issue_pending_or_running) - Update watch_loop to call queue.enqueue() on in_review transitions instead of only logging; skip issues already pending/running to avoid double-queuing - Add queue_file path to Config - Add _orchestrate_pending stub (WYL-45 hook point) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 21:13:53 +00:00
m-platform-admin	6da434039c	WYL-42: Python skeleton + in_review watcher loop Minimum viable structure for the Tool-MAD coordinator: - coordinator.config: env-loaded Config dataclass, writes state to ~/.coordinator/ - coordinator.multica_client: thin requests wrapper for issues/comments/agents - coordinator.state: flat-json SeenState tracking issue_id -> last_seen_updated_at - coordinator.__main__: watch_loop() that polls in_review and logs candidates - README.md: why this exists + how to run v0 only detects in_review transitions; convening debate rounds is WYL-45. Dependencies: stdlib + requests (nothing else until a working v1 ships).	2026-04-15 23:04:06 +02:00
m-platform-admin	3935102d96	Initial commit	2026-04-15 23:00:55 +02:00
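
Commit f88255096e describes the consensus check only as "overall within 0.5, criteria within 1.0". The sketch below is one possible reading of that rule, not the code in src/coordinator/orchestrator.py. It assumes "within" means the max-minus-min spread across judges, that the bounds are inclusive, and that at least two parseable reports are required. The report shape (`overall`, `criteria`) and the names `has_consensus` and `spread` are hypothetical.

```python
from typing import Dict, List

# Tolerances quoted in the commit message; treating them as inclusive is an assumption.
OVERALL_TOLERANCE = 0.5
CRITERION_TOLERANCE = 1.0


def spread(values: List[float]) -> float:
    """Distance between the highest and lowest score."""
    return max(values) - min(values)


def has_consensus(reports: List[Dict]) -> bool:
    """Return True when the judges' scores are close enough to skip a debate round.

    Each report is a hypothetical dict such as
    {"overall": 3.5, "criteria": {"CK-001": 4, "CK-002": 3}}.
    """
    # Assumption: a single report cannot form a consensus on its own.
    if len(reports) < 2:
        return False

    if spread([r["overall"] for r in reports]) > OVERALL_TOLERANCE:
        return False

    # Only compare criteria that every judge actually scored.
    shared = set.intersection(*(set(r["criteria"]) for r in reports))
    return all(
        spread([r["criteria"][name] for r in reports]) <= CRITERION_TOLERANCE
        for name in shared
    )


agreeing = [
    {"overall": 4.0, "criteria": {"CK-001": 4, "CK-002": 3}},
    {"overall": 3.5, "criteria": {"CK-001": 4, "CK-002": 4}},
]
# Overall scores taken from the WYL-72 round in commit 00ff80fbbb; criteria are made up.
diverging = [
    {"overall": 4.0, "criteria": {"CK-001": 4}},
    {"overall": 2.92, "criteria": {"CK-001": 3}},
]
print(has_consensus(agreeing), has_consensus(diverging))
```

Under these assumptions the 4.0 versus 2.92 split would send the round to awaiting_debate rather than straight to accept or reject.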

15 Commits
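
The two parser fixes in commits d1039d01de and 1a77ddcb99 both clean the comment text before it reaches yaml.safe_load. A minimal sketch of that cleanup follows, using only the standard library. The function name `clean_report_text` is hypothetical, and the real `_extract_yaml` presumably also locates the YAML inside the comment, which is not shown here.

```python
import html


def clean_report_text(raw: str) -> str:
    """Hypothetical stand-in for the cleanup steps described for _extract_yaml."""
    # Step 1: the comment API returns HTML-escaped content, so decode entities first.
    text = html.unescape(raw)
    # Step 2: backslash-backtick is not a YAML escape sequence; keep the bare backtick.
    return text.replace("\\`", "`")


# A comment body as it might come back from the API (illustrative, not a real payload).
raw = 'evidence: &quot;see \\`foo.py\\`&quot;'
cleaned = clean_report_text(raw)
print(cleaned)
```

Decoding entities before stripping the backslashes follows the order in which the fixes landed. The two steps touch different characters, so the order should not matter for inputs like this one.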