15 Commits

Author SHA1 Message Date
m-platform-admin 00ff80fbbb Port CEK's debate-round instructions verbatim; prevent sycophantic convergence
Live round on WYL-72 exposed that my debate-round comment produced
sycophancy, not debate.  Judge-Gemini moved from a 4.0 ACCEPT stance to
"I agree my initial score was too lenient... I have adjusted my scores
to 3s and 2s" after reading Judge-GPT's 2.92 report — without defending
its original scores or challenging GPT's evidence.  Classic social-pressure
convergence.

Root cause: my debate comment said "You may hold your position if you
have new evidence; you may move if you find the other reasoning more
grounded.  Do not split the difference to compromise."  That phrasing
is both weaker than CEK's intent AND it dropped every structural
anti-sycophancy instruction CEK spelled out in judge-with-debate/SKILL.md:

  Missing: "Identify disagreements (where your scores differ by >1 point)"
  Missing: "Defend your position with evidence from the specification"
  Missing: "Challenge the other judge's position with counter-evidence"
  Missing: "Only revise if you find their evidence compelling"
  Missing: "Defend your original scores if you still believe them"

Also: I asked judges to post a REVISED report (implicitly retracting
their prior position).  CEK asks them to APPEND a debate round section
to their prior report, keeping both visible so the revision is a change
ON TOP OF the original rather than a replacement.

Fixed by porting CEK's instruction block verbatim into _build_debate_round_comment.
Added a regression test that fails if any future edit removes these exact
clauses.

Tests: 72 passed (+1 regression test).
2026-04-18 22:39:30 +02:00
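The regression test described in commit 00ff80fbbb could look roughly like the sketch below. The five clause strings are the ones quoted in the commit message; `build_debate_round_comment`, `missing_clauses`, and the surrounding comment wording are hypothetical stand-ins, since the real `_build_debate_round_comment` is not shown here.

```python
# Clauses the commit message lists as missing from the earlier debate comment.
REQUIRED_CLAUSES = (
    "Identify disagreements (where your scores differ by >1 point)",
    "Defend your position with evidence from the specification",
    "Challenge the other judge's position with counter-evidence",
    "Only revise if you find their evidence compelling",
    "Defend your original scores if you still believe them",
)


def build_debate_round_comment(round_number: int) -> str:
    """Hypothetical stand-in for _build_debate_round_comment."""
    lines = [f"Debate round {round_number}", ""]
    lines += [f"- {clause}" for clause in REQUIRED_CLAUSES]
    # Assumed wording: judges append to their prior report instead of replacing it.
    lines += ["", "Append a debate round section to your prior report."]
    return "\n".join(lines)


def missing_clauses(comment: str) -> list[str]:
    """Return every required clause that is absent from the comment."""
    return [clause for clause in REQUIRED_CLAUSES if clause not in comment]


def test_debate_comment_keeps_anti_sycophancy_clauses() -> None:
    # Fails as soon as an edit drops or rewords one of the exact clauses.
    assert missing_clauses(build_debate_round_comment(1)) == []
```

Matching on exact substrings is deliberately strict: a paraphrase of a clause fails the test, which is the behaviour the commit message asks for.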
m-platform-admin 1a77ddcb99 Repair \` → ` in YAML; live test finding from WYL-72 round 1
Judge-Gemini (on github-copilot/gemini-3.1-pro-preview) emitted reports
with \` (backslash-backtick) inside double-quoted YAML strings,
imitating markdown escaping (e.g. evidence: "see \`foo.py\`").  \`
is not a valid YAML escape sequence, so yaml.safe_load rejected the entire
report.  Judge-GPT did not make this mistake, so the consensus degenerated
from 3-to-2 parseable reports to 1-to-2 (and therefore produced a spurious
single-judge convergence).

The fix is a targeted cleanup in _extract_yaml: replace \` with literal
` before parsing.  No other interpretation of \` exists in YAML, so
this does not mask semantics.

Tests: 71 passed (added 2 for the new cleanup path).

Separately surfaced (not yet addressed in this commit):
- Judge-Claude on github-copilot/claude-sonnet-4.6 and
  github-copilot/claude-opus-4.5 returned model_not_supported.  Per GitHub
  Copilot Student plan docs (2026-03-13 update), Claude Opus/Sonnet are no
  longer student-selectable.  Tower-Copilot-Claude now runs
  claude-haiku-4.5 (which is student-accessible).
2026-04-18 22:26:50 +02:00
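A minimal sketch of the cleanup described in commit 1a77ddcb99, assuming it is a plain string replacement applied before the text reaches the YAML parser. The function name `repair_backslash_backtick` is hypothetical; the actual change lives inside `_extract_yaml`.

```python
def repair_backslash_backtick(text: str) -> str:
    """Replace backslash-backtick, which YAML does not accept as an escape,
    with a literal backtick."""
    return text.replace("\\`", "`")


# A judge report line imitating markdown escaping inside a double-quoted string.
raw = 'evidence: "see \\`foo.py\\`"'
cleaned = repair_backslash_backtick(raw)
```

After the replacement, `cleaned` holds `evidence: "see `foo.py`"`, which a YAML parser can read as an ordinary double-quoted scalar. Note that this simple form does not special-case an escaped backslash that happens to precede a backtick.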
m-platform-admin d1039d01de Fix YAML parser: HTML-unescape content before parse
Multica's comment REST API returns content with HTML entities escaped
(`&quot;` for `"`, `&gt;` for `>`, etc.). Agent replies are plain UTF-8,
so we unescape before extracting and parsing.

Caught by the first live test against WYL-72: Meta-Judge produced a
perfectly valid YAML evaluation specification with CK-001..CK-010 checklist
items, but my parser reported it as malformed because `&quot;` is not a
valid YAML token. The model was fine; the plumbing was wrong.

- src/coordinator/orchestrator.py: _extract_yaml now calls html.unescape first
- tests/test_orchestrator.py: +2 tests covering entity decoding

Tests: 69 passed.
2026-04-18 22:13:31 +02:00
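The fix in commit d1039d01de can be illustrated with the standard library alone. This is a sketch under the assumption that the unescape step is a direct call to `html.unescape` on the raw comment body; `unescape_comment` and the sample content are made up for illustration.

```python
import html


def unescape_comment(content: str) -> str:
    """Decode HTML entities in an API comment body before YAML extraction."""
    return html.unescape(content)


# Content as the REST API might return it, with entities escaped.
api_content = "title: &quot;Rubric&quot;\nthreshold: &gt;= 3"
decoded = unescape_comment(api_content)
```

Here `decoded` contains real double quotes and a real `>` character, so the text can be handed to the YAML parser unchanged.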
m-platform-admin f88255096e Replace hand-written debater pipeline with CEK judge-with-debate
The prior pipeline (4 hand-written debater prompts + 1 judge with my prompt
template) kept missing scope drift because every prompt was mine and the
reviewers were all on the same model tier with correlated priors.

This commit replaces the whole review step with CEK's judge-with-debate
pattern translated to multica-native execution:

  pending → awaiting_rubric (meta-judge writes YAML spec from issue alone)
          → awaiting_judges (3 judges on 3 copilot models score independently)
          → consensus check (overall within 0.5, criteria within 1.0)
          → accept or reject OR awaiting_debate rounds up to 3
          → error on malformed YAML or cap hit

Per higher-management direction, we do not deal with a model that cannot
produce YAML: malformed rubric or all-unparseable judge reports fail the
round immediately (no retries, no fallback to hand-written prompts).

The anchor retrigger on REJECT (WYL-51 behaviour) is preserved verbatim.

Agent prompts for meta-judge and the 3 judges come from the CEK agents
themselves (Meta-Judge / Judge-GPT / Judge-Claude / Judge-Gemini) whose
`instructions` field is the CEK meta-judge.md / judge.md files uploaded
byte-for-byte. No prompts are authored in this coordinator's source.

Adds pyyaml dependency.

- src/coordinator/orchestrator.py: rewritten for the new phase machine
- src/coordinator/queue.py: Round extended with rubric_yaml, judge_report_comment_ids, debate_round
- tests/test_orchestrator.py: 40 tests for new pipeline (helpers, parsers, consensus math, phase handlers, race fix, retrigger)
- tests/test_integration.py: removed (tested old debater pipeline)
- pyproject.toml: adds pyyaml

Tests: 67 passed in 0.20s (40 orchestrator + 15 queue + 7 watcher + 5 other).
2026-04-18 22:01:18 +02:00
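The consensus check from commit f88255096e (overall within 0.5, criteria within 1.0) might be sketched as follows. The interpretation of "within" as the spread between the highest and lowest score is an assumption, as are the `JudgeReport` shape, the function names, and the choice to treat fewer than two reports as no consensus.

```python
from dataclasses import dataclass

OVERALL_TOLERANCE = 0.5    # from the commit message
CRITERION_TOLERANCE = 1.0  # from the commit message


@dataclass
class JudgeReport:
    """Assumed shape of one parsed judge report."""
    judge: str
    overall: float
    criteria: dict[str, float]


def spread(values: list[float]) -> float:
    """Distance between the highest and lowest value."""
    return max(values) - min(values)


def has_consensus(reports: list[JudgeReport]) -> bool:
    # Assumption: a single parseable report cannot form a consensus.
    if len(reports) < 2:
        return False
    if spread([r.overall for r in reports]) > OVERALL_TOLERANCE:
        return False
    # Compare only criteria that every judge scored.
    shared = set.intersection(*(set(r.criteria) for r in reports))
    return all(
        spread([r.criteria[name] for r in reports]) <= CRITERION_TOLERANCE
        for name in shared
    )
```

With overall scores of 4.0 and 2.92 the spread is 1.08, so this sketch reports no consensus and the round would move on to a debate phase.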
m-senior-developer 840b3c388c Improve logging across all coordinator modules
- multica_client: add _req() helper logging HTTP method, URL, status code,
  and response time for every API call; warns with response body on 4xx/5xx
- queue: log load/save errors with file paths; warn on JSON parse failures;
  log record count on load
- state: same as queue — log load/save errors and entry count on load
- orchestrator: add exc_info=True to all ERROR-level exception logs for full
  tracebacks; upgrade debater/judge near-timeout messages from DEBUG to
  WARNING at 80% of the timeout threshold
- __main__: log file paths (queue, seen, log) at startup; add exc_info=True
  to poll error; emit periodic heartbeat with active round count every 100 cycles
2026-04-18 18:34:44 +00:00
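The near-timeout escalation mentioned in commit 840b3c388c (DEBUG raised to WARNING at 80% of the timeout threshold) could be expressed like this. The helper names and the message format are hypothetical; only the 80% figure and the two log levels come from the commit message.

```python
import logging

NEAR_TIMEOUT_FRACTION = 0.8  # 80% of the timeout threshold, per the commit


def near_timeout_level(elapsed_s: float, timeout_s: float) -> int:
    """Pick WARNING once the wait has used 80% of its timeout, else DEBUG."""
    if elapsed_s >= NEAR_TIMEOUT_FRACTION * timeout_s:
        return logging.WARNING
    return logging.DEBUG


def log_wait(logger: logging.Logger, who: str, elapsed_s: float, timeout_s: float) -> None:
    """Log a 'still waiting' message at the level chosen above."""
    logger.log(
        near_timeout_level(elapsed_s, timeout_s),
        "%s still pending after %.0fs of %.0fs",
        who, elapsed_s, timeout_s,
    )
```

Keeping the level decision in a pure function makes it easy to test without capturing log output.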
m-senior-developer 5c57ec137b Add scope-swap detection clause to all DEBATER_PROMPTS (WYL-60)
Each of the 4 debater prompts now includes a second paragraph instructing
the debater to independently judge whether the committed work addresses what
was actually asked, or substituted an easier question. If a scope swap is
detected, debaters are instructed to state "SCOPE SWAP DETECTED" and
recommend REJECT on that basis alone.

Adds _SCOPE_SWAP_CLAUSE constant and unit test asserting "SCOPE SWAP"
appears in every DEBATER_PROMPTS entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:31:41 +00:00
m-senior-developer d3db6cfcd7 Merge pull request 'WYL-52: suppress debater mention-chain cascade' (#1) from agent/senior-developer/wyl-52 into main 2026-04-16 00:55:23 +02:00
m-senior-developer 191cb2e01a WYL-52: suppress debater mention-chain cascade
Add _NO_MENTION_RULE to all DEBATER_PROMPTS and JUDGE_PROMPT_TEMPLATE
explicitly instructing agents not to @-mention other agents in their
replies. Mentions trigger Multica's mention-trigger mechanism, causing
cascade tasks from debaters responding to each other's comments.

Plain-name references (e.g. 'Code Reviewer' not '@Code Reviewer') are
still allowed for cross-reference in evidence text.
2026-04-15 22:48:06 +00:00
m-senior-developer 11d1d675dd WYL-51 round 2: requirement anchoring retrigger comment
Replace the round-1 summariser approach with the full anchor pattern:

A. _post_rejection_retrigger(round_, client, issue, verdict_comment_content, logger)
   - Renamed from _notify_assignee_on_reject
   - Non-agent/no-assignee path: post a non-mentioning coordinator note and return
   - Agent path: build full anchor comment via _build_retrigger_comment

B. Anchor comment structure (verbatim, no summarising):
   1. [@AssigneeName](mention://agent/<id>)
   2. Verdict: REJECT (round <id>)
   3. ## ANCHOR — Original requirements + full issue description in blockquote
   4. ## Why this was rejected + full judge verdict in blockquote
   5. ## Instructions for rework + REWORK_INSTRUCTIONS constant (verbatim)
   6. Trailing audit line

C. Round.retrigger_comment_id + DebateQueue.set_retrigger_comment_id

D. 8 required tests (D1–D8): mention, verbatim description, no-drift constant,
   member skip, no-assignee skip, accept no-op, id persistence, race regression

E. test_retrigger_on_reject_end_to_end integration test

Removed: _extract_rejection_reasons, _build_rejection_followup (summarisers)
Added: REWORK_INSTRUCTIONS constant, MulticaClient.get_agent_name

Unit: 63 passed. Integration: 1 passed.
2026-04-15 22:24:39 +00:00
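The anchor comment structure listed in commit 11d1d675dd can be sketched as a small builder. The mention link format and section order follow the commit message; the heading strings are approximations of the ones listed there, and the `REWORK_INSTRUCTIONS` text, the audit line, and the function signature are placeholders rather than the project's real values.

```python
# Placeholder text: the real REWORK_INSTRUCTIONS constant is not shown in the log.
REWORK_INSTRUCTIONS = "Re-read the original requirements above and address each one."


def blockquote(text: str) -> str:
    """Prefix every line with '>' so the text is quoted verbatim."""
    return "\n".join(f"> {line}" if line else ">" for line in text.splitlines())


def build_retrigger_comment(
    assignee_name: str,
    agent_id: str,
    round_id: str,
    issue_description: str,
    verdict: str,
) -> str:
    """Assemble the six-part anchor comment without summarising any input."""
    parts = [
        f"[@{assignee_name}](mention://agent/{agent_id})",
        f"Verdict: REJECT (round {round_id})",
        "## ANCHOR: original requirements\n\n" + blockquote(issue_description),
        "## Why this was rejected\n\n" + blockquote(verdict),
        "## Instructions for rework\n\n" + REWORK_INSTRUCTIONS,
        # Hypothetical audit line; the real wording is not given in the log.
        f"_Posted by the coordinator for round {round_id}._",
    ]
    return "\n\n".join(parts)
```

Quoting the description and verdict in full, instead of extracting the top reasons, is what distinguishes this approach from the round-1 summariser it replaced.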
m-senior-developer cb6156a4a1 WYL-51: re-trigger assignee on REJECT verdict
After _advance_awaiting_judge kicks an issue back to in_progress on a
REJECT, post a short followup comment that @-mentions the agent assignee
with the verdict, top 2-3 failure reasons, and a retry prompt.

Corner cases handled:
- assignee_type != 'agent' (member or unset) → skip silently
- ACCEPT branch → no notification
- notification failure → logged, round still completes (non-blocking)

New helpers: _extract_rejection_reasons, _build_rejection_followup,
_notify_assignee_on_reject.

+12 tests (5 for _extract_rejection_reasons, 7 for the notify path).
Total: 66 passed.
2026-04-15 22:13:34 +00:00
m-senior-developer 0e44846032 Implement debate round orchestration (WYL-45)
New module: src/coordinator/orchestrator.py
- DEBATER_NAMES, JUDGE_NAME, DEBATER_PROMPTS, JUDGE_PROMPT_TEMPLATE hardcoded for v1
- Per-debater prompts tell each debater exactly which tool output to ground evidence in
- orchestrate_pending() is the main entry point called from watch_loop
- _start_round(): pending→running, posts debater mention comment, phase→awaiting_debaters
- _advance_awaiting_debaters(): polls for replies, handles timeout with partial evidence,
  posts judge comment, phase→awaiting_judge
- _advance_awaiting_judge(): polls for verdict; RACE FIX — update_issue_status() called
  BEFORE queue.update_status("done") so poll_once can never double-enqueue
- Detection: primary=author_id match, fallback=[{name} response]: content marker (enables tests)
- Restart-safe: phase field persisted on every mutation; in-flight rounds resume correctly

Extended src/coordinator/queue.py:
- Round gains phase, phase_entered_at, coordinator_comment_id, judge_comment_id fields
- DebateQueue.update_phase() and running() added
- All new fields default-empty so existing queue.json files load cleanly

Extended src/coordinator/multica_client.py:
- update_issue_status() convenience wrapper
- create_issue() for integration / smoke tests

Updated src/coordinator/__main__.py:
- _orchestrate_pending stub replaced with real import from orchestrator

Tests:
- tests/test_orchestrator.py: 32 new unit tests covering phase transitions, timeouts,
  race fix ordering, restart resume, full lifecycle
- tests/test_integration.py: @pytest.mark.integration test against real API
- smoke_test.py: standalone end-to-end script; ran against real API, verdict OK

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:43:17 +00:00
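The reply detection rule from commit 0e44846032 (primary match on `author_id`, fallback on a `[{name} response]:` content marker) might look like the sketch below. The comment dictionary keys and the decision to require the marker at the start of the content are assumptions.

```python
from typing import Optional


def is_reply_from(comment: dict, name: str, author_id: Optional[str]) -> bool:
    """Return True if the comment appears to come from the named agent."""
    # Primary: the comment's author id matches the agent's id.
    if author_id is not None and comment.get("author_id") == author_id:
        return True
    # Fallback: content marker, which lets tests simulate agent replies.
    marker = f"[{name} response]:"
    return comment.get("content", "").lstrip().startswith(marker)


by_author = {"author_id": "agent-1", "content": "Looks fine."}
by_marker = {"author_id": "tester", "content": "[Code Reviewer response]: LGTM"}
unrelated = {"author_id": "tester", "content": "Unrelated note"}
```

The fallback exists so that unit tests can post replies from a single test account while still exercising the phase transitions.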
m-senior-developer 8c9a174ddc WYL-44 follow-up: add tests and extract poll_once for testability
R1: Add unit tests (22 total, all passing)
- tests/test_queue.py: enqueue/persist, is_issue_pending_or_running for all
  statuses, update_status mutate+persist, save/load roundtrip, corrupt file
  handling, pending() filtering
- tests/test_watcher.py: FakeMulticaClient drives poll_once across first
  observation, same updated_at dedupe, updated_at change while pending/running
  dedupe, re-enqueue after done, multi-issue, mix of new and seen

Refactor: extract poll_once(client, state, queue, logger) from watch_loop so
tests can call it directly without mocking time.sleep.

R2: Document known race near is_issue_pending_or_running — comment added
noting that orchestrator marking round done before updating issue status can
fire a second round; WYL-45 must resolve atomically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:20:23 +00:00
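The dedupe scenarios listed for `poll_once` in commit 8c9a174ddc can be walked through with a simplified model. This sketch replaces the real `poll_once(client, state, queue, logger)` signature with plain dictionaries, so `poll_once_sketch`, the dictionary shapes, and the issue id are illustrative assumptions only.

```python
def poll_once_sketch(issues: list[dict], seen: dict, queue: dict) -> list[str]:
    """Enqueue issues whose updated_at changed and that have no active round.

    issues: dicts with "id" and "updated_at"
    seen:   issue id -> last seen updated_at
    queue:  issue id -> round status ("pending", "running", "done")
    """
    enqueued = []
    for issue in issues:
        issue_id, updated_at = issue["id"], issue["updated_at"]
        if seen.get(issue_id) == updated_at:
            continue  # same updated_at: nothing new to review
        seen[issue_id] = updated_at
        if queue.get(issue_id) in ("pending", "running"):
            continue  # a round is already active: avoid double-queuing
        queue[issue_id] = "pending"
        enqueued.append(issue_id)
    return enqueued


seen: dict = {}
queue: dict = {}
first = poll_once_sketch([{"id": "ISSUE-1", "updated_at": "t1"}], seen, queue)
repeat = poll_once_sketch([{"id": "ISSUE-1", "updated_at": "t1"}], seen, queue)
changed = poll_once_sketch([{"id": "ISSUE-1", "updated_at": "t2"}], seen, queue)
queue["ISSUE-1"] = "done"
after_done = poll_once_sketch([{"id": "ISSUE-1", "updated_at": "t3"}], seen, queue)
```

In this run only `first` and `after_done` contain the issue id; the repeated observation and the change during a pending round are both skipped.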
m-senior-developer 6ebd36bccc WYL-44: implement in_review watcher with persistent debate queue
- Add queue.py: DebateQueue backed by ~/.coordinator/queue.json
  (Round dataclass, enqueue/update_status/pending/is_issue_pending_or_running)
- Update watch_loop to call queue.enqueue() on in_review transitions instead
  of only logging; skip issues already pending/running to avoid double-queuing
- Add queue_file path to Config
- Add _orchestrate_pending stub (WYL-45 hook point)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:13:53 +00:00
m-platform-admin 6da434039c WYL-42: Python skeleton + in_review watcher loop
Minimum viable structure for the Tool-MAD coordinator:
- coordinator.config: env-loaded Config dataclass, writes state to ~/.coordinator/
- coordinator.multica_client: thin requests wrapper for issues/comments/agents
- coordinator.state: flat-json SeenState tracking issue_id -> last_seen_updated_at
- coordinator.__main__: watch_loop() that polls in_review and logs candidates
- README.md: why this exists + how to run

v0 only detects in_review transitions; convening debate rounds is WYL-45.
Dependencies: stdlib + requests (nothing else until a working v1 ships).
2026-04-15 23:04:06 +02:00
m-platform-admin 3935102d96 Initial commit 2026-04-15 23:00:55 +02:00