
m-platform-admin 00ff80fbbb Port CEK's debate-round instructions verbatim; prevent sycophantic convergence

Live round on WYL-72 exposed that my debate-round comment produced
sycophancy, not debate.  Judge-Gemini moved from a 4.0 ACCEPT stance to
"I agree my initial score was too lenient... I have adjusted my scores
to 3s and 2s" after reading Judge-GPT's 2.92 report — without defending
its original scores or challenging GPT's evidence.  Classic social-pressure
convergence.

Root cause: my debate comment said "You may hold your position if you
have new evidence; you may move if you find the other reasoning more
grounded.  Do not split the difference to compromise."  That phrasing
is weaker than CEK's intent, and the comment dropped every structural
anti-sycophancy instruction CEK spelled out in judge-with-debate/SKILL.md:

  Missing: "Identify disagreements (where your scores differ by >1 point)"
  Missing: "Defend your position with evidence from the specification"
  Missing: "Challenge the other judge's position with counter-evidence"
  Missing: "Only revise if you find their evidence compelling"
  Missing: "Defend your original scores if you still believe them"
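
The first missing clause ("scores differ by >1 point") is the only one with a mechanical definition, so it can be illustrated in code. This is a minimal sketch, not CEK's implementation: the function name, the per-criterion dictionary layout, and the sample scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def find_disagreements(
    mine: Dict[str, float],
    theirs: Dict[str, float],
    threshold: float = 1.0,
) -> List[Tuple[str, float, float]]:
    """Return (criterion, my_score, their_score) for every shared criterion
    whose scores differ by more than `threshold` points."""
    shared = sorted(mine.keys() & theirs.keys())
    return [
        (name, mine[name], theirs[name])
        for name in shared
        if abs(mine[name] - theirs[name]) > threshold
    ]


# Made-up scores for illustration only; not taken from the WYL-72 round.
judge_a = {"correctness": 4.0, "scope": 4.0, "evidence": 4.0}
judge_b = {"correctness": 3.5, "scope": 2.0, "evidence": 2.5}
disagreements = find_disagreements(judge_a, judge_b)
```

Only the criteria returned here would need a defend-or-challenge exchange; the rest are already within one point of each other.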

Also: I asked judges to post a REVISED report (implicitly retracting
their prior position).  CEK asks them to APPEND a debate round section
to their prior report, keeping both visible so the revision is a change
ON TOP OF the original rather than a replacement.

Fixed by porting CEK's instruction block verbatim into _build_debate_round_comment.
Added a regression test that fails if any future edit removes these exact
clauses.

Tests: 72 passed (+1 regression test).

2026-04-18 22:39:30 +02:00

src/coordinator | Port CEK's debate-round instructions verbatim; prevent sycophantic convergence | 2026-04-18 22:39:30 +02:00
tests | Port CEK's debate-round instructions verbatim; prevent sycophantic convergence | 2026-04-18 22:39:30 +02:00
.gitignore | WYL-42: Python skeleton + in_review watcher loop | 2026-04-15 23:04:06 +02:00
pyproject.toml | Replace hand-written debater pipeline with CEK judge-with-debate | 2026-04-18 22:01:18 +02:00
README.md | WYL-42: Python skeleton + in_review watcher loop | 2026-04-15 23:04:06 +02:00
smoke_test.py | Implement debate round orchestration (WYL-45) | 2026-04-15 21:43:17 +00:00

README.md

coordinator

Tool-MAD middle-management layer for multica.

Why this exists

Multica's agents produce work and self-set issue status to in_review. Nothing in multica ever looks at that state — it's a dead end. Reviewers, when mentioned manually, tend to catch surface issues (length, structure) and miss semantic ones (scope drift, fabricated experiments, answering the wrong question).

This daemon is the missing middle-management layer. When an issue transitions to in_review, it:

1. Convenes a debate round: posts one comment on the issue mentioning a fixed set of debaters, each given a role-specific evidence-gathering prompt (e.g. Senior Developer greps the committed code for LLM API calls; Code Reviewer diffs the described method against the source; Project Manager Senior checks scope satisfaction against the original description).
2. Waits for every debater to reply (up to a timeout).
3. Posts a second comment mentioning the judge (Reality Checker), with the assembled debater transcript + the original issue body, and a hardcoded decision rule.
4. Parses the judge's structured verdict (VERDICT: ACCEPT or VERDICT: REJECT\n- R<n>: <failure>) and acts:
   - ACCEPT → PUT status done, post acceptance summary
   - REJECT → PUT status in_progress, post rejection listing every failure, re-trigger the original assignee
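
The verdict format above is simple enough to parse with two regular expressions. The following is a minimal sketch under that assumption, not the parser planned for WYL-46; `Verdict`, `parse_verdict`, and `next_status` are hypothetical names, and the handling of a missing VERDICT line (raising `ValueError`) is a choice made for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List

# "VERDICT: ACCEPT" or "VERDICT: REJECT" on a line of its own.
_VERDICT_RE = re.compile(r"^VERDICT:[ \t]*(ACCEPT|REJECT)[ \t]*$", re.MULTILINE)
# "- R<n>: <failure>" lines that follow a REJECT verdict.
_FAILURE_RE = re.compile(r"^-[ \t]*R(\d+):[ \t]*(.+?)[ \t]*$", re.MULTILINE)


@dataclass
class Verdict:
    accepted: bool
    failures: List[str] = field(default_factory=list)


def parse_verdict(comment: str) -> Verdict:
    """Extract the structured verdict from the judge's comment text."""
    match = _VERDICT_RE.search(comment)
    if match is None:
        raise ValueError("no VERDICT line found in judge comment")
    if match.group(1) == "ACCEPT":
        return Verdict(accepted=True)
    # Only look for failure lines after the VERDICT line itself.
    tail = comment[match.end():]
    failures = [f"R{n}: {text}" for n, text in _FAILURE_RE.findall(tail)]
    return Verdict(accepted=False, failures=failures)


def next_status(verdict: Verdict) -> str:
    """Map a verdict to the issue status named in the list above."""
    return "done" if verdict.accepted else "in_progress"
```

For example, `parse_verdict("VERDICT: REJECT\n- R1: scope drift")` yields a rejected verdict with one failure, and `next_status` maps it to `in_progress`.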

The pattern is Tool-MAD (Multi-Agent Debate with heterogeneous tool augmentation). Reference: arxiv 2601.04742.

Why debaters catch what a single judge misses

Each debater runs on a different agent (different runtime, different tool access, different role prompt) and is forced to ground its argument in a specific tool's output — not in its own recollection of the text. Example: if the question is "does this paper describe real LLM agent experiments", Senior Developer's grep for anthropic|openai|ollama|model= in the committed code either returns hits or it doesn't — that's a hard fact, not an opinion. The judge reads 4 grounded arguments and applies the decision rule "any debater reporting evidence of scope drift ⇒ REJECT."
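
To make the "hard fact, then decision rule" idea concrete, here is a small sketch. The regular expression is the grep pattern quoted above. Everything else is an assumption made for illustration: the `DebaterReport` shape, the restriction to `.py` files, and the choice to treat "no hits" as evidence of scope drift for the LLM-experiments question.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List

# Pattern taken from the Senior Developer example above.
LLM_CALL_RE = re.compile(r"anthropic|openai|ollama|model=")


def grep_llm_calls(root: Path, suffixes=(".py",)) -> List[str]:
    """Return 'path:lineno: line' for every matching line under `root`."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            if LLM_CALL_RE.search(line):
                rel = path.relative_to(root).as_posix()
                hits.append(f"{rel}:{lineno}: {line.strip()}")
    return hits


@dataclass
class DebaterReport:
    role: str
    scope_drift: bool
    evidence: List[str]


def senior_developer_report(root: Path) -> DebaterReport:
    """Assumed rule: a paper claiming LLM experiments with no LLM calls
    in the committed code counts as evidence of scope drift."""
    hits = grep_llm_calls(root)
    return DebaterReport(
        role="Senior Developer",
        scope_drift=not hits,
        evidence=hits or ["no LLM API calls found in committed code"],
    )


def decide(reports: Iterable[DebaterReport]) -> str:
    """Any debater reporting evidence of scope drift => REJECT."""
    return "REJECT" if any(r.scope_drift for r in reports) else "ACCEPT"
```

The point of the split is that `grep_llm_calls` produces evidence that can be checked by rerunning it, while `decide` stays a fixed rule that no debater can argue its way around.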

A naive LLM-as-a-judge reviewer reads the paper and scores it on surface dimensions. An agent-as-judge driven by debaters catches the underlying substitution. See WYL-41 for the live failure case that motivated this build.

Run

pip install -e .
mkdir -p ~/.coordinator
cat > ~/.coordinator/env <<'EOF'
COORDINATOR_SERVER_URL=http://localhost:8089
COORDINATOR_WORKSPACE_ID=<wid>
COORDINATOR_TOKEN=<coordinator-member-pat>
EOF
chmod 600 ~/.coordinator/env
coordinator

Logs go to stderr and to ~/.coordinator/coordinator.log. State file is ~/.coordinator/seen.json.
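
How the daemon reads `~/.coordinator/env` is not shown here. As a rough sketch, a file of `KEY=VALUE` lines like the one above could be parsed and validated as follows. The three variable names come from the Run section; `parse_env`, `missing_keys`, and `load_config` are hypothetical helpers.

```python
from pathlib import Path
from typing import Dict, List

REQUIRED_KEYS = (
    "COORDINATOR_SERVER_URL",
    "COORDINATOR_WORKSPACE_ID",
    "COORDINATOR_TOKEN",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def load_config(path: Path) -> Dict[str, str]:
    """Read the env file and fail early if a required key is missing."""
    values = parse_env(path.read_text())
    missing = missing_keys(values)
    if missing:
        raise KeyError(f"missing settings in {path}: {', '.join(missing)}")
    return values
```

Failing at startup on a missing key is a design choice for this sketch; it keeps a misconfigured daemon from polling with an empty token.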

Status

WYL-42 skeleton + watcher (this commit)
WYL-43 dedicated admin PAT
WYL-44 hook watcher to real round trigger
WYL-45 debate round orchestration
WYL-46 verdict parser + action executor
WYL-47 dry-run against WYL-41