Port CEK's debate-round instructions verbatim; prevent sycophantic convergence

Live round on WYL-72 exposed that my debate-round comment produced
sycophancy, not debate.  Judge-Gemini moved from a 4.0 ACCEPT stance to
"I agree my initial score was too lenient... I have adjusted my scores
to 3s and 2s" after reading Judge-GPT's 2.92 report — without defending
its original scores or challenging GPT's evidence.  Classic social-pressure
convergence.

Root cause: my debate comment said "You may hold your position if you
have new evidence; you may move if you find the other reasoning more
grounded.  Do not split the difference to compromise."  That phrasing
is weaker than CEK's intent, AND it dropped every structural
anti-sycophancy instruction CEK spelled out in judge-with-debate/SKILL.md:

  Missing: "Identify disagreements (where your scores differ by >1 point)"
  Missing: "Defend your position with evidence from the specification"
  Missing: "Challenge the other judge's position with counter-evidence"
  Missing: "Only revise if you find their evidence compelling"
  Missing: "Defend your original scores if you still believe them"
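
The first missing instruction, the ">1 point" disagreement rule, can be sketched as a small check. This is only an illustration: the score layout (judge name mapped to per-criterion scores), the function name, and the sample values are all hypothetical and are not taken from CEK or from this repository.

```python
from itertools import combinations


def find_disagreements(scores, threshold=1.0):
    """Return (criterion, judge_a, judge_b, gap) for every gap above threshold.

    `scores` is an assumed shape: {judge_name: {criterion: score}}.
    """
    found = []
    # Compare every pair of judges on the criteria they both scored.
    for judge_a, judge_b in combinations(sorted(scores), 2):
        shared = scores[judge_a].keys() & scores[judge_b].keys()
        for criterion in sorted(shared):
            gap = abs(scores[judge_a][criterion] - scores[judge_b][criterion])
            if gap > threshold:  # strictly more than 1 point, per the instruction
                found.append((criterion, judge_a, judge_b, gap))
    return found


# Made-up scores, for illustration only.
example = {
    "A": {"correctness": 4.0, "tests": 4.0},
    "B": {"correctness": 3.0, "tests": 2.5},
}
print(find_disagreements(example))
```

A gap of exactly 1 point is not reported here, since the instruction says "differ by >1 point".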

Also: I asked judges to post a REVISED report (implicitly retracting
their prior position).  CEK asks them to APPEND a debate round section
to their prior report, keeping both visible so the revision is a change
ON TOP OF the original rather than a replacement.
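
The APPEND-versus-REVISED distinction above can be sketched as follows. This is a minimal illustration, assuming a judge's reply is plain text; `append_debate_round` and its parameters are hypothetical names, not part of this codebase.

```python
def append_debate_round(prior_report: str, round_num: int, debate_notes: str) -> str:
    """Keep the prior report in place and add a new debate-round section after it.

    Only trailing whitespace of the prior report is trimmed; its content is
    left untouched, so the original position stays visible next to the revision.
    """
    return (
        f"{prior_report.rstrip()}\n\n"
        f"## Debate Round {round_num}\n\n"
        f"{debate_notes.strip()}\n"
    )


reply = append_debate_round("overall_score: 4.0", 1, "I keep my score because ...")
print(reply)
```

Because the earlier text is never overwritten, a reader can see what changed between rounds and whether the change came with a defense.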

Fixed by porting CEK's instruction block verbatim into _build_debate_round_comment.
Added a regression test that fails if any future edit removes these exact
clauses.

Tests: 72 passed (+1 regression test).
2026-04-18 22:39:30 +02:00
parent 1a77ddcb99
commit 00ff80fbbb
2 changed files with 49 additions and 3 deletions
+27 -3
@@ -419,7 +419,14 @@ def _build_debate_round_comment(
     round_num: int,
     prior_reports: dict[str, str],
 ) -> str:
-    """Post a debate-round mention comment showing all 3 judges each others' prior reports."""
+    """Post a debate-round mention comment showing all 3 judges each others' prior reports.
+
+    Instruction text is taken directly from CEK's judge-with-debate skill
+    (plugins/sadd/skills/judge-with-debate/SKILL.md, "Additional debate
+    instructions" and "CRITICAL" blocks). Rewriting this into softer
+    phrasing produced sycophantic convergence in the first live run; we
+    stick to CEK's wording.
+    """
     mentions = " ".join(
         f"[@{name}](mention://agent/{judge_agent_ids[name]})"
         if judge_agent_ids.get(name)
@@ -429,13 +436,30 @@ def _build_debate_round_comment(
     parts = [
         mentions,
         f"**Debate round {round_num}** — your prior reports disagree beyond the consensus threshold.",
-        "Each of you: read the others' reports below, re-examine the commit, then post a REVISED `evaluation_report` YAML. You may hold your position if you have new evidence; you may move if you find the other reasoning more grounded. Do not split the difference to compromise.",
+        "Follow your full judge process as defined in your agent instructions.",
+        "",
+        "**Additional debate instructions:**",
+        "1. Read your previous assessment (your own prior report below).",
+        "2. Read all other judges' reports (also below).",
+        "3. Identify disagreements (where your scores differ by >1 point).",
+        "4. For each major disagreement:",
+        " - State the disagreement clearly",
+        " - Defend your position with evidence from the solution and evaluation specification",
+        " - Challenge the other judge's position with counter-evidence",
+        " - Consider whether their evidence changes your view",
+        "5. Post a new reply containing a `## Debate Round {R}` section that APPENDS to your prior report (include your prior report verbatim first, then the new section).",
+        "6. At the end of your reply, state whether you reached agreement and with which judge. Include your (possibly revised) overall score and per-criterion scores in the same YAML format.",
+        "",
+        "**CRITICAL:**",
+        "- Ground your arguments in the evaluation specification criteria.",
+        "- **Only revise if you find their evidence compelling.**",
+        "- **Defend your original scores if you still believe them.**",
+        "- Quote specific evidence from the solution.",
     ]
     for name in JUDGE_NAMES:
         content = prior_reports.get(name, "*(no prior report)*")
         parts.append(f"---\n\n### {name} — prior report\n\n{content}")
     parts.append("---")
-    parts.append("Post your revised report as a NEW reply. Do not edit prior reports.")
     return "\n\n".join(parts)
+22
@@ -396,6 +396,28 @@ def test_debate_round_comment_quotes_all_prior_reports():
         assert f"REPORT FROM {n}" in body
 
 
+def test_debate_round_comment_contains_cek_anti_sycophancy_language():
+    """Regression: do not soften CEK's critical debate instructions.
+
+    The first live run produced sycophantic convergence because the earlier,
+    softer phrasing dropped CEK's explicit 'only revise if compelling / defend
+    original if you still believe them' instructions. Any future edit that
+    removes these exact clauses should be caught here.
+    """
+    judge_ids = {n: f"id-{n}" for n in JUDGE_NAMES}
+    body = _build_debate_round_comment(judge_ids, 1, {n: "x" for n in JUDGE_NAMES})
+    # CEK's structural instructions
+    assert "Identify disagreements (where your scores differ by >1 point)" in body
+    assert "Defend your position with evidence" in body
+    assert "Challenge the other judge's position with counter-evidence" in body
+    # CEK's CRITICAL anti-sycophancy list
+    assert "Only revise if you find their evidence compelling." in body
+    assert "Defend your original scores if you still believe them." in body
+    # APPEND not REVISE
+    assert "APPENDS to your prior report" in body
+    assert "REVISED" not in body  # the old softer phrasing is gone
+
+
 def test_retrigger_comment_has_anchor_and_no_drift_instructions():
     body = _build_retrigger_comment("Worker", "agent-worker", "Original desc line.", "VERDICT: REJECT", "r1")
     assert "mention://agent/agent-worker" in body