Repair \` → ` in YAML; live test finding from WYL-72 round 1

Judge-Gemini (on github-copilot/gemini-3.1-pro-preview) emitted reports with \` (backslash-backtick) inside double-quoted YAML strings, imitating markdown escaping (e.g. evidence: "see \`foo.py\`"). \` is not a valid YAML escape sequence, so yaml.safe_load rejected the entire report. Judge-GPT did not make this mistake, so the consensus degenerated from 3-to-2 parseable reports to 1-to-2 (and therefore produced a spurious single-judge convergence).

The fix is a targeted cleanup in _extract_yaml: replace \` with literal ` before parsing. No other interpretation of \` exists in YAML, so this does not mask semantics.

Tests: 71 passed (added 2 for the new cleanup path).

Separately surfaced (not yet addressed in this commit):

- Judge-Claude on github-copilot/claude-sonnet-4.6 and github-copilot/claude-opus-4.5 returned model_not_supported. Per GitHub Copilot Student plan docs (2026-03-13 update), Claude Opus/Sonnet are no longer student-selectable. Tower-Copilot-Claude now runs claude-haiku-4.5 (which is student-accessible).
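
For reference, the failure and the repair can be reproduced in isolation. The snippet below is a minimal sketch, assuming PyYAML is installed; `try_load` and `repair_backslash_backtick` are hypothetical helper names used only for illustration and are not part of the codebase (the real cleanup lives in _extract_yaml).

```python
import yaml  # PyYAML, the library that provides yaml.safe_load

# A double-quoted scalar containing backslash-backtick, as in the failing reports.
BROKEN = 'evidence: "see \\`foo.py\\`"\n'


def try_load(text: str):
    """Return (parsed, None) on success or (None, error message) on failure."""
    try:
        return yaml.safe_load(text), None
    except yaml.YAMLError as exc:
        return None, str(exc)


def repair_backslash_backtick(text: str) -> str:
    """Hypothetical stand-alone version of the cleanup: \\` becomes `."""
    return text.replace("\\`", "`")


# The raw text is rejected because \` is an unknown escape in a double-quoted scalar.
parsed_raw, error_raw = try_load(BROKEN)

# After the repair, the same text loads as an ordinary mapping.
parsed_fixed, error_fixed = try_load(repair_backslash_backtick(BROKEN))

print("raw input parsed:", error_raw is None)
print("repaired input:", parsed_fixed)
```

The replacement runs over the whole extracted text rather than per scalar, which keeps the cleanup a one-liner. A \` inside a single-quoted or plain scalar would be rewritten as well.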
2026-04-18 22:26:50 +02:00
parent d1039d01de
commit 1a77ddcb99
2 changed files with 36 additions and 3 deletions
@@ -182,15 +182,21 @@ def _extract_yaml(content: str) -> str:
    (``&#34;`` for ``"``, ``&gt;`` for ``>``, etc.).  Agent replies are
    plain UTF-8 to begin with, so we unescape first, then extract.

+    Also repairs ``\\``` (backslash-backtick) sequences to literal backticks.
+    Some models emit evidence strings like ``evidence: "see \\`foo.py\\`"`` that
+    imitate markdown escaping, but ``\\``` is not a valid YAML escape — fixing
+    it here is a cleanup of an objective mistake, not toleration of malformed
+    semantics.
+
    Returns the YAML text (without fences), or the original content if no fence
    is found.  The caller is responsible for deciding whether the raw content
    is parseable.
    """
    unescaped = html.unescape(content)
    m = _YAML_FENCE_RE.search(unescaped)
-    if m:
-        return m.group(1).strip()
-    return unescaped.strip()
+    text = m.group(1).strip() if m else unescaped.strip()
+    # Repair \` → ` (invalid YAML escape; models mean a literal backtick)
+    return text.replace("\\`", "`")


 def _parse_rubric(content: str) -> dict[str, Any] | None:
@@ -215,6 +215,33 @@ def test_parse_rubric_accepts_html_encoded_input():
    assert "checklist" in spec


+def test_extract_yaml_repairs_backslash_backtick():
+    # Gemini (and similar) emit \` inside double-quoted YAML strings, imitating
+    # markdown escaping.  \` is not a valid YAML escape, so we repair it.
+    content = "evaluation_report:\n  rubric_scores:\n    - name: X\n      score: 4\n      evidence: \"see \\`foo.py\\` and \\`bar.py\\`\"\n"
+    y = _extract_yaml(content)
+    assert "\\`" not in y
+    assert "`foo.py`" in y
+
+
+def test_parse_judge_report_tolerates_backslash_backtick():
+    content = (
+        "```yaml\n"
+        "evaluation_report:\n"
+        "  score_calculation:\n"
+        "    final_score: 4.0\n"
+        "  rubric_scores:\n"
+        "    - name: Correctness\n"
+        "      score: 4\n"
+        "      weight: 1.0\n"
+        "      evidence: \"see \\`foo.py\\`\"\n"
+        "```"
+    )
+    r = _parse_judge_report(content)
+    assert r is not None
+    assert r["score_calculation"]["final_score"] == 4.0
+
+
 def test_parse_rubric_valid_flat():
    spec = _parse_rubric(f"```yaml\n{_rubric_yaml_sample()}\n```")
    assert spec is not None