Repair \` → ` in YAML; live test finding from WYL-72 round 1
Judge-Gemini (on github-copilot/gemini-3.1-pro-preview) emitted reports with \` (backslash-backtick) inside double-quoted YAML strings, imitating markdown escaping (e.g. evidence: "see \`foo.py\`"). \` is not a valid YAML escape sequence, so yaml.safe_load rejected the entire report. Judge-GPT did not make this mistake, so the consensus degenerated from 3-to-2 parseable reports to 1-to-2 (and therefore produced a spurious single-judge convergence).

The fix is a targeted cleanup in _extract_yaml: replace \` with a literal ` before parsing. No other interpretation of \` exists in YAML, so this does not mask semantics.

Tests: 71 passed (added 2 for the new cleanup path).

Separately surfaced (not yet addressed in this commit):
- Judge-Claude on github-copilot/claude-sonnet-4.6 and github-copilot/claude-opus-4.5 returned model_not_supported. Per the GitHub Copilot Student plan docs (2026-03-13 update), Claude Opus/Sonnet are no longer student-selectable. Tower-Copilot-Claude now runs claude-haiku-4.5 (which is student-accessible).
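As a rough illustration of why \` is rejected, the sketch below checks a double-quoted scalar body against the escape characters listed in the YAML 1.2 specification. The names YAML_ESCAPE_CHARS and invalid_escapes are hypothetical helpers written for this note, not part of the commit, and the check is a simplification (it does not validate the hex digits after x, u or U).

```python
import re

# Characters that may follow a backslash inside a double-quoted YAML scalar,
# per the YAML 1.2 escape table (x, u and U introduce hex escapes; a literal
# newline after the backslash is an escaped line break).
# Hypothetical helper for illustration; not taken from the commit.
YAML_ESCAPE_CHARS = set('0abt\tnvfre "/\\N_LPxuU\n')


def invalid_escapes(scalar_body: str) -> list[str]:
    """Return backslash sequences in scalar_body that YAML does not define."""
    return [
        m.group(0)
        for m in re.finditer(r"\\(.)", scalar_body, re.DOTALL)
        if m.group(1) not in YAML_ESCAPE_CHARS
    ]


broken = 'see \\`foo.py\\`'            # the text: see \`foo.py\`
repaired = broken.replace("\\`", "`")  # the same replacement the commit applies

print(invalid_escapes(broken))    # the two backslash-backtick sequences
print(invalid_escapes(repaired))  # nothing left to reject
```

Because the backtick is absent from the escape table, a parser has no defined meaning to fall back on, which is why the commit treats the replacement as a repair and not a reinterpretation.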
@@ -182,15 +182,21 @@ def _extract_yaml(content: str) -> str:
     (``&quot;`` for ``"``, ``&gt;`` for ``>``, etc.). Agent replies are
     plain UTF-8 to begin with, so we unescape first, then extract.
 
+    Also repairs ``\\``` (backslash-backtick) sequences to literal backticks.
+    Some models emit evidence strings like ``evidence: "see \\`foo.py\\`"`` that
+    imitate markdown escaping, but ``\\``` is not a valid YAML escape — fixing
+    it here is a cleanup of an objective mistake, not toleration of malformed
+    semantics.
+
     Returns the YAML text (without fences), or the original content if no fence
     is found. The caller is responsible for deciding whether the raw content
     is parseable.
     """
     unescaped = html.unescape(content)
     m = _YAML_FENCE_RE.search(unescaped)
-    if m:
-        return m.group(1).strip()
-    return unescaped.strip()
+    text = m.group(1).strip() if m else unescaped.strip()
+    # Repair \` → ` (invalid YAML escape; models mean a literal backtick)
+    return text.replace("\\`", "`")
 
 
 def _parse_rubric(content: str) -> dict[str, Any] | None:
@@ -215,6 +215,33 @@ def test_parse_rubric_accepts_html_encoded_input():
     assert "checklist" in spec
 
 
+def test_extract_yaml_repairs_backslash_backtick():
+    # Gemini (and similar) emit \` inside double-quoted YAML strings, imitating
+    # markdown escaping. \` is not a valid YAML escape, so we repair it.
+    content = "evaluation_report:\n  rubric_scores:\n    - name: X\n      score: 4\n      evidence: \"see \\`foo.py\\` and \\`bar.py\\`\"\n"
+    y = _extract_yaml(content)
+    assert "\\`" not in y
+    assert "`foo.py`" in y
+
+
+def test_parse_judge_report_tolerates_backslash_backtick():
+    content = (
+        "```yaml\n"
+        "evaluation_report:\n"
+        "  score_calculation:\n"
+        "    final_score: 4.0\n"
+        "  rubric_scores:\n"
+        "    - name: Correctness\n"
+        "      score: 4\n"
+        "      weight: 1.0\n"
+        "      evidence: \"see \\`foo.py\\`\"\n"
+        "```"
+    )
+    r = _parse_judge_report(content)
+    assert r is not None
+    assert r["score_calculation"]["final_score"] == 4.0
+
+
 def test_parse_rubric_valid_flat():
     spec = _parse_rubric(f"```yaml\n{_rubric_yaml_sample()}\n```")
     assert spec is not None
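For readers who want to try the post-commit behavior outside the repository, here is a minimal standalone sketch of the three steps in _extract_yaml (HTML-unescape, fence extraction, backslash-backtick repair). The fence pattern is an assumption: the real _YAML_FENCE_RE is not shown in this diff, so the regex below is only a stand-in, and the function is named extract_yaml to make clear it is not the project's code.

```python
import html
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Assumption: stand-in for the project's _YAML_FENCE_RE, which this diff does
# not show. It captures the body of a fenced block tagged yaml, yml or nothing.
YAML_FENCE_RE = re.compile(FENCE + r"(?:yaml|yml)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_yaml(content: str) -> str:
    """Unescape HTML entities, strip a YAML fence if present, repair \\` to `."""
    unescaped = html.unescape(content)
    m = YAML_FENCE_RE.search(unescaped)
    text = m.group(1).strip() if m else unescaped.strip()
    # Backslash-backtick is not a YAML escape; models mean a literal backtick.
    return text.replace("\\`", "`")


reply = FENCE + 'yaml\nevidence: "see \\`foo.py\\`"\nnote: a &gt; b\n' + FENCE
print(extract_yaml(reply))
```

The repair runs after fence extraction, so it applies equally to fenced and unfenced replies, which matches the two added tests above.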