Autoresearch Demo: The RL Algorithm Brain Scan
Paper: rl_algorithm_brain_scan.pdf
Skills Used: Autoresearch, ML Paper Writing, GRPO RL Training, TRL, SAELens, TransformerLens
What Happened
An AI agent autonomously investigated what RL alignment algorithms actually do to model internals, a question that, per the agent's own literature survey, no prior work had systematically addressed. The agent:
- Surveyed literature on RLOO, GRPO, and DPO, identifying the gap: nobody had compared what these algorithms do at the weight and feature level on the same base model
- Ran inner-loop experiments: trained GPT-2 Small with RLOO, GRPO, and DPO on sentiment and toxicity tasks, then analyzed weight deltas via SVD and feature changes via SAELens
- Discovered three key findings through outer-loop synthesis:
  - DPO is a rank-1 perturbation (the top-1 SVD direction recovers 95.6% of the behavioral effect)
  - Online RL (RLOO/GRPO) produces distributed, structure-preserving modifications (effective rank 200 vs 119 for DPO)
  - DPO creates a "concentrated perturbation cascade" that disrupts 2x more SAE features in later layers than online RL does
- Validated the findings with SVD ablation experiments, which provide causal rather than merely correlational evidence
- Wrote the paper in ICML format using the ml-paper-writing skill
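The weight-delta SVD analysis and the SVD ablation mentioned above can be illustrated with a minimal NumPy sketch. It is not the agent's actual code: the function names (`singular_directions`, `keep_top_k`, `energy_fraction`) and the synthetic matrices are hypothetical, and the squared-singular-value "energy fraction" is only a weight-space proxy. The 95.6% figure reported in this demo refers to recovered behavioral effect, which would require evaluating the ablated model on the task.

```python
import numpy as np

def singular_directions(w_base, w_tuned):
    """SVD of the weight delta (tuned minus base) for one weight matrix."""
    delta = w_tuned - w_base
    return np.linalg.svd(delta, full_matrices=False)

def keep_top_k(w_base, w_tuned, k):
    """Rebuild tuned weights using only the top-k singular directions of the delta."""
    u, s, vt = singular_directions(w_base, w_tuned)
    return w_base + (u[:, :k] * s[:k]) @ vt[:k, :]

def energy_fraction(w_base, w_tuned, k):
    """Share of the delta's squared Frobenius norm held by its top-k directions.

    Assumption: a weight-space proxy, not the behavioral-recovery metric.
    """
    _, s, _ = singular_directions(w_base, w_tuned)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

# Synthetic stand-ins for a base matrix and two kinds of fine-tuning update.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 48))
# A perfectly rank-1 update (the concentrated extreme).
w_rank1 = w_base + 0.1 * np.outer(rng.normal(size=64), rng.normal(size=48))
# A dense, distributed update (the opposite extreme).
w_dense = w_base + 0.1 * rng.normal(size=(64, 48))

print("rank-1 update, top-1 energy:", energy_fraction(w_base, w_rank1, 1))
print("dense update,  top-1 energy:", energy_fraction(w_base, w_dense, 1))
```

In an ablation along these lines, `keep_top_k` would be applied to each weight matrix of the fine-tuned model and the resulting model re-evaluated on the task; how the original experiments selected matrices and measured recovery is not specified here.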
Key Findings
- DPO is rank-1 alignment: A single SVD direction per weight matrix recovers 95.6% of DPO's behavioral effect. GRPO needs 50+ directions for equivalent recovery.
- Online RL preserves structure: RLOO and GRPO maintain higher effective rank (200 vs 119 for DPO) and better preserve the base model's SAE feature structure (Jaccard similarity 0.83 vs 0.69 for DPO)
- DPO's concentrated perturbation cascade: Despite lower-rank changes, DPO disrupts 2x more SAE features in later layers (1619 vs 527-870 for RLOO/GRPO), so its perturbations are amplified as they pass through the network
- Results hold across sentiment and toxicity tasks with statistical significance (n=3 seeds, non-overlapping CIs)
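The two structure metrics quoted above, effective rank and Jaccard similarity of SAE feature sets, can be sketched as follows. The exact definitions used in the paper are not given in this summary, so both are assumptions: effective rank is taken here as the exponential of the Shannon entropy of the normalized singular values, and Jaccard similarity is computed over hypothetical sets of active SAE feature indices.

```python
import numpy as np

def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank (assumed definition, not confirmed by the source)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)   # normalize singular values into a distribution
    p = p[p > eps]            # drop numerically zero entries before taking logs
    return float(np.exp(-np.sum(p * np.log(p))))

def jaccard(features_a, features_b):
    """Jaccard similarity |A & B| / |A | B| between two sets of feature indices."""
    a, b = set(features_a), set(features_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy checks on matrices whose rank is known by construction.
full_rank = np.eye(4)                           # four equal singular values
rank_one = np.outer(np.ones(3), np.ones(5))     # a single nonzero singular value
print(effective_rank(full_rank), effective_rank(rank_one))

# Hypothetical active-feature index sets for a base and a fine-tuned model.
base_features = {1, 2, 3}
tuned_features = {2, 3, 4}
print(jaccard(base_features, tuned_features))   # 2 shared / 4 total = 0.5
```

Under this definition, a higher effective rank means the update is spread over more singular directions, and a Jaccard value closer to 1 means the fine-tuned model activates nearly the same SAE features as the base model.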
Why This Demo Matters
This demonstrates autoresearch orchestrating multiple domain skills together:
- Post-training skills (TRL, GRPO) for training the RL models
- Interpretability skills (SAELens, TransformerLens) for analyzing what changed
- Paper writing skill for producing the ICML submission
- The two-loop architecture let the agent both run experiments and synthesize them into mechanistic understanding: "DPO is a rank-1 perturbation" is a conceptual insight, not just a metric