pyvene Reference Documentation

This directory contains reference material for pyvene, a library for running interventions on the internal activations of PyTorch models.

Contents

  • api.md - Complete API reference for IntervenableModel, intervention types, and configurations
  • tutorials.md - Step-by-step tutorials for causal tracing, activation patching, and trainable interventions

Installation

pip install pyvene

Basic Usage

import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Define intervention
config = pv.IntervenableConfig(
    representations=[
        pv.RepresentationConfig(
            layer=5,
            component="block_output",
            intervention_type=pv.VanillaIntervention,
        )
    ]
)

# Create intervenable model
intervenable = pv.IntervenableModel(config, model)

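# Run intervention (replace base activations with source activations)
base_inputs = tokenizer("The cat sat on the", return_tensors="pt")
source_inputs = tokenizer("The dog ran through the", return_tensors="pt")

# The call returns (original_outputs, counterfactual_outputs); the first
# element is None unless output_original_output=True is passed.
_, counterfactual_outputs = intervenable(
    base=base_inputs,
    sources=[source_inputs],
)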
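
To make concrete what the intervention call above computes, here is a minimal sketch of activation patching on a toy model in plain Python. It is not pyvene code: the `run` helper, the two toy layers, and the `patch` dict are all hypothetical names introduced for illustration. The three steps (cache the source run, run the base, re-run the base with one activation overwritten) mirror the idea behind a VanillaIntervention at a single layer and position.

```python
def run(layers, x, patch=None, cache=None):
    """Run x through each layer in turn.

    patch: optional dict with keys "layer", "position", "value"; after the
           named layer runs, the activation at that position is overwritten.
    cache: optional dict that receives a copy of every layer's output.
    """
    h = list(x)
    for i, layer in enumerate(layers):
        h = layer(h)
        if patch is not None and patch["layer"] == i:
            h[patch["position"]] = patch["value"]
        if cache is not None:
            cache[i] = list(h)
    return h


# Hypothetical two-layer "model" acting on a two-position sequence.
toy_layers = [
    lambda h: [v + 1 for v in h],
    lambda h: [v * 2 for v in h],
]

base, source = [1, 2], [10, 20]

# 1. Run the source input and record its activations.
source_cache = {}
run(toy_layers, source, cache=source_cache)

# 2. Run the base input unmodified, for comparison.
base_out = run(toy_layers, base)

# 3. Re-run the base input, overwriting layer 0 / position 0
#    with the activation recorded from the source run.
patched_out = run(
    toy_layers,
    base,
    patch={"layer": 0, "position": 0, "value": source_cache[0][0]},
)

print(base_out)     # [4, 6]
print(patched_out)  # [22, 6]
```

Only the patched position changes downstream; the second position still carries the base value, which is what lets an intervention localize an effect to a specific layer and position.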

Key Concepts

Intervention Types

  • VanillaIntervention: Replace base activations with source activations (interchange intervention)
  • AdditionIntervention: Add source activations to base activations
  • ZeroIntervention: Set activations to zero (ablation)
  • CollectIntervention: Record activations without modifying them
  • RotatedSpaceIntervention: Trainable intervention that interchanges activations in a learned rotated subspace, used for causal discovery
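
The intervention types above differ only in what they do to an activation vector. The following is a conceptual sketch in plain Python, not pyvene's implementation: the function names are hypothetical, activations are plain lists, and the rotated case is restricted to two dimensions with a fixed angle `theta`, whereas a trainable intervention would learn the rotation.

```python
import math


def vanilla(base, source):
    """Interchange: the base activation is replaced by the source activation."""
    return list(source)


def addition(base, source):
    """The source activation is added to the base activation elementwise."""
    return [b + s for b, s in zip(base, source)]


def zero(base):
    """Ablation: every entry of the activation is set to zero."""
    return [0.0] * len(base)


def collect(base, store):
    """Record a copy of the activation and pass it through unchanged."""
    store.append(list(base))
    return list(base)


def rotated_swap_2d(base, source, theta, k=1):
    """Interchange the first k coordinates in a basis rotated by theta.

    Both vectors are rotated, the first k rotated coordinates of the base
    are replaced by those of the source, and the result is rotated back.
    """
    c, s = math.cos(theta), math.sin(theta)

    def rotate(v):
        return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

    def unrotate(v):
        return [c * v[0] + s * v[1], -s * v[0] + c * v[1]]

    rotated_base, rotated_source = rotate(base), rotate(source)
    mixed = rotated_source[:k] + rotated_base[k:]
    return unrotate(mixed)


base, source = [1.0, 2.0], [10.0, 20.0]
collected = []

print(vanilla(base, source))    # [10.0, 20.0]
print(addition(base, source))   # [11.0, 22.0]
print(zero(base))               # [0.0, 0.0]
print(collect(base, collected)) # [1.0, 2.0]
print(rotated_swap_2d(base, source, theta=0.0, k=1))  # [10.0, 2.0]
```

With `theta=0.0` the rotated swap reduces to replacing the first coordinate only; with `k=2` it swaps the whole vector and so matches `vanilla` for any angle, up to floating-point error.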

Components

Each intervention targets one named component within a layer of the model:

  • block_input, block_output
  • mlp_input, mlp_output, mlp_activation
  • attention_input, attention_output
  • query_output, key_output, value_output
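
A common pattern is to sweep one intervention over many layers and components. The sketch below only builds the list of targets as plain dicts; the `sweep_specs` helper and the choice of three components are hypothetical. It assumes a 12-layer model such as the `gpt2` checkpoint used in Basic Usage, and it assumes that dicts with `layer` and `component` keys can be handed to pyvene in place of `RepresentationConfig` objects, which should be checked against api.md before relying on it.

```python
from itertools import product

# Hypothetical subset of the component names listed above.
COMPONENTS = ["block_output", "mlp_output", "attention_output"]


def sweep_specs(num_layers, components=COMPONENTS):
    """Return one {"layer", "component"} dict per (layer, component) pair."""
    return [
        {"layer": layer, "component": component}
        for layer, component in product(range(num_layers), components)
    ]


# Assumption: 12 transformer layers, as in the small GPT-2 checkpoint.
specs = sweep_specs(num_layers=12)

print(len(specs))  # 36
print(specs[0])    # {'layer': 0, 'component': 'block_output'}
```

Each entry would then be used to build a separate intervenable model, so that the effect of each (layer, component) pair can be measured in isolation.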

HuggingFace Integration

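Interventions, including trained ones, can be saved to and loaded from the HuggingFace Hub, so that others can reproduce an experiment with the same intervention weights.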