feat: add Orchestra-Research reference ARAs and skills library

2026-05-05 23:28:05 +02:00
parent 846fe8b90e
commit 964c1dacc9
36 changed files with 1750 additions and 0 deletions
@@ -0,0 +1,33 @@
+---
+title: "Example Paper: Attention Is All You Need"
+authors: [Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin]
+year: 2017
+venue: "NeurIPS"
+doi: "arXiv:1706.03762"
+ara_version: "1.0"
+domain: "Natural Language Processing"
+keywords: [transformer, attention, sequence-to-sequence, machine translation]
+claims_summary:
+  - "Self-attention alone achieves SOTA on machine translation"
+  - "Transformers train faster than recurrent/convolutional alternatives"
+abstract: "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks..."
+---
+
+# Attention Is All You Need
+
+## Overview
+
+This is a minimal example ARA artifact demonstrating the format. A real artifact would contain complete content in every file.
+
+## Layer Index
+
+### Cognitive Layer (`/logic`)
+| File | Description |
+|------|-------------|
+| [problem.md](logic/problem.md) | Observations about RNN limitations -> transformer insight |
+| [claims.md](logic/claims.md) | 2 falsifiable claims (C01-C02) |
+
+### Exploration Graph (`/trace`)
+| File | Description |
+|------|-------------|
+| [exploration_tree.yaml](trace/exploration_tree.yaml) | Minimal 3-node research DAG |
@@ -0,0 +1,17 @@
+# Claims
+
+## C01: Attention-only architecture achieves SOTA
+- **Statement**: A model based entirely on self-attention, without recurrence or convolution, achieves state-of-the-art BLEU on WMT 2014 English-to-German translation.
+- **Status**: supported
+- **Falsification criteria**: A recurrent or convolutional model trained under identical conditions achieves higher BLEU.
+- **Proof**: [E01]
+- **Dependencies**: []
+- **Tags**: architecture, translation, BLEU
+
+## C02: Transformers train faster
+- **Statement**: The Transformer requires significantly less training time than architectures based on recurrent or convolutional layers.
+- **Status**: supported
+- **Falsification criteria**: An RNN-based model achieves comparable quality with equal or less compute.
+- **Proof**: [E02]
+- **Dependencies**: [C01]
+- **Tags**: efficiency, training-time
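A minimal sketch, outside the committed file, of how the `Dependencies` fields above could be checked for unknown ids and cycles. The dictionary layout and the `dependency_order` helper are illustrative assumptions, not part of any ARA schema or tooling shown in this commit.

```python
# Hypothetical in-memory form of the two claims above. The field names mirror
# the Markdown bullets but are an assumption, not a documented ARA schema.
claims = {
    "C01": {"proof": ["E01"], "dependencies": []},
    "C02": {"proof": ["E02"], "dependencies": ["C01"]},
}


def dependency_order(claims):
    """Return claim ids so that every claim comes after its dependencies.

    Raises ValueError on an unknown dependency id or a dependency cycle.
    """
    order = []
    state = {}  # claim id -> 1 while being visited, 2 once finished

    def visit(cid):
        if cid not in claims:
            raise ValueError(f"unknown claim: {cid}")
        if state.get(cid) == 2:
            return
        if state.get(cid) == 1:
            raise ValueError(f"dependency cycle at {cid}")
        state[cid] = 1
        for dep in claims[cid]["dependencies"]:
            visit(dep)
        state[cid] = 2
        order.append(cid)

    for cid in sorted(claims):
        visit(cid)
    return order


print(dependency_order(claims))
```

A depth-first traversal is used here only because it reports cycles and missing ids in one pass; any topological sort would serve equally well.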
@@ -0,0 +1,30 @@
+# Problem Specification
+
+## Observations
+
+### O1: Sequential computation bottleneck
+- **Statement**: RNNs process tokens sequentially, preventing parallelization and limiting training speed on modern hardware.
+- **Evidence**: Section 1, established practice
+- **Implication**: The number of sequential steps, and with it the non-parallelizable part of training time, grows linearly with sequence length.
+
+### O2: Long-range dependency decay
+- **Statement**: Despite gating mechanisms (LSTM, GRU), learning dependencies between distant positions remains difficult.
+- **Evidence**: Section 1, prior work
+- **Implication**: Quality degrades for long sequences.
+
+## Gaps
+
+### G1: No fully parallel sequence model
+- **Statement**: No competitive sequence transduction model eliminates sequential computation entirely.
+- **Caused by**: O1
+- **Existing attempts**: Convolutional models (ByteNet, ConvS2S) reduce but do not eliminate the bottleneck.
+- **Why they fail**: Relating two distant positions still takes a number of stacked operations that grows with their distance: O(log n) for ByteNet and O(n/k) for ConvS2S.
+
+## Key Insight
+- **Insight**: Self-attention computes all pairwise token interactions in O(1) sequential operations, enabling full parallelization while maintaining direct access to any position.
+- **Derived from**: O1, O2
+- **Enables**: A fully attention-based architecture (Transformer) that is both faster to train and better at capturing long-range dependencies.
+
+## Assumptions
+- A1: Sufficient GPU memory to store the full attention matrix (O(n^2) space).
+- A2: Positional information can be injected via learned or sinusoidal encodings.
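A minimal sketch, outside the committed file, of the key insight and of assumption A1: scaled dot-product attention, softmax(QKᵀ/√d_k)·V, written in plain Python. The function name `self_attention` and the list-of-lists representation are illustrative assumptions. A real implementation would use batched matrix operations. Each query row is computed independently of the others, which is what permits parallel evaluation, and the scores form an n × n table, which is the O(n^2) memory cost.

```python
import math


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:  # no row depends on another row's result
        # One score per (query, key) pair: n * n scores in total.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
print(len(attended), len(attended[0]))
```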
@@ -0,0 +1,28 @@
+# Exploration Tree — Attention Is All You Need (minimal example)
+# This is a simplified example. Real artifacts have 8+ nodes.
+
+tree:
+  - id: N01
+    type: question
+    title: "Can we build a competitive sequence model without recurrence?"
+    description: >
+      RNNs are sequential bottlenecks. Can self-attention alone handle
+      sequence transduction at SOTA quality?
+    children:
+
+      - id: N02
+        type: experiment
+        title: "Train Transformer on WMT 2014 EN-DE"
+        result: >
+          28.4 BLEU on EN-DE, surpassing all previously reported models,
+          including ensembles.
+        evidence: [C01, "Table 2"]
+
+      - id: N03
+        type: decision
+        title: "Use sinusoidal positional encodings"
+        choice: "Fixed sinusoidal encodings over learned embeddings"
+        alternatives:
+          - "Learned positional embeddings"
+          - "Relative position representations"
+        evidence: "Table 3 ablation shows nearly identical performance"
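A minimal sketch, outside the committed file, of the fixed sinusoidal encoding chosen in node N03, using the formula from the paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name `sinusoidal_encoding` is an illustrative assumption.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding of one position."""
    encoding = []
    for even_index in range(0, d_model, 2):
        # even_index plays the role of 2i in the formula above.
        angle = position / (10000 ** (even_index / d_model))
        encoding.append(math.sin(angle))
        if even_index + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding


print(sinusoidal_encoding(0, 4))
```

Because the encoding is a fixed function of the position, it adds no trainable parameters, which is one practical difference from the learned-embedding alternative listed in the node.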
@@ -0,0 +1,58 @@
+---
+title: "Deep Residual Learning for Image Recognition"
+authors: ["Kaiming He", "Xiangyu Zhang", "Shaoqing Ren", "Jian Sun"]
+year: 2015
+venue: "arXiv (later CVPR 2016)"
+doi: "arXiv:1512.03385"
+ara_version: "1.0"
+domain: "computer vision / deep learning"
+keywords: ["residual learning", "deep networks", "image classification", "ImageNet", "CIFAR-10", "shortcut connections", "identity mapping", "bottleneck", "ILSVRC 2015", "object detection"]
+claims_summary:
+  - "Stacking more layers in plain CNNs causes a degradation in training accuracy that is not explained by overfitting or vanishing gradients."
+  - "Reformulating layers to fit a residual mapping F(x) = H(x) - x with identity shortcuts removes the degradation and allows accuracy to grow with depth up to 152 layers on ImageNet."
+  - "Identity shortcuts are sufficient (no extra parameters); deeper bottleneck blocks make 50/101/152-layer ResNets practical and achieve 3.57% top-5 ImageNet ensemble error, winning ILSVRC 2015."
+abstract: "Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers — 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation."
+---
+
+# Deep Residual Learning for Image Recognition
+
+## Overview
+
+He et al. identify a *degradation* problem: as plain feed-forward CNNs grow deeper, both training and test error get worse — even though the deeper network's solution space contains the shallower one. They propose **residual learning**: each block fits F(x) := H(x) − x, with the original mapping recovered by adding back x via a parameter-free identity *shortcut connection*. The reformulation makes very deep networks (up to 152 layers, and even 1202 on CIFAR-10) trainable end-to-end with SGD, with accuracy that grows with depth. An ensemble of six ResNets achieved **3.57% top-5 error** on ImageNet test, winning ILSVRC 2015 classification, and ResNet-101 yielded a 28% relative mAP improvement on COCO detection over a VGG-16 baseline.
+
+This artifact captures the core ResNet slice that motivates the ARA format particularly well: deeper plain networks degrade as depth increases, residual reformulation changes the optimization problem, and evidence ties the mechanism directly to empirical gains. It binds claims, experiments, evidence, code stubs, and the failed plain-depth branch into one traversable artifact. When the paper does not present an explicit research-session log, reconstructed trace decisions are marked as inferred rather than presented as direct historical facts.
+
+## Layer Index
+
+### Cognitive Layer (`/logic`)
+| File | Description |
+|------|-------------|
+| [problem.md](logic/problem.md) | Degradation observations on CIFAR-10 and ImageNet → optimization gap → residual reformulation insight |
+| [claims.md](logic/claims.md) | 8 falsifiable claims (C01–C08) on degradation, residual easing, depth gains, shortcut design, generalization |
+| [concepts.md](logic/concepts.md) | 8 formal concepts (residual mapping, identity / projection shortcut, bottleneck block, degradation problem, plain network, BN, 10-crop testing) |
+| [experiments.md](logic/experiments.md) | 6 declarative experiment plans (E01–E06) covering plain vs. residual, depth scan, shortcut options, bottleneck, CIFAR depth, COCO transfer |
+| [solution/architecture.md](logic/solution/architecture.md) | Component graph: stem, conv stages, residual / bottleneck blocks, shortcut variants, head |
+| [solution/algorithm.md](logic/solution/algorithm.md) | Math formulation of F(x)+x, pseudocode for forward pass, complexity analysis |
+| [solution/constraints.md](logic/solution/constraints.md) | Dimension-matching constraints, when option A vs B/C applies, regularization caveats |
+| [solution/heuristics.md](logic/solution/heuristics.md) | 6 heuristics (H01–H06) — identity over projection, BN placement, warmup for very deep nets, etc. |
+| [related_work.md](logic/related_work.md) | Typed dependency graph: residual representations, shortcut connections, highway networks, baselines |
+
+### Physical Layer (`/src`)
+| File | Description | Claims |
+|------|-------------|--------|
+| [configs/training.md](src/configs/training.md) | ImageNet/CIFAR SGD hyperparameters with rationale | C02, C03, C07 |
+| [configs/model.md](src/configs/model.md) | Layer counts, channels, FLOPs for ResNet-{18,34,50,101,152} and CIFAR variants | C03, C05 |
+| [configs/imagenet_resnet34.yaml](src/configs/imagenet_resnet34.yaml) | Concrete config for the 34-layer ImageNet ResNet | C02 |
+| [execution/residual_block.py](src/execution/residual_block.py) | Basic and bottleneck residual blocks with identity / projection shortcut options | C01, C02, C04, C05 |
+| [execution/training_recipe.py](src/execution/training_recipe.py) | SGD + step LR schedule + BN-after-conv recipe for ImageNet ResNets | C02, C03 |
+| [environment.md](src/environment.md) | Framework, hardware, augmentation, seed assumptions | C02, C07 |
+
+### Exploration Graph (`/trace`)
+| File | Description |
+|------|-------------|
+| [exploration_tree.yaml](trace/exploration_tree.yaml) | 14-node research DAG with explicit / inferred provenance, dead ends and decisions |
+
+### Evidence (`/evidence`)
+| File | Description |
+|------|-------------|
+| [README.md](evidence/README.md) | Index of 8 raw tables / 4 figures / 1 derived subset and the claims they support |
@@ -0,0 +1,24 @@
+# Evidence Index
+
+## Tables
+
+| File | Source | Claims | Description |
+|------|--------|--------|-------------|
+| [tables/table1_imagenet_architectures.md](tables/table1_imagenet_architectures.md) | Table 1, §3.3 | C03, C05 | Per-depth ImageNet ResNet architectures — block layouts and FLOPs for ResNet-{18, 34, 50, 101, 152}. |
+| [tables/table2_imagenet_plain_vs_residual.md](tables/table2_imagenet_plain_vs_residual.md) | Table 2, §4.1 | C01, C02 | Top-1 ImageNet validation error — plain-{18, 34} vs. ResNet-{18, 34} with 10-crop testing. |
+| [tables/table3_imagenet_validation_full.md](tables/table3_imagenet_validation_full.md) | Table 3, §4.1 | C01, C02, C03, C04, C05 | Full ImageNet validation error table (10-crop): VGG-16, GoogLeNet, PReLU-net, plain-34, ResNet-{34A, 34B, 34C, 50, 101, 152}. |
+| [tables/derived_from_table3_shortcut_options.md](tables/derived_from_table3_shortcut_options.md) | Derived from Table 3 | C04 | Subset of Table 3 isolating ResNet-34 shortcut options A / B / C plus plain-34 baseline. |
+| [tables/table4_imagenet_singlemodel.md](tables/table4_imagenet_singlemodel.md) | Table 4, §4.1 | C03 | ImageNet single-model validation error (multi-scale fully-convolutional testing). |
+| [tables/table5_imagenet_ensembles.md](tables/table5_imagenet_ensembles.md) | Table 5, §4.1 | C03 | ImageNet ensemble top-5 error on the test set — ResNet 6-model ensemble achieves 3.57%. |
+| [tables/table6_cifar10.md](tables/table6_cifar10.md) | Table 6, §4.2 | C06, C07 | CIFAR-10 test error vs. depth for ResNet-{20, 32, 44, 56, 110, 1202} with baselines. |
+| [tables/table7_pascal_voc_detection.md](tables/table7_pascal_voc_detection.md) | Table 7, §4.3 | C08 | PASCAL VOC 07/12 detection mAP with baseline Faster R-CNN — VGG-16 vs. ResNet-101. |
+| [tables/table8_coco_detection.md](tables/table8_coco_detection.md) | Table 8, §4.3 | C08 | COCO val detection mAP with baseline Faster R-CNN — VGG-16 vs. ResNet-101. |
+
+## Figures
+
+| File | Source | Claims | Description |
+|------|--------|--------|-------------|
+| [figures/figure1_cifar_plain_curves.md](figures/figure1_cifar_plain_curves.md) | Figure 1, §1 | C01 | CIFAR-10 training (left) and test (right) error curves for plain-{20, 56} — illustrates degradation. |
+| [figures/figure4_imagenet_curves.md](figures/figure4_imagenet_curves.md) | Figure 4, §4.1 | C01, C02 | ImageNet training/validation curves: plain-{18, 34} (left) vs. ResNet-{18, 34} (right). |
+| [figures/figure6_cifar_curves.md](figures/figure6_cifar_curves.md) | Figure 6, §4.2 | C06 | CIFAR-10 training/test curves for plain (left), ResNet-{20, 32, 44, 56, 110} (middle), and ResNet-{110, 1202} (right). |
+| [figures/figure7_layer_response_std.md](figures/figure7_layer_response_std.md) | Figure 7, §4.2 | C02 (Fig. 7 supports the "responses closer to zero" interpretation referenced under O5) | Layer-response std on CIFAR-10 for plain-{20, 56} and ResNet-{20, 56, 110}, in original layer order (top) and ranked by magnitude (bottom). |
@@ -0,0 +1,16 @@
+# Figure 1: CIFAR-10 training/test error for plain-{20, 56}
+
+**Source**: Figure 1, §1
+**Caption**: "Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer 'plain' networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4."
+**Axes**: X = iterations (×10⁴), Y = error (%)
+**Extraction type**: figure_summary
+
+The figure is a 2-panel plot (left = training error, right = test error). Exact per-iteration data points are not tabulated in the paper. Qualitative readings:
+
+| iter range | plain-20 train err. | plain-56 train err. | plain-20 test err. | plain-56 test err. |
+|------------|---------------------|---------------------|--------------------|--------------------|
+| early (≤1×10⁴) | ≈ high (>40%) | ≈ high (>40%) | ≈ high (>40%) | ≈ high (>40%) |
+| mid  (~3×10⁴, after first LR drop) | drops sharply to ≈ 20% | drops to ≈ 25–30% | similar drop to ≈ 20–25% | drops to ≈ 25–30% |
+| late (~6×10⁴) | ≈ 5% | ≈ 10% | ≈ 8–10% | ≈ 13–15% |
+
+**Key observation**: throughout training, the 56-layer plain net's curves lie *above* the 20-layer plain net's, demonstrating the degradation problem (referenced as O1 in `/logic/problem.md`).
@@ -0,0 +1,21 @@
+# Figure 4: ImageNet training and validation curves — plain vs. ResNet at 18/34 layers
+
+**Source**: Figure 4, §4.1
+**Caption**: "Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts."
+**Axes**: X = iterations (×10⁴, axis runs 0–50), Y = error (%) (axis runs ~20–60)
+**Extraction type**: figure_summary
+
+Per-iteration values are not tabulated by the paper. Qualitative shape (read off from the figure):
+
+### Left panel — plain-18 (cyan) vs. plain-34 (red)
+- Both curves descend in three plateaus separated by LR drops at ≈ 20×10⁴ and ≈ 40×10⁴.
+- **plain-34 lies above plain-18 throughout training** (both training and validation curves), so the degradation is visible during optimization and not only at convergence.
+- Final readings (≈ 50×10⁴): plain-18 validation ≈ 28%, plain-34 validation ≈ 28.5% (center-crop curves; same ordering as the 10-crop values in Table 2: 27.94 vs. 28.54).
+
+### Right panel — ResNet-18 (cyan) vs. ResNet-34 (red)
+- Both curves descend in similar three-plateau fashion.
+- **ResNet-34 lies below ResNet-18 throughout training** — the order is *reversed* from the plain case.
+- ResNet-18 also converges *faster* than plain-18 in the early phase (residual easing of optimization), reaching low error sooner even though final accuracy is comparable.
+- Final readings (≈ 50×10⁴): ResNet-18 validation ≈ 27.9%, ResNet-34 validation ≈ 25% (center-crop curves; same ordering as the 10-crop values in Table 2: 27.88 vs. 25.03).
+
+**Key observation**: The plain pair shows degradation; the residual pair removes it (cited under C01, C02).
@@ -0,0 +1,20 @@
+# Figure 6: CIFAR-10 training/test curves for plain and ResNet families
+
+**Source**: Figure 6, §4.2
+**Caption**: "Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers."
+**Axes**: X = iterations (×10⁴), Y = error (%)
+**Extraction type**: figure_summary
+
+### Left — plain-{20, 32, 44, 56, 110}
+- Error curves diverge with depth: deeper plain nets have *higher* training and test error in the late phase.
+- Plain-110 fails badly (>60% error throughout) and is not shown — strong evidence for degradation at very large plain-net depth.
+
+### Middle — ResNet-{20, 32, 44, 56, 110}
+- Curves stack monotonically: deeper ResNets achieve lower training and test error.
+- Final test error matches Table 6 (8.75 → 7.51 → 7.17 → 6.97 → 6.43 for depths 20 → 110).
+
+### Right — ResNet-{110, 1202}
+- ResNet-1202 trains successfully — final training error <0.1% (text §"Exploring Over 1000 layers").
+- ResNet-1202 test error is *higher* than ResNet-110 (7.93 vs. 6.43), consistent with overfitting on a 50k-image dataset given a 19.4M-parameter model.
+
+**Key observation**: ResNet families overcome the plain-net degradation, and even 1202 layers train cleanly — but extreme depth overfits without explicit regularization on this small dataset (cited under C06).
@@ -0,0 +1,16 @@
+# Figure 7: CIFAR-10 layer-response std (after BN, before nonlinearity)
+
+**Source**: Figure 7, §4.2
+**Caption**: "Standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3×3 layer, after BN and before nonlinearity. Top: the layers are shown in their original order. Bottom: the responses are ranked in descending order."
+**Axes**: X (top) = layer index in original order; X (bottom) = layer index ranked by magnitude. Y = std of activations.
+**Extraction type**: figure_summary
+
+The figure overlays five series: plain-20, plain-56, ResNet-20, ResNet-56, ResNet-110.
+
+Qualitative observations from the plot (numerical curves are not tabulated):
+
+- **ResNet response stds are smaller than plain-net stds at corresponding layer indices**, supporting the paper's argument that residual functions are typically closer to zero than non-residual functions.
+- **Deeper ResNets have smaller per-layer response magnitudes**: ResNet-110 < ResNet-56 < ResNet-20. The paper interprets this as "an individual layer of ResNets tends to modify the signal less" when more layers are available.
+- The ranking (bottom plot) shows that even the largest-magnitude residual layers in ResNet-110 are smaller than those in ResNet-20 / ResNet-56.
+
+**Key observation**: Empirical support for the prior that optimal mappings sit near identity (used as motivation for the residual reformulation; cited under O5 and as supporting interpretation for C02).
@@ -0,0 +1,18 @@
+# Derived subset — ResNet-34 shortcut option ablation
+
+**Source**: Derived from Table 3 in "Deep Residual Learning for Image Recognition"
+**Caption**: Subset preserving the rows directly relevant to C04 (shortcut option ablation): plain-34 baseline + ResNet-34 with options A / B / C. Other Table 3 rows are intentionally omitted.
+**Extraction type**: derived_subset
+**Derived from**: `table3_imagenet_validation_full.md`
+
+| model       | shortcut option | top-1 err. | top-5 err. | Δ vs. plain-34 (top-1) |
+|-------------|-----------------|------------|------------|------------------------|
+| plain-34    | none            | 28.54      | 10.02      | —                      |
+| ResNet-34 A | identity + zero-pad | 25.03  | 7.76       | −3.51                  |
+| ResNet-34 B | projection on dim change only | 24.52 | 7.46 | −4.02                  |
+| ResNet-34 C | projection on every shortcut | 24.19 | 7.40 | −4.35                  |
+
+Per-paper interpretation (§"Identity vs. Projection Shortcuts"):
+- All three options are considerably better than plain-34.
+- B is slightly better than A; the paper attributes this to A's zero-padded extra dimensions carrying "no residual learning."
+- C is marginally better than B; the paper attributes this to extra parameters from the 13 projection shortcuts and rejects C as not essential for fixing degradation.
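A minimal sketch, outside the committed file, that recomputes the Δ column and the pairwise gaps between options from the top-1 values in the table above. The variable and function names are illustrative assumptions.

```python
# Top-1 error (%) copied from the derived table above.
top1 = {
    "plain-34": 28.54,
    "ResNet-34 A": 25.03,
    "ResNet-34 B": 24.52,
    "ResNet-34 C": 24.19,
}


def delta_vs_plain(model):
    """Top-1 difference against plain-34, rounded to two decimals."""
    return round(top1[model] - top1["plain-34"], 2)


def gap(worse, better):
    """Top-1 gap between two shortcut options, rounded to two decimals."""
    return round(top1[worse] - top1[better], 2)


for name in ("ResNet-34 A", "ResNet-34 B", "ResNet-34 C"):
    print(name, delta_vs_plain(name))

print("A vs B:", gap("ResNet-34 A", "ResNet-34 B"))
print("B vs C:", gap("ResNet-34 B", "ResNet-34 C"))
print("A vs C:", gap("ResNet-34 A", "ResNet-34 C"))
```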
@@ -0,0 +1,15 @@
+# Table 1: Architectures for ImageNet
+
+**Source**: Table 1, §3.3
+**Caption**: "Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2."
+**Extraction type**: raw_table
+
+| layer name | output size | 18-layer | 34-layer | 50-layer | 101-layer | 152-layer |
+|------------|-------------|----------|----------|----------|-----------|-----------|
+| conv1   | 112×112 | 7×7, 64, stride 2 | 7×7, 64, stride 2 | 7×7, 64, stride 2 | 7×7, 64, stride 2 | 7×7, 64, stride 2 |
+| conv2_x | 56×56   | 3×3 max-pool, stride 2; [3×3, 64; 3×3, 64] ×2 | 3×3 max-pool, stride 2; [3×3, 64; 3×3, 64] ×3 | 3×3 max-pool, stride 2; [1×1, 64; 3×3, 64; 1×1, 256] ×3 | 3×3 max-pool, stride 2; [1×1, 64; 3×3, 64; 1×1, 256] ×3 | 3×3 max-pool, stride 2; [1×1, 64; 3×3, 64; 1×1, 256] ×3 |
+| conv3_x | 28×28   | [3×3, 128; 3×3, 128] ×2 | [3×3, 128; 3×3, 128] ×4 | [1×1, 128; 3×3, 128; 1×1, 512] ×4 | [1×1, 128; 3×3, 128; 1×1, 512] ×4 | [1×1, 128; 3×3, 128; 1×1, 512] ×8 |
+| conv4_x | 14×14   | [3×3, 256; 3×3, 256] ×2 | [3×3, 256; 3×3, 256] ×6 | [1×1, 256; 3×3, 256; 1×1, 1024] ×6 | [1×1, 256; 3×3, 256; 1×1, 1024] ×23 | [1×1, 256; 3×3, 256; 1×1, 1024] ×36 |
+| conv5_x | 7×7     | [3×3, 512; 3×3, 512] ×2 | [3×3, 512; 3×3, 512] ×3 | [1×1, 512; 3×3, 512; 1×1, 2048] ×3 | [1×1, 512; 3×3, 512; 1×1, 2048] ×3 | [1×1, 512; 3×3, 512; 1×1, 2048] ×3 |
+|         | 1×1     | average pool, 1000-d fc, softmax | average pool, 1000-d fc, softmax | average pool, 1000-d fc, softmax | average pool, 1000-d fc, softmax | average pool, 1000-d fc, softmax |
+| FLOPs   |         | 1.8×10⁹ | 3.6×10⁹ | 3.8×10⁹ | 7.6×10⁹ | 11.3×10⁹ |
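A minimal sketch, outside the committed file, that checks the block counts in the table against the model names. It assumes the usual convention that the name counts weighted layers only: conv1, the convolutions inside the stacked blocks, and the final fc layer, with pooling layers and projection shortcuts left out. The names `ARCHS` and `weighted_layers` are illustrative.

```python
# depth -> (convolutions per block, blocks in conv2_x .. conv5_x), from Table 1
ARCHS = {
    18: (2, [2, 2, 2, 2]),
    34: (2, [3, 4, 6, 3]),
    50: (3, [3, 4, 6, 3]),
    101: (3, [3, 4, 23, 3]),
    152: (3, [3, 8, 36, 3]),
}


def weighted_layers(convs_per_block, blocks):
    """conv1 + all block convolutions + the 1000-d fc layer."""
    return 1 + convs_per_block * sum(blocks) + 1


for depth, (convs_per_block, blocks) in ARCHS.items():
    print(depth, weighted_layers(convs_per_block, blocks))
```

ResNet-34 and ResNet-50 share the block layout [3, 4, 6, 3]; the extra 16 layers come from swapping each 2-layer block for a 3-layer bottleneck.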
@@ -0,0 +1,10 @@
+# Table 2: Top-1 ImageNet validation error — plain vs. ResNet at 18 / 34 layers
+
+**Source**: Table 2, §4.1
+**Caption**: "Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameter compared to their plain counterparts. Fig. 4 shows the training procedures."
+**Extraction type**: raw_table
+
+|            | plain | ResNet |
+|------------|-------|--------|
+| 18 layers  | 27.94 | 27.88  |
+| 34 layers  | 28.54 | **25.03** |
@@ -0,0 +1,18 @@
+# Table 3: Error rates on ImageNet validation (10-crop testing)
+
+**Source**: Table 3, §4.1
+**Caption**: "Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 are of option B that only uses projections for increasing dimensions."
+**Extraction type**: raw_table
+
+| model            | top-1 err. | top-5 err. |
+|------------------|------------|------------|
+| VGG-16 [41]      | 28.07      | 9.33       |
+| GoogLeNet [44]   | —          | 9.15       |
+| PReLU-net [13]   | 24.27      | 7.38       |
+| plain-34         | 28.54      | 10.02      |
+| ResNet-34 A      | 25.03      | 7.76       |
+| ResNet-34 B      | 24.52      | 7.46       |
+| ResNet-34 C      | 24.19      | 7.40       |
+| ResNet-50        | 22.85      | 6.71       |
+| ResNet-101       | 21.75      | 6.05       |
+| ResNet-152       | **21.43**  | **5.71**   |
@@ -0,0 +1,18 @@
+# Table 4: Single-model results on ImageNet validation
+
+**Source**: Table 4, §4.1
+**Caption**: "Error rates (%) of single-model results on the ImageNet validation set (except † reported on the test set)."
+**Extraction type**: raw_table
+
+| method            | top-1 err. | top-5 err. |
+|-------------------|------------|------------|
+| VGG [41] (ILSVRC'14) | —       | 8.43†      |
+| GoogLeNet [44] (ILSVRC'14) | — | 7.89    |
+| VGG [41] (v5)     | 24.4       | 7.1        |
+| PReLU-net [13]    | 21.59      | 5.71       |
+| BN-inception [16] | 21.99      | 5.81       |
+| ResNet-34 B       | 21.84      | 5.71       |
+| ResNet-34 C       | 21.53      | 5.60       |
+| ResNet-50         | 20.74      | 5.25       |
+| ResNet-101        | 19.87      | 4.60       |
+| ResNet-152        | **19.38**  | **4.49**   |
@@ -0,0 +1,14 @@
+# Table 5: ImageNet ensembles (top-5 error on the test set)
+
+**Source**: Table 5, §4.1
+**Caption**: "Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server."
+**Extraction type**: raw_table
+
+| method                       | top-5 err. (test) |
+|------------------------------|-------------------|
+| VGG [41] (ILSVRC'14)         | 7.32              |
+| GoogLeNet [44] (ILSVRC'14)   | 6.66              |
+| VGG [41] (v5)                | 6.8               |
+| PReLU-net [13]               | 4.94              |
+| BN-inception [16]            | 4.82              |
+| **ResNet (ILSVRC'15)**       | **3.57**          |
@@ -0,0 +1,23 @@
+# Table 6: Classification error on the CIFAR-10 test set
+
+**Source**: Table 6, §4.2
+**Caption**: "Classification error on the CIFAR-10 test set. All methods are with data augmentation. For ResNet-110, we run it 5 times and show 'best (mean ± std)' as in [43]."
+**Extraction type**: raw_table
+
+| method        | error (%)            |
+|---------------|----------------------|
+| Maxout [10]   | 9.38                 |
+| NIN [25]      | 8.81                 |
+| DSN [24]      | 8.22                 |
+
+| method        | # layers | # params | error (%)            |
+|---------------|---------:|---------:|----------------------|
+| FitNet [35]   | 19       | 2.5M     | 8.39                 |
+| Highway [42, 43] | 19    | 2.3M     | 7.54 (7.72 ± 0.16)   |
+| Highway [42, 43] | 32    | 1.25M    | 8.80                 |
+| ResNet        | 20       | 0.27M    | 8.75                 |
+| ResNet        | 32       | 0.46M    | 7.51                 |
+| ResNet        | 44       | 0.66M    | 7.17                 |
+| ResNet        | 56       | 0.85M    | 6.97                 |
+| ResNet        | 110      | 1.7M     | **6.43 (6.61 ± 0.16)** |
+| ResNet        | 1202     | 19.4M    | 7.93                 |
@@ -0,0 +1,11 @@
+# Table 7: PASCAL VOC 2007 / 2012 detection (baseline Faster R-CNN)
+
+**Source**: Table 7, §4.3
+**Caption**: "Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets using baseline Faster R-CNN. See also Table 10 and 11 for better results."
+**Extraction type**: raw_table
+
+| training data | 07+12 | 07++12 |
+|---------------|-------|--------|
+| test data     | VOC 07 test | VOC 12 test |
+| VGG-16        | 73.2  | 70.4   |
+| ResNet-101    | **76.4** | **73.8** |
@@ -0,0 +1,10 @@
+# Table 8: COCO detection (baseline Faster R-CNN)
+
+**Source**: Table 8, §4.3
+**Caption**: "Object detection mAP (%) on the COCO validation set using baseline Faster R-CNN. See also Table 9 for better results."
+**Extraction type**: raw_table
+
+| metric    | mAP@.5 | mAP@[.5, .95] |
+|-----------|--------|---------------|
+| VGG-16    | 41.5   | 21.2          |
+| ResNet-101 | **48.4** | **27.2**   |
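A minimal sketch, outside the committed file, of the arithmetic behind the "28% relative improvement" quoted from the paper's abstract, using the mAP@[.5, .95] column above. The helper name `relative_gain` is an illustrative assumption.

```python
def relative_gain(baseline, improved):
    """Fractional improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# mAP@[.5, .95] from Table 8: VGG-16 = 21.2, ResNet-101 = 27.2
absolute = 27.2 - 21.2
relative = relative_gain(21.2, 27.2)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1%}")
```

The absolute gain is 6.0 points; dividing by the VGG-16 baseline of 21.2 gives roughly 28%.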
@@ -0,0 +1,81 @@
+# Claims
+
+## C01: Plain CNNs exhibit a depth-induced degradation problem
+- **Statement**: For sufficiently deep "plain" CNNs (no shortcuts), increasing depth strictly increases *training* error on both CIFAR-10 and ImageNet, even with BN and competent initialization.
+- **Status**: supported
+- **Falsification criteria**: A controlled depth scan (e.g. plain-{18, 34, 56, 110}) trained with BN and standard SGD in which deeper models show monotonically lower or equal training error.
+- **Proof**: [E01]
+- **Evidence basis**: Table 2 (plain-18 = 27.94%, plain-34 = 28.54% top-1 ImageNet val); Fig. 4 left (the plain-34 training-error curve lies above the plain-18 curve throughout training); Fig. 6 left for plain-{20, 32, 44, 56, 110} on CIFAR-10.
+- **Interpretation**: The authors argue (but do not formally prove) that this reflects an *optimization* difficulty rather than overfitting or vanishing gradients.
+- **Dependencies**: none
+- **Tags**: degradation, optimization, depth-scaling, plain-baseline
+
+## C02: Residual learning eliminates the degradation problem
+- **Statement**: Replacing each pair of stacked 3×3 layers in a plain net with a residual block F(x) + x (with identity shortcut, no extra parameters) makes the deeper variant achieve *lower* training and validation error than the shallower one for matched depths.
+- **Status**: supported
+- **Falsification criteria**: Under the same depth/width/training pipeline, ResNet-34 fails to improve over ResNet-18 on ImageNet validation, or ResNet-34 has higher training error than plain-34.
+- **Proof**: [E01, E02]
+- **Evidence basis**: Table 2 (ResNet-18 = 27.88, ResNet-34 = 25.03 top-1 ImageNet val; ResNet-34 better than ResNet-18 by 2.85 pts; ResNet-34 better than plain-34 by 3.51 pts); Fig. 4 right (training-error curves of ResNet-34 lie below ResNet-18 throughout training).
+- **Interpretation**: The result is consistent with the hypothesis that residual reformulation makes the optimization landscape easier to traverse, but does not by itself prove a representational advantage.
+- **Dependencies**: C01
+- **Tags**: residual-learning, identity-shortcut, optimization
+
+## C03: Residual networks gain accuracy from increased depth up to 152 layers on ImageNet
+- **Statement**: Deeper ResNets (50, 101, 152) achieve monotonically lower top-1 and top-5 ImageNet validation error than shallower ResNets, with the 152-layer model still having lower complexity (11.3 GFLOPs) than VGG-16/19 (15.3/19.6 GFLOPs).
+- **Status**: supported
+- **Falsification criteria**: A ResNet-152 trained with the same recipe fails to improve top-1 over ResNet-101 (or ResNet-101 over ResNet-50) by a margin larger than the noise floor (~0.1%).
+- **Proof**: [E03]
+- **Evidence basis**: Table 3 (ResNet-50 = 22.85 / 6.71, ResNet-101 = 21.75 / 6.05, ResNet-152 = 21.43 / 5.71 top-1 / top-5 with 10-crop testing; FLOPs from Table 1).
+- **Interpretation**: Within the depths studied, accuracy improves consistently with depth and no degradation is observed. Because the deeper bottleneck models also use more FLOPs (Table 1), the gain cannot be attributed to depth in isolation from capacity.
+- **Dependencies**: C02
+- **Tags**: depth-scaling, imagenet, complexity
+
+## C04: Identity shortcuts are sufficient; projection shortcuts give only marginal gains
+- **Statement**: Among shortcut options A (zero-padding identity), B (projection only when dimensions change), and C (projection on every shortcut), the differences in ImageNet top-1 error are small (at most 0.84 pts on ResNet-34, between A and C); identity shortcuts (A) suffice to fix degradation, and option C is rejected as not worth its parameter / memory cost.
+- **Status**: supported
+- **Falsification criteria**: A controlled comparison in which option C beats option A or B by more than ~1 top-1 point under identical training, indicating projection shortcuts are essential rather than convenience.
+- **Proof**: [E04]
+- **Evidence basis**: Table 3 (ResNet-34 A = 25.03, B = 24.52, C = 24.19 top-1 with 10-crop); §"Identity vs. Projection Shortcuts" attributes the small B>A gap to A's zero-padded dimensions having "no residual learning" and the small C>B gap to extra parameters from 13 projection shortcuts.
+- **Interpretation**: Identity shortcuts are the right default for parameter efficiency; option B, which adds projections only where dimensions change, is the variant used for the deeper bottleneck nets (ResNet-50/101/152).
+- **Dependencies**: C02
+- **Tags**: shortcut-design, ablation
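+
+The pairwise gaps behind this claim can be recomputed directly from the Table 3 numbers in the evidence line. The sketch below is only a sanity check; `top1`, `gaps`, and `THRESHOLD` are illustrative names, and the 1-point threshold is the approximate value stated in the falsification criteria.
+
```python
from itertools import combinations

# ResNet-34 top-1 error (%) per shortcut option, as quoted from Table 3.
top1 = {"A": 25.03, "B": 24.52, "C": 24.19}
THRESHOLD = 1.0  # approximate top-1 margin from the falsification criteria

# Gap between every pair of options, e.g. "A-C" is how much C improves on A.
gaps = {
    f"{x}-{y}": round(top1[x] - top1[y], 2)
    for x, y in combinations("ABC", 2)
}
largest_gap = max(gaps.values())
falsified = gaps["A-C"] > THRESHOLD or gaps["B-C"] > THRESHOLD
print(gaps, largest_gap, falsified)
```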
+
+## C05: Bottleneck blocks make 50/101/152-layer ResNets practical
+- **Statement**: Replacing the 2-layer 3×3 building block with a 3-layer 1×1 → 3×3 → 1×1 *bottleneck* block of similar per-block time complexity allows construction of 50/101/152-layer ResNets that achieve lower error than the 34-layer ResNet without exploding compute.
+- **Status**: supported
+- **Falsification criteria**: A 50- or 101-layer non-bottleneck ResNet matches the bottleneck ResNet at equal compute, eliminating the practical need for bottlenecks; or bottleneck ResNets fail to improve over ResNet-34.
+- **Proof**: [E03]
+- **Evidence basis**: Table 1 (3.8/7.6/11.3 GFLOPs for ResNet-{50,101,152}; ResNet-50 is comparable to ResNet-34 at 3.6 GFLOPs, and even ResNet-152 stays below VGG-16's 15.3 GFLOPs); Table 3 top-1 error drops from 24.19 (ResNet-34 C) to 22.85 / 21.75 / 21.43 for ResNet-50/101/152.
+- **Interpretation**: Identity shortcuts are particularly important for bottleneck designs because a projection shortcut on a bottleneck doubles its time complexity and model size (§"Deeper Bottleneck Architectures").
+- **Dependencies**: C02, C04
+- **Tags**: bottleneck, architecture, complexity
+
+## C06: Residual nets generalize to extreme CIFAR-10 depths (110 layers; 1202 layers without optimization difficulty)
+- **Statement**: On CIFAR-10, ResNets at depths {20, 32, 44, 56, 110} all train successfully, with the 110-layer model achieving 6.43% test error in its best run (mean ± std 6.61 ± 0.16 over 5 runs); a 1202-layer ResNet trains with no optimization difficulty (final training error <0.1%) although it overfits to 7.93% test error on this small dataset.
+- **Status**: supported
+- **Falsification criteria**: A CIFAR ResNet at depth ≥110 trained with the same recipe fails to converge to <10% test error, or its training error fails to decrease below the 56-layer model's.
+- **Proof**: [E05]
+- **Evidence basis**: Table 6 (ResNet-{20=8.75, 32=7.51, 44=7.17, 56=6.97, 110=6.43, 1202=7.93}% test error); §"Exploring Over 1000 layers" notes 1202-layer training error <0.1% with no optimization difficulty.
+- **Interpretation**: The 1202-layer model worsens on test only because of overfitting on a 50k-image dataset, not because optimization breaks down.
+- **Dependencies**: C02, C03
+- **Tags**: cifar-10, extreme-depth, generalization, overfitting
+
+## C07: Learning-rate warmup helps the 110-layer CIFAR ResNet start converging
+- **Statement**: A 110-layer ResNet on CIFAR-10 fails to start converging cleanly with the default initial LR of 0.1; warming up at LR 0.01 for ~400 iterations until training error drops below ~80%, then restoring LR 0.1, restores convergence.
+- **Status**: supported
+- **Falsification criteria**: Training a 110-layer ResNet on CIFAR-10 from scratch at LR 0.1 from iteration 0 starts converging as early as the warmup recipe does, with no initial stretch of near-chance (≥90%) training error.
+- **Proof**: [E05]
+- **Evidence basis**: §4.2 paragraph on n=18 (110-layer): "0.1 is slightly too large to start converging" with footnote 5 noting LR 0.1 reaches similar accuracy after several epochs of >90% error but the warmup variant is the chosen recipe.
+- **Interpretation**: Warmup is a stability heuristic, not a fundamental requirement of residual learning — only a minor optimization aid for very deep CIFAR variants.
+- **Dependencies**: C06
+- **Tags**: training-recipe, warmup, very-deep
+
+## C08: ResNet representations transfer to detection, giving large COCO gains over VGG-16
+- **Statement**: Replacing the VGG-16 backbone with ResNet-101 in baseline Faster R-CNN improves COCO val mAP@[.5,.95] from 21.2 to 27.2, a 6.0-point absolute (28% relative) increase, attributed solely to the better learned representations.
+- **Status**: supported
+- **Falsification criteria**: A controlled VGG-16 → ResNet-101 swap in Faster R-CNN with the same hyperparameters fails to improve COCO mAP@[.5,.95] by ≥3 absolute points.
+- **Proof**: [E06]
+- **Evidence basis**: Table 8 (baseline Faster R-CNN: VGG-16 = 41.5 mAP@.5 / 21.2 mAP@[.5,.95]; ResNet-101 = 48.4 / 27.2 on COCO val); Table 7 (PASCAL VOC 07 mAP: 73.2 → 76.4; VOC 12: 70.4 → 73.8).
+- **Interpretation**: Depth of representations matters not only for classification but for downstream localization-sensitive tasks; mAP@[.5,.95] (which rewards tighter boxes) gains ≈ mAP@.5 gains, suggesting deeper features help both recognition and localization.
+- **Dependencies**: C03
+- **Tags**: transfer-learning, object-detection, coco, pascal-voc, faster-rcnn
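+
+The absolute and relative gains quoted in the statement follow from the Table 8 baseline rows. A minimal sketch of that arithmetic, with illustrative dictionary names:
+
```python
# COCO val results for baseline Faster R-CNN, as quoted from Table 8.
vgg16 = {"mAP@.5": 41.5, "mAP@[.5,.95]": 21.2}
resnet101 = {"mAP@.5": 48.4, "mAP@[.5,.95]": 27.2}

# metric -> (absolute gain in points, relative gain in percent)
gains = {}
for metric, baseline in vgg16.items():
    delta = resnet101[metric] - baseline
    gains[metric] = (round(delta, 1), round(100 * delta / baseline, 1))

print(gains)
```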
@@ -0,0 +1,49 @@
+# Concepts
+
+## Residual Mapping
+- **Notation**: $\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}$, with the original mapping recovered as $\mathcal{H}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}$.
+- **Definition**: An auxiliary mapping fit by a few stacked nonlinear layers, expressing the *difference* between a desired underlying mapping $\mathcal{H}$ and its input $\mathbf{x}$. The block as a whole computes $\mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$.
+- **Boundary conditions**: Only well-defined for blocks whose input and output have matching dimensions; for dimension changes a linear projection $W_s$ is introduced (Eqn. 2). A single-layer $\mathcal{F}$ degenerates to a linear layer with no observed advantage; $\mathcal{F}$ should have ≥2 layers (§3.2).
+- **Related concepts**: Identity Shortcut, Projection Shortcut, Plain Network.
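+
+To make the definition concrete, here is a minimal pure-Python sketch of a two-layer residual block next to its plain counterpart. It uses small dense layers instead of convolutions and omits biases, BN, and the ReLU that follows the addition in the paper's block; all function names are illustrative. With all weights at zero the residual block returns its input unchanged, whereas the plain block collapses to zero, which is the sense in which identity is "easy" for the residual form.
+
```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def plain_block(x, w1, w2):
    """H(x) fitted directly: W2 . relu(W1 . x)."""
    return matvec(w2, relu(matvec(w1, x)))


def residual_block(x, w1, w2):
    """y = F(x) + x, where F is the same two-layer stack as plain_block."""
    f = plain_block(x, w1, w2)
    return [fi + xi for fi, xi in zip(f, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

identity_out = residual_block(x, zeros, zeros)  # F = 0, so y = x
plain_out = plain_block(x, zeros, zeros)        # all zeros
print(identity_out, plain_out)
```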
+
+## Identity Shortcut (Option A)
+- **Notation**: $\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$.
+- **Definition**: A parameter-free skip connection that adds the input of a residual block directly to its output. When dimensions increase across the shortcut, missing channels are filled with zero padding (no learnable parameters).
+- **Boundary conditions**: Requires $\dim(\mathbf{x}) = \dim(\mathcal{F}(\mathbf{x}))$, or zero-padding for the extra channels. Adds neither parameters nor computational cost beyond an element-wise add.
+- **Related concepts**: Projection Shortcut, Residual Mapping, Bottleneck Block.
+
+## Projection Shortcut (Option B / Option C)
+- **Notation**: $\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x}$.
+- **Definition**: A 1×1 convolutional shortcut that linearly projects $\mathbf{x}$ to match the output dimension. Option B uses projections only when dimensions change; option C uses projections on every shortcut.
+- **Boundary conditions**: $W_s$ adds parameters and FLOPs proportional to channel count. In bottleneck blocks, replacing identity with a projection roughly *doubles* the block's time complexity and model size, so it is reserved for dimension-changing shortcuts.
+- **Related concepts**: Identity Shortcut, Bottleneck Block.
+
+## Bottleneck Residual Block
+- **Notation**: 1×1, $C_\text{in} \to C_\text{mid}$ → 3×3, $C_\text{mid} \to C_\text{mid}$ → 1×1, $C_\text{mid} \to C_\text{out}$, with $C_\text{out} = 4 C_\text{mid}$ (Fig. 5 right).
+- **Definition**: A 3-layer residual building block whose first 1×1 reduces channels, the 3×3 operates on the bottleneck, and the second 1×1 restores the high-dimensional output. Used in ResNet-50/101/152 in place of the 2-layer 3×3 block.
+- **Boundary conditions**: Designed to keep per-block compute comparable to the 2-layer block while permitting much greater depth. Identity shortcuts are particularly important here because projection shortcuts on the high-dimensional ends are expensive.
+- **Related concepts**: Residual Mapping, Identity Shortcut, Plain Network.
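+
+A back-of-the-envelope weight count illustrates both boundary conditions. The sketch assumes channel widths of 64 for the 2-layer block and 256 → 64 → 64 → 256 for the bottleneck (consistent with $C_\text{out} = 4 C_\text{mid}$), ignores biases and BN parameters, and treats weights per spatial position as a proxy for compute at equal resolution. `conv_weights` is an illustrative helper.
+
```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


# 2-layer basic block: two 3x3 convs on 64 channels.
basic = 2 * conv_weights(3, 64, 64)

# 3-layer bottleneck: 1x1 reduce, 3x3 on the narrow width, 1x1 restore.
bottleneck = (
    conv_weights(1, 256, 64)
    + conv_weights(3, 64, 64)
    + conv_weights(1, 64, 256)
)

# A 1x1 projection shortcut across the 256-channel ends of the bottleneck.
projection = conv_weights(1, 256, 256)
cost_ratio = (bottleneck + projection) / bottleneck

print(basic, bottleneck, projection, round(cost_ratio, 2))
```
+
+Under these assumptions the two block types have a similar weight count, and adding a projection shortcut to the bottleneck nearly doubles it.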
+
+## Degradation Problem
+- **Notation**: For network depth $d$, training error $\epsilon_\text{train}(d)$ ceases to be monotonically non-increasing as $d$ grows.
+- **Definition**: An empirical phenomenon where deeper plain CNNs reach *higher* training error than shallower counterparts, even though the shallower net's solution space is nominally a subspace of the deeper net's (§1).
+- **Boundary conditions**: Observed in plain networks even when BN, MSRA initialization, and SGD with momentum are used. The paper attributes it to *optimization* difficulty, not vanishing/exploding gradients (§4.1).
+- **Related concepts**: Plain Network, Residual Mapping, Identity Mapping by Shortcuts.
+
+## Plain Network
+- **Notation**: Sequential composition of conv → BN → ReLU layers without skip connections, sharing the VGG-style design rules: same output map size ⇒ same #filters; halving spatial size ⇒ doubling #filters (§3.3).
+- **Definition**: The non-residual baseline architecture used to isolate the effect of residual learning. The 34-layer plain net has 3.6 GFLOPs (≈18% of VGG-19's 19.6 GFLOPs).
+- **Boundary conditions**: Used purely as a control. Plain-{18,34} on ImageNet and plain-{20,32,44,56,110} on CIFAR-10 are the studied depths.
+- **Related concepts**: Residual Network, Degradation Problem.
+
+## Batch Normalization (BN)
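+- **Notation**: $y = \gamma \cdot (x - \mu_B) / \sqrt{\sigma_B^2 + \epsilon} + \beta$, with the mean $\mu_B$ and variance $\sigma_B^2$ computed per channel over a mini-batch and $\epsilon$ a small constant for numerical stability.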
+- **Definition**: Per-channel normalization placed *right after each convolution and before activation* (§3.4). Used for all plain and residual nets in this paper, with no dropout.
+- **Boundary conditions**: At inference time, running statistics are used; for the COCO/PASCAL detection fine-tuning experiments, BN statistics are *frozen* after pre-training and BN behaves as an affine transform (Appendix A).
+- **Related concepts**: Plain Network, MSRA Initialization (RW04).
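+
+The two modes described above can be sketched in plain Python for a single channel. `batch_norm` normalizes with mini-batch statistics; `frozen_bn` shows how, once statistics are frozen, BN reduces to a fixed scale and shift. Both function names and the default `eps` value are assumptions made for illustration.
+
```python
import math


def batch_norm(channel_values, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode BN for one channel: normalize with mini-batch statistics."""
    n = len(channel_values)
    mean = sum(channel_values) / n
    var = sum((v - mean) ** 2 for v in channel_values) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (v - mean) * inv_std + beta for v in channel_values]


def frozen_bn(value, running_mean, running_var, gamma, beta, eps=1e-5):
    """Frozen BN: the same formula folded into a fixed affine transform."""
    scale = gamma / math.sqrt(running_var + eps)
    shift = beta - running_mean * scale
    return scale * value + shift


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
affine_out = frozen_bn(3.0, running_mean=1.0, running_var=4.0,
                       gamma=2.0, beta=0.5, eps=0.0)
print(normalized, affine_out)
```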
+
+## 10-Crop Testing
+- **Notation**: Forward 10 fixed crops per image (4 corner crops + 1 center crop, each with horizontal flip) and average the predicted class probabilities.
+- **Definition**: Standard test-time augmentation following Krizhevsky et al. used to report ImageNet validation error in Tables 2, 3.
+- **Boundary conditions**: Distinct from "single-model" results in Table 4, which additionally use fully-convolutional multi-scale testing at scales {224, 256, 384, 480, 640}; and from the test-set ensemble result of 3.57% top-5 in Table 5.
+- **Related concepts**: Multi-scale Testing, Ensembling.
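+
+A minimal sketch of how the ten crop positions can be enumerated, assuming a 224×224 crop size. `ten_crop_boxes` is an illustrative helper that only returns crop origins and a flip flag; it performs no image processing.
+
```python
def ten_crop_boxes(width, height, crop=224):
    """Return (left, top, flipped) for 4 corner crops + 1 center crop, each with and without a horizontal flip."""
    if width < crop or height < crop:
        raise ValueError("image is smaller than the crop size")
    right, bottom = width - crop, height - crop
    origins = [
        (0, 0),                     # top-left
        (right, 0),                 # top-right
        (0, bottom),                # bottom-left
        (right, bottom),            # bottom-right
        (right // 2, bottom // 2),  # center
    ]
    return [(left, top, flipped)
            for flipped in (False, True)
            for left, top in origins]


boxes = ten_crop_boxes(256, 256)
print(len(boxes), boxes[4])
```
+
+The class probabilities predicted for the ten views would then be averaged per image.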
@@ -0,0 +1,119 @@
+# Experiments
+
+## E01: Plain vs. residual at matched depth on ImageNet (18 vs. 34 layers)
+- **Verifies**: C01, C02
+- **Setup**:
+  - Model: 18-layer and 34-layer plain CNNs (VGG-style, 3.6 GFLOPs at 34 layers); residual counterparts adding identity shortcuts (Option A: zero-padding for dimension changes) — same parameter count.
+  - Hardware: GPU(s) — paper does not specify count for ImageNet, mini-batch size = 256.
+  - Dataset: ImageNet 2012 — 1.28M training images, 50k validation, 100k test (1000 classes).
+  - System: BN after every conv, MSRA initialization, SGD momentum 0.9, weight decay 1e-4, LR start 0.1 with /10 step on plateau, up to 60×10⁴ iterations, no dropout. Standard color/scale augmentation; 224×224 random crop on shorter-side ∈ [256, 480].
+- **Procedure**:
+  1. Train plain-{18, 34} from scratch with the recipe above.
+  2. Train ResNet-{18, 34} (option A shortcuts) with identical hyperparameters.
+  3. Evaluate top-1 / top-5 error on the 50k validation set with 10-crop testing.
+  4. Compare full training-error and validation-error trajectories (Fig. 4).
+- **Metrics**: Top-1 error (%), top-5 error (%), and training-error trajectory.
+- **Expected outcome**:
+  - Plain-34 has *higher* validation error than plain-18 throughout training (degradation).
+  - ResNet-34 has *lower* validation and training error than ResNet-18 (degradation removed).
+  - ResNet-34 outperforms plain-34 by a meaningful margin.
+- **Baselines**: Plain-{18, 34} (ablation control); each plain net is the control for its residual counterpart of the same depth.
+- **Dependencies**: none.
+
+## E02: Training-trajectory comparison plain vs. residual on ImageNet
+- **Verifies**: C01, C02
+- **Setup**:
+  - Model: Same plain-{18, 34} and ResNet-{18, 34} as E01.
+  - Hardware: same as E01.
+  - Dataset: ImageNet 2012 (training set tracked; validation tracked at intervals).
+  - System: same training pipeline.
+- **Procedure**:
+  1. During training, log training error (thin curves in Fig. 4) and center-crop validation error (bold curves in Fig. 4) for each model at regular iteration intervals; training runs for up to ~60×10⁴ iterations, while the Fig. 4 axes show up to 50×10⁴.
+  2. Compare residual vs. plain trajectories side-by-side at depths 18 and 34.
+- **Metrics**: Training error vs. iteration, validation (center-crop) error vs. iteration.
+- **Expected outcome**:
+  - Plain-34's training-error curve sits *above* plain-18's throughout training (degradation, not just a final-step issue).
+  - ResNet-34's curve sits *below* ResNet-18's (residual easing).
+  - ResNet-18 converges *faster* than plain-18 in early iterations even though their final accuracy is similar, indicating optimization easing.
+- **Baselines**: Plain-{18, 34} curves act as the controls.
+- **Dependencies**: E01.
+
+## E03: Depth scan with bottleneck blocks on ImageNet (50 / 101 / 152 layers)
+- **Verifies**: C03, C05
+- **Setup**:
+  - Model: ResNet-{50, 101, 152} built from 1×1 → 3×3 → 1×1 bottleneck blocks (Fig. 5 right) with option B shortcuts (projection only on dimension changes), per Table 1.
+  - Hardware: GPU(s) per E01.
+  - Dataset: ImageNet 2012.
+  - System: Same SGD recipe as E01.
+- **Procedure**:
+  1. Train ResNet-50, ResNet-101, ResNet-152 with the recipe in E01.
+  2. Evaluate with 10-crop testing on the 50k validation set; compare top-1 and top-5 error (Table 3).
+  3. Evaluate the single-model multi-scale fully-convolutional variant for Table 4.
+  4. Form an ensemble of six different-depth ResNets and evaluate on the ImageNet test set (Table 5).
+- **Metrics**: Top-1 / top-5 error (%) at each depth; FLOPs (Table 1); test-set top-5 error for the ensemble.
+- **Expected outcome**:
+  - Top-1 / top-5 error decreases monotonically from ResNet-50 → 101 → 152.
+  - All three are more accurate than ResNet-34 by considerable margins.
+  - The ResNet-152 single-model multi-scale result is more accurate than every prior published single-model on ImageNet.
+  - The 6-model ensemble outperforms all prior ensembles, taking 1st place in ILSVRC 2015 classification.
+- **Baselines**: ResNet-34 (within-family); VGG, GoogLeNet, PReLU-net, BN-inception (across-family).
+- **Dependencies**: E01.
+
+## E04: Identity vs. projection shortcut ablation on ResNet-34
+- **Verifies**: C04
+- **Setup**:
+  - Model: ResNet-34 with three shortcut variants — A (identity + zero-pad for dim changes, parameter-free), B (projections only when dimensions change), C (projections on every shortcut).
+  - Hardware: GPU(s) per E01.
+  - Dataset: ImageNet 2012.
+  - System: Same training recipe as E01.
+- **Procedure**:
+  1. Train ResNet-34 A, B, C from scratch with identical hyperparameters.
+  2. Evaluate top-1 / top-5 error on the 50k validation set with 10-crop testing.
+  3. Compare to plain-34 to quantify the residual learning gain at each shortcut variant.
+- **Metrics**: Top-1 / top-5 error (%); parameter count delta from extra projection shortcuts.
+- **Expected outcome**:
+  - All three options outperform plain-34 by a sizable margin.
+  - Differences among A, B, C are small (within ~1 top-1 point), with C marginally best, B slightly better than A.
+  - Conclusion: identity shortcuts suffice for the degradation problem; option C's modest gain does not justify the extra parameters / memory.
+- **Baselines**: Plain-34.
+- **Dependencies**: E01.
+
+## E05: CIFAR-10 depth scan and 1202-layer stress test
+- **Verifies**: C06, C07
+- **Setup**:
+  - Model: CIFAR ResNet family with 6n+2 layers — n ∈ {3, 5, 7, 9, 18, 200} ⇒ depths {20, 32, 44, 56, 110, 1202}; 16/32/64 filters across the three feature-map sizes; option A identity shortcuts; ~0.27M params (n=3) up to 19.4M params (n=200).
+  - Hardware: 2 GPUs.
+  - Dataset: CIFAR-10 — 50k training, 10k test, 32×32 images, 10 classes; per-pixel mean subtraction; 4-pixel padded random crop and horizontal flip augmentation following [24].
+  - System: SGD momentum 0.9, weight decay 1e-4, MSRA init, BN, no dropout, mini-batch 128, 45k/5k train/val split, LR 0.1 with /10 at iter 32k and 48k, terminate at 64k iters. For the 110-layer net: warm up at LR 0.01 for ~400 iters until training error <80%, then restore LR 0.1.
+- **Procedure**:
+  1. Train ResNet-{20, 32, 44, 56, 110} on CIFAR-10 with the recipe above; train plain-{20, 32, 44, 56, 110} as controls.
+  2. Run the 110-layer ResNet 5 times and report the best run together with mean ± std (the "best (mean ± std)" format used in Table 6).
+  3. Train ResNet-1202 with the same recipe (no warmup needed at this depth per §"Exploring Over 1000 layers", since the n=18 warmup is the only special case noted).
+  4. Compare to FitNet, Highway, Maxout, NIN, DSN baselines (Table 6).
+- **Metrics**: Test-set classification error (%); parameter count (#params); training-error trajectory.
+- **Expected outcome**:
+  - ResNet test error decreases as depth grows from 20 → 110.
+  - ResNet-1202 trains successfully (training error <0.1%) but its test error worsens versus the 110-layer model, indicating overfitting on this small dataset.
+  - Plain-{56, 110} suffer the degradation problem; plain-110 is even reported to have >60% error and is not displayed.
+- **Baselines**: Plain-{20, 32, 44, 56, 110}; Maxout, NIN, DSN, FitNet, Highway.
+- **Dependencies**: E01 (for the residual recipe pattern).
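+
+The depth rule and learning-rate schedule in the setup can be written down as a short sketch. The paper ends warmup when training error falls below 80%; the fixed `warmup_iters` count used here is a simplifying assumption based on the "~400 iterations" figure, and both function names are illustrative.
+
```python
def cifar_resnet_depth(n):
    """Total layer count of the CIFAR ResNet family: 6n + 2."""
    return 6 * n + 2


def learning_rate(iteration, warmup_iters=0):
    """Step schedule: 0.1, divided by 10 at 32k and again at 48k iterations.

    If warmup_iters > 0, the first warmup_iters iterations use LR 0.01.
    """
    if iteration < warmup_iters:
        return 0.01
    if iteration < 32_000:
        return 0.1
    if iteration < 48_000:
        return 0.01
    return 0.001


depths = [cifar_resnet_depth(n) for n in (3, 5, 7, 9, 18, 200)]
schedule_110 = [learning_rate(i, warmup_iters=400)
                for i in (0, 399, 400, 31_999, 32_000, 48_000, 63_999)]
print(depths, schedule_110)
```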
+
+## E06: COCO and PASCAL VOC detection transfer with ResNet-101
+- **Verifies**: C08
+- **Setup**:
+  - Model: Faster R-CNN with two backbones — VGG-16 and ResNet-101. ResNet-101 is fine-tuned per Appendix A: full-image shared conv features through conv4_x; RoI pooling before conv5_x; conv5_x and up act as VGG's fc layers; final classification/box regression replaced. BN layers frozen during fine-tuning.
+  - Hardware: 8 GPUs; mini-batch 8 images for RPN step (1/GPU), 16 for Fast R-CNN step.
+  - Dataset: PASCAL VOC 2007 (5k trainval) + VOC 2012 (16k trainval) for "07+12"; "07++12" instead uses the 10k VOC07 trainval+test images together with VOC 2012 trainval, for VOC12 evaluation. COCO 80k train + 40k val (val used for evaluation in Tables 8/9).
+  - System: Faster R-CNN baseline hyperparameters from [32]; Detection-network LR 0.001 for 240k iters, then 0.0001 for 80k iters; 4-step alternating training.
+- **Procedure**:
+  1. Replace VGG-16 with ResNet-101 in baseline Faster R-CNN; fine-tune on PASCAL VOC 07+12 and report VOC07 test mAP@.5 (Table 7).
+  2. Fine-tune the same model on VOC 07++12 and report VOC12 test mAP@.5 (Table 7).
+  3. Train on COCO train, evaluate on COCO val and report mAP@.5 and mAP@[.5,.95] (Table 8 baseline rows).
+  4. Add box refinement, global context, multi-scale testing, ensemble (Table 9) for the competition entry.
+- **Metrics**: PASCAL VOC mAP@.5 (%); COCO mAP@.5 (%); COCO mAP@[.5,.95] (%).
+- **Expected outcome**:
+  - ResNet-101 outperforms VGG-16 by a clear margin on every detection metric.
+  - The COCO mAP@[.5,.95] gain (6.0 absolute points, about 28% relative) is nearly as large as the gain on the looser COCO mAP@.5 (6.9 absolute points), suggesting deeper features help both recognition and localization.
+  - Adding box refinement, context, multi-scale, and ensemble further boosts COCO test-dev to >55% mAP@.5 / >34% mAP@[.5,.95].
+- **Baselines**: VGG-16 backbone in Faster R-CNN.
+- **Dependencies**: E03 (ResNet-101 must first be trained on ImageNet).
@@ -0,0 +1,53 @@
+# Problem Specification
+
+## Observations
+
+### O1: Plain CNNs degrade with depth on CIFAR-10
+- **Statement**: A 56-layer plain CNN reaches *higher* training error and *higher* test error than a 20-layer plain CNN on CIFAR-10 (Fig. 1, §1).
+- **Evidence**: Fig. 1 (training/test error curves vs. iterations) and Fig. 6 (left) for plain-{20,32,44,56,110}.
+- **Implication**: Adding layers to a working network does not monotonically improve — and sometimes hurts — even training accuracy, contradicting the construction argument that a deeper net can at least match a shallower one by learning identity in the extra layers.
+
+### O2: Same degradation appears on ImageNet
+- **Statement**: The 34-layer plain net obtains 28.54% top-1 ImageNet validation error vs. 27.94% for the 18-layer plain net; the deeper plain net is worse despite having strictly more capacity (Table 2, Fig. 4 left).
+- **Evidence**: Table 2 (plain vs. ResNet, 18/34 layers); Fig. 4 left (plain training/validation curves).
+- **Implication**: The degradation is not specific to CIFAR-scale data and is observed throughout the whole training trajectory, not just at the end.
+
+### O3: Degradation is not caused by vanishing gradients
+- **Statement**: Plain networks here use Batch Normalization, which keeps forward signals at non-zero variance, and backward gradient norms are verified to be healthy. The 34-layer plain net is "still able to achieve competitive accuracy" (28.54% top-1 with 10-crop testing, Table 3), so SGD is making progress, just slowly.
+- **Evidence**: §4.1 "Plain Networks" discussion; Table 3 (plain-34 = 28.54% top-1 / 10.02% top-5).
+- **Implication**: The bottleneck is *optimization difficulty* — exponentially low convergence rates conjectured — not signal collapse.
+
+### O4: Identity-mapping construction proves a deeper net *should* be at least as good
+- **Statement**: Given a shallow network, one can construct a deeper one by appending identity layers; this deeper construction has, by definition, training error ≤ the shallow net's. Yet SGD does not find it (§1, §3.1).
+- **Evidence**: §1 introduction argument.
+- **Implication**: Solvers struggle to approximate identity mappings via stacks of nonlinear layers, motivating an architecture that makes identity easy to express.
+
+### O5: Layer responses in plain nets have larger magnitudes than in ResNets
+- **Statement**: The standard deviation of 3×3 layer outputs (after BN, before the nonlinearity) is generally smaller for ResNets than for their plain counterparts, and among ResNets it shrinks further as depth grows (Fig. 7).
+- **Evidence**: Fig. 7 (CIFAR-10 layer-response std for plain-{20,56} vs. ResNet-{20,56,110}).
+- **Implication**: Empirical support for the prior that, in real tasks, optimal mappings are closer to identity than to zero — so a residual parameterization is a better-conditioned starting point.
+
+## Gaps
+
+### G1: No way to train networks much deeper than ~20–30 layers
+- **Statement**: As of 2015, leading ImageNet models top out at roughly 16 to 30 layers (e.g., VGG-16/19 and the 22-layer GoogLeNet); naive deeper variants degrade (O1, O2).
+- **Caused by**: O1, O2, O3, O4.
+- **Existing attempts**: Better initialization (Xavier/MSRA), Batch Normalization, intermediate auxiliary classifiers, highway networks with gated shortcuts.
+- **Why they fail**: They allow tens-of-layer nets to converge but do not enable monotonic accuracy gains with extreme depth. Highway gates are data-dependent and have parameters; when a gate closes, the layer behaves non-residually, and highway networks have not shown gains beyond ~100 layers.
+
+### G2: No parameter-free, drop-in mechanism to bias optimization toward near-identity solutions
+- **Statement**: Existing shortcut variants either add parameters (projection / gated) or change the function class (highway).
+- **Caused by**: O4, O5.
+- **Existing attempts**: Linear / projection shortcuts, gated skip connections.
+- **Why they fail**: They couple the shortcut to additional learnable parameters or close the residual path entirely.
+
+## Key Insight
+- **Insight**: Reformulate each block to fit a *residual* mapping F(x) := H(x) − x, so the original mapping is recovered as F(x) + x via a parameter-free identity shortcut. If the optimal mapping is close to identity, the solver only needs to push F toward zero, which is empirically easier than fitting H from scratch (§3.1).
+- **Derived from**: O3, O4, O5.
+- **Enables**: End-to-end training of networks with 50, 101, 152, and even 1202 layers without auxiliary losses, without changing the optimizer, and without adding parameters relative to the plain counterpart.
+
+## Assumptions
+- A1: Multiple stacked nonlinear layers can asymptotically approximate complex functions (and, by the same hypothesis, residual functions). This is noted as still an open question (footnote 2).
+- A2: Optimal mappings in real recognition tasks are closer to identity than to a generic random function — supported a posteriori by Fig. 7.
+- A3: Standard SGD with momentum and BN is sufficient as the optimizer; no second-order or specialized solver is required.
+- A4: Identity shortcuts can be applied wherever input and output dimensions agree; dimension-changing shortcuts use either zero-padding (option A) or 1×1 projections (option B/C).
@@ -0,0 +1,149 @@
+# Related Work
+
+## RW01: Highway Networks (Srivastava, Greff & Schmidhuber, 2015)
+- **DOI**: arXiv:1505.00387 (Highway), arXiv:1507.06228 (Training very deep nets)
+- **Type**: refutes (the gating choice)
+- **Delta**:
+  - What changed: Replace highway's *data-dependent gated* shortcuts with parameter-free identity shortcuts.
+  - Why: When a highway gate "closes" (approaches zero) the layer becomes non-residual; highway networks have not demonstrated accuracy gains beyond ~100 layers.
+- **Claims affected**: C02, C03, C06.
+- **Adopted elements**: The general idea of skip connections, but the gating mechanism is rejected.
+
+## RW02: VGG (Simonyan & Zisserman, 2015)
+- **DOI**: arXiv:1409.1556 (ref [41]; "Very deep convolutional networks for large-scale image recognition")
+- **Type**: baseline
+- **Delta**:
+  - What changed: Adopt VGG's design philosophy of stacked 3×3 convs and the "same map size ⇒ same #filters; halved size ⇒ doubled #filters" rule, but at much greater depth and lower complexity (3.6 GFLOPs at 34 layers vs. VGG-19's 19.6 GFLOPs).
+  - Why: VGG provides the cleanest plain-net baseline against which residual learning can be measured.
+- **Claims affected**: C01, C03, C08.
+- **Adopted elements**: VGG-style design rules; reference numbers for FLOPs comparison.
+
+## RW03: Batch Normalization (Ioffe & Szegedy, 2015)
+- **DOI**: ref [16] in the paper
+- **Type**: imports
+- **Delta**:
+  - What changed: BN is applied after every conv and before activation in *both* plain and residual nets. The paper uses BN to verify that vanishing forward signals are *not* the cause of degradation.
+  - Why: BN is the standard tool for preventing signal collapse in deep nets and isolates the optimization-difficulty hypothesis.
+- **Claims affected**: C01, C02, C03.
+- **Adopted elements**: BN architecture; "no dropout" recipe combined with BN (also from this work).
+
+## RW04: MSRA initialization (He, Zhang, Ren & Sun, 2015 — "Delving deep into rectifiers")
+- **DOI**: ref [13] in the paper
+- **Type**: imports
+- **Delta**:
+  - What changed: Initialize all conv weights with the MSRA scheme. The exact same init is used for plain and residual variants.
+  - Why: Pairing PReLU-aware init with BN gives the cleanest starting point for the depth ablation.
+- **Claims affected**: C01, C02, C06.
+- **Adopted elements**: Weight initialization scheme.
+
+## RW05: GoogLeNet / Inception (Szegedy et al., 2015)
+- **DOI**: ref [44] in the paper
+- **Type**: baseline
+- **Delta**:
+  - What changed: Compare against GoogLeNet (ILSVRC'14) on top-5 error (9.15 with 10-crop testing in Table 3; 7.89 in the single-model comparison of Table 4) without adopting Inception's branching topology.
+  - Why: A second strong reference point alongside VGG; GoogLeNet uses an "inception layer" composed of a shortcut branch and a few deeper branches.
+- **Claims affected**: C03 (state-of-the-art comparison).
+- **Adopted elements**: None architecturally; only the comparison.
+
+## RW06: PReLU-net / "Surpassing human-level on ImageNet" (He et al., 2015)
+- **DOI**: ref [13] (same group)
+- **Type**: baseline
+- **Delta**:
+  - What changed: Used as a prior single-model reference on ImageNet validation (24.27 top-1 / 7.38 top-5 with 10-crop testing in Table 3; 21.59 / 5.71 in Table 4). ResNet-152 single-model surpasses it (19.38 / 4.49 in Table 4).
+  - Why: Direct comparison point at single-model level.
+- **Claims affected**: C03.
+- **Adopted elements**: None.
+
+## RW07: BN-Inception (Ioffe & Szegedy, 2015)
+- **DOI**: ref [16]
+- **Type**: baseline
+- **Delta**:
+  - What changed: Used as the prior single-model SOTA on ImageNet (21.99 / 5.81 in Table 4) and as the prior best ensemble (4.82 in Table 5).
+  - Why: Most competitive single-model and ensemble baselines available at the time.
+- **Claims affected**: C03.
+- **Adopted elements**: None architecturally; ensemble/single-model comparison points.
+
+## RW08: Residual representations — VLAD / Fisher Vector / Encoding residual vectors
+- **DOI**: refs [18], [17] (residual encoding); refs [4], [48] (VLAD/Fisher use)
+- **Type**: extends
+- **Delta**:
+  - What changed: Generalize the *residual representation* idea (encoding residuals rather than originals, e.g., VLAD encoding by residual vectors w.r.t. a dictionary) to deep learning by reformulating each layer as a residual function.
+  - Why: Provides motivation that residual representations can be "more effective" for retrieval/classification (refs [4], [48]) and have a long history.
+- **Claims affected**: C02.
+- **Adopted elements**: Conceptual motivation only.
+
+## RW09: Multigrid / hierarchical basis preconditioning (Briggs et al., 2000; Szeliski 1990/2006)
+- **DOI**: refs [3], [45], [46]
+- **Type**: extends
+- **Delta**:
+  - What changed: Carry the low-level vision insight that solvers operating on residual variables converge much faster than solvers unaware of the residual nature into deep CNNs.
+  - Why: Provides a precedent that "good reformulation or preconditioning can simplify the optimization."
+- **Claims affected**: C02 (motivation for why residual easing is plausible).
+- **Adopted elements**: Conceptual motivation only.
+
+## RW10: Faster R-CNN (Ren, He, Girshick & Sun, 2015)
+- **DOI**: ref [32]
+- **Type**: imports
+- **Delta**:
+  - What changed: Use Faster R-CNN as the detection framework, swapping the VGG-16 backbone with ResNet-101 (Appendix A). Otherwise identical hyperparameters in the baseline rows.
+  - Why: Isolates the contribution of the backbone (i.e., the learned representation).
+- **Claims affected**: C08.
+- **Adopted elements**: 4-step alternating training, RPN + Fast R-CNN, anchor design.
+
+## RW11: Networks on Conv feature maps / NoC (Ren, He, Girshick, Zhang & Sun, 2015)
+- **DOI**: ref [33]
+- **Type**: imports
+- **Delta**:
+  - What changed: Use the NoC idea to share full-image conv features through layers with stride ≤16 and treat conv5_x as the per-RoI fc-equivalent in ResNet-101 Faster R-CNN.
+  - Why: Lets ResNet-101 (which lacks hidden fc layers) plug into the Faster R-CNN architecture cleanly.
+- **Claims affected**: C08.
+- **Adopted elements**: Conv-feature-sharing strategy.
+
+## RW12: PASCAL VOC (Everingham et al., 2010); ImageNet (Russakovsky et al., 2015); COCO (Lin et al., 2014)
+- **DOI**: refs [5], [36], [26]
+- **Type**: imports (datasets)
+- **Delta**: Used as the evaluation datasets for ImageNet classification, ImageNet localization, COCO detection, and PASCAL VOC detection benchmarks.
+- **Claims affected**: C03, C06, C08.
+- **Adopted elements**: Standard splits and evaluation protocols.
+
+## Briefer citations (no specific technical delta, captured for citation footprint)
+
+- [1] Bengio, Simard & Frasconi, "Learning long-term dependencies with gradient descent is difficult" (1994) — historical reference for gradient-vanishing problem (§1).
+- [2] Bishop, *Neural networks for pattern recognition* (1995) — historical reference on shortcut connections.
+- [6] Gidaris & Komodakis, "Object detection via multi-region & semantic segmentation-aware CNN" (2015) — context for VOC2012 SOTA discussion in Appendix B.
+- [7] Girshick, "Fast R-CNN" (2015) — Fast R-CNN ROI pooling, used inside Faster R-CNN.
+- [8] Girshick et al., "Rich feature hierarchies (R-CNN)" (2014) — R-CNN, referenced as the precursor pipeline used in ImageNet localization (Appendix C).
+- [9] Glorot & Bengio, "Understanding the difficulty of training deep feedforward NNs" (AISTATS 2010) — Xavier init reference, baseline for initialization discussion.
+- [10] Goodfellow et al., "Maxout networks" (2013) — CIFAR-10 baseline in Table 6.
+- [11] He, Zhang, Ren & Sun, "Convolutional networks at constrained time cost" (2015) — corroborating prior report of degradation (§1).
+- [12] He, Zhang, Ren & Sun, "Spatial pyramid pooling in deep convs" (2014) — SPP/RoI feature pyramid; used in COCO multi-scale testing.
+- [14] Hinton et al., "Improving NNs by preventing co-adaptation (dropout)" (2012) — explicitly *not* used here (no dropout).
+- [15] Hochreiter & Schmidhuber, "Long short-term memory" (1997) — gating motivation referenced when contrasting with highway networks.
+- [17] Jegou, Douze & Schmid, "Product quantization for nearest neighbor" (TPAMI 2011) — residual-vector encoding precedent.
+- [19] Jia et al., "Caffe" (2014) — implementation library reference.
+- [20] Krizhevsky, "Multiple layers of features from tiny images" (2009) — CIFAR-10 dataset.
+- [21] Krizhevsky, Sutskever & Hinton, "ImageNet classification with deep CNNs" (NIPS 2012) — AlexNet, historical reference for color/scale augmentation and 10-crop testing.
+- [22] LeCun et al., "Backpropagation applied to handwritten zip code recognition" (1989) — historical reference for SGD with backprop.
+- [23] LeCun et al., "Efficient backprop" (1998) — historical reference for normalized initialization.
+- [24] Lee et al., "Deeply-supervised nets (DSN)" (2014) — auxiliary classifier baseline; CIFAR-10 augmentation recipe (4-pixel pad + crop) borrowed.
+- [25] Lin, Chen & Yan, "Network in network (NIN)" (2013) — CIFAR-10 baseline in Table 6.
+- [27] Long, Shelhamer & Darrell, "Fully convolutional networks for semantic segmentation" (CVPR 2015) — referenced for fully-convolutional dense testing.
+- [28] Montufar, Pascanu, Cho & Bengio, "On the number of linear regions of deep NNs" (NIPS 2014) — cited at footnote 2 for the asymptotic-approximation hypothesis.
+- [29] Nair & Hinton, "ReLU" (ICML 2010) — activation function used throughout.
+- [30] Perronnin & Dance, "Fisher kernels on visual vocabularies" (CVPR 2007) — historical residual-encoding precedent.
+- [31] Raiko, Valpola & LeCun, "Deep learning made easier by linear transformations in perceptrons" (AISTATS 2012) — linear-shortcut precedent.
+- [33] Ren, He, Girshick & Sun, "Object detection networks on conv feature maps (NoC)" (arXiv:1504.06066) — see RW11.
+- [34] Ripley, *Pattern recognition and neural networks* (1996) — historical shortcut reference.
+- [35] Romero et al., "FitNets: Hints for thin deep nets" (ICLR 2015) — CIFAR-10 baseline in Table 6.
+- [37] Saxe, McClelland & Ganguli, "Exact solutions to the nonlinear dynamics of learning in deep linear NNs" (arXiv:1312.6120) — referenced for normalized initialization.
+- [38] Schraudolph, "Accelerated gradient descent by factor-centering decomposition" (1998) — precondition / centering precedent.
+- [39] Schraudolph, "Centering NN gradient factors" (1998) — see [38].
+- [40] Sermanet et al., "OverFeat" (ICLR 2014) — ImageNet localization baseline in Table 14.
+- [42] Srivastava, Greff & Schmidhuber, "Highway networks" (arXiv:1505.00387) — see RW01.
+- [43] Srivastava, Greff & Schmidhuber, "Training very deep networks" (arXiv:1507.06228) — see RW01.
+- [45] Szeliski, "Fast surface interpolation using hierarchical basis functions" (TPAMI 1990) — see RW09.
+- [46] Szeliski, "Locally adapted hierarchical basis preconditioning" (SIGGRAPH 2006) — see RW09.
+- [47] Vatanen, Raiko, Valpola & LeCun, "Pushing stochastic gradient towards second-order methods" (NIPS 2013) — second-order/preconditioning context.
+- [48] Vedaldi & Fulkerson, "VLFeat" (2008) — VLFeat library / VLAD support.
+- [49] Venables & Ripley, *Modern applied statistics with s-plus* (1999) — historical shortcut reference.
+- [50] Zeiler & Fergus, "Visualizing and understanding convolutional networks" (ECCV 2014) — referenced for "low/mid/high-level features" framing in §1.
@@ -0,0 +1,54 @@
+# Algorithm: Residual Learning
+
+## Mathematical formulation
+
+Let $\mathcal{H}(\mathbf{x})$ denote the underlying mapping a stack of layers should fit, with $\mathbf{x}$ the input to the first layer in the stack. Rather than fitting $\mathcal{H}$ directly, residual learning hypothesizes a residual function
+
+$$\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}.$$
+
+The original mapping is recovered as $\mathcal{F}(\mathbf{x}) + \mathbf{x}$. A residual *building block* (Fig. 2) is
+
+$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} \qquad (\text{Eqn. 1})$$
+
+When the input and output dimensions differ (e.g., on stride-2 down-sampling stages), a linear projection $W_s$ is introduced on the shortcut:
+
+$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x} \qquad (\text{Eqn. 2})$$
+
+For a 2-layer basic block, $\mathcal{F} = W_2 \, \sigma(W_1 \mathbf{x})$, where $\sigma$ is ReLU and biases are omitted. The element-wise add is followed by a second ReLU: $\sigma(\mathbf{y})$. For a 3-layer bottleneck block, $\mathcal{F} = W_3 \, \sigma(W_2 \, \sigma(W_1 \mathbf{x}))$, where $W_1, W_3$ are 1×1 convolutions and $W_2$ is 3×3.
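
As a worked instance of Eqn. 1, the sketch below evaluates the 2-layer basic block on plain Python lists, with small dense matrices standing in for the convolutions. The helper names (`relu`, `matvec`, `residual_block`) are illustrative assumptions and do not come from the paper. It shows that when both weight matrices are zero, $\mathcal{F}(\mathbf{x}) = 0$ and the block reduces to $\sigma(\mathbf{x})$.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU (sigma in the text)."""
    return [max(0.0, a) for a in v]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Dense matrix-vector product; a stand-in for a conv layer without bias."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Return sigma(F(x) + x) with F(x) = W2 * sigma(W1 * x)."""
    f = matvec(w2, relu(matvec(w1, x)))
    y = [f_i + x_i for f_i, x_i in zip(f, x)]  # Eqn. 1: y = F(x) + x
    return relu(y)                             # second nonlinearity after the add


zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block([1.5, 2.0], zeros, zeros))  # F(x) = 0, so the output is relu(x)
```

With zero weights the block passes a non-negative input through unchanged, which is the sense in which driving the residual to zero recovers an identity mapping.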
+
+## Pseudocode (forward pass of one residual block)
+
+```
+function ResidualBlock(x, F, shortcut_type, downsample):
+    # F is the residual function (basic or bottleneck)
+    out = F(x)                        # conv -> BN -> ReLU -> conv -> BN
+                                      # (or 1x1 -> 3x3 -> 1x1 for bottleneck)
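+    dims_match = dim(out) == dim(x) and not downsample
+    if shortcut_type == "C":
+        identity = projection(x)      # option C: 1x1 conv on every shortcut
+    else if dims_match:
+        identity = x                  # options A / B: parameter-free identity shortcut
+    else if shortcut_type == "A":
+        identity = zero_pad_channels(maybe_stride2(x))   # option A
+    else:
+        identity = projection(x)      # option B: 1x1 conv (+stride 2)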
+    out = out + identity              # element-wise add
+    return ReLU(out)
+```
+
+`maybe_stride2(x)` performs the 2× spatial down-sampling that matches the residual function's stride. For ImageNet ResNets, down-sampling occurs at the *first* block of stages `conv3_x`, `conv4_x`, `conv5_x`.
+
+The full network is a stage-by-stage stack of `ResidualBlock` instances (architecture.md), preceded by the 7×7 stem and followed by global average pool + fc + softmax.
+
+## Step-by-step explanation
+1. **Pre-process**: resize so the shorter side is randomly sampled from [256, 480]; take a random 224×224 crop from the image or its horizontal flip; subtract the per-pixel mean; apply standard color augmentation.
+2. **Stem**: 7×7 conv stride 2 → BN → ReLU → 3×3 max-pool stride 2.
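+3. **Residual stages**: for each residual block, compute $\mathcal{F}(\mathbf{x})$ with 2 (basic) or 3 (bottleneck) conv layers, each followed by BN; every conv+BN pair except the last is followed by ReLU, and the last feeds the addition directly.
+4. **Shortcut add**: identity where dimensions match; where they change, zero-padded identity (option A) or 1×1 projection (option B); option C projects on every shortcut. The two paths are summed element-wise.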
+5. **Block output**: ReLU after the add.
+6. **Head**: global average pool over the 7×7 final map → 1000-way fc → softmax.
+
+## Complexity analysis
+- Identity shortcut adds **zero** parameters and only an element-wise addition (negligible vs. the conv FLOPs).
+- Projection shortcut adds parameters proportional to $C_{in} \cdot C_{out}$ for a 1×1 conv, with FLOPs of $C_{in} \cdot C_{out} \cdot H_{out} \cdot W_{out}$.
+- A bottleneck block with identity shortcut has time complexity similar to that of a basic 2-layer block (paper §"Deeper Bottleneck Architectures"); replacing its identity with a projection roughly *doubles* the block's time complexity and model size.
+- Whole-network FLOPs: ResNet-{18, 34, 50, 101, 152} = {1.8, 3.6, 3.8, 7.6, 11.3} × 10⁹ (Table 1). VGG-19 = 19.6 × 10⁹.
+- For comparison, the 34-layer plain net has 3.6 × 10⁹ FLOPs (≈18% of VGG-19) — adding identity shortcuts to make ResNet-34 leaves FLOPs unchanged.
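
The complexity claims above can be checked with a few lines of arithmetic. The sketch below counts multiply-adds per output position for the two blocks of Fig. 5 (a 64-channel basic block and a 256-channel bottleneck) and for a hypothetical bottleneck whose identity shortcut is swapped for a 1×1 projection. The function name `conv_madds` is an assumption for illustration, and BN, ReLU and the element-wise add are ignored.

```python
def conv_madds(kernel: int, c_in: int, c_out: int) -> int:
    """Multiply-adds per output position of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out


# Basic block (Fig. 5 left): two 3x3 convs on 64 channels.
basic = 2 * conv_madds(3, 64, 64)

# Bottleneck (Fig. 5 right): 1x1 (256->64), 3x3 (64->64), 1x1 (64->256).
bottleneck = (
    conv_madds(1, 256, 64) + conv_madds(3, 64, 64) + conv_madds(1, 64, 256)
)

# Hypothetical variant: identity shortcut replaced by a 1x1 projection (256->256).
bottleneck_proj = bottleneck + conv_madds(1, 256, 256)

print(basic, bottleneck, bottleneck_proj)
print(round(bottleneck_proj / bottleneck, 2))       # cost ratio with projection
print(round(3.6 / 19.6, 3), round(11.3 / 19.6, 3))  # FLOPs relative to VGG-19
```

Under these assumptions the basic block costs 73,728 multiply-adds per position and the bottleneck 69,632, which is the "similar time complexity" claim; adding the projection brings the bottleneck to 135,168, about 1.94 times its identity-shortcut cost.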
@@ -0,0 +1,64 @@
+# Architecture
+
+The ImageNet ResNet family is a stage-structured CNN. Per Table 1 / Fig. 3, every variant
+shares the same five-stage skeleton; only the residual-block type and the per-stage block
+count change with depth.
+
+## Component graph
+
+### Stem (`conv1`)
+- **Inputs**: 224×224×3 image (per-pixel mean subtracted, ImageNet color/scale augmented).
+- **Operation**: 7×7 conv, 64 filters, stride 2 → BN → ReLU → 3×3 max-pool, stride 2.
+- **Outputs**: 56×56×64 feature map.
+- **Notes**: Identical for every ResNet-{18, 34, 50, 101, 152}.
+
+### Stage `conv2_x` (output 56×56)
+- **Inputs**: 56×56×64.
+- **Operation**: Stack of residual blocks. Basic block (ResNet-18/34): two 3×3 convs, 64 filters. Bottleneck (ResNet-50/101/152): 1×1, 64 → 3×3, 64 → 1×1, 256.
+- **Block counts**: ResNet-18 ×2, ResNet-34 ×3, ResNet-50/101/152 ×3.
+- **Outputs**: 56×56×64 (basic) or 56×56×256 (bottleneck).
+- **Notes**: First block of `conv2_x` does not down-sample.
+
+### Stages `conv3_x` (28×28), `conv4_x` (14×14), `conv5_x` (7×7)
+- **Inputs**: previous stage's feature map.
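+- **Operation**: First block of each stage (`conv3_1`, `conv4_1`, `conv5_1`) uses **stride 2** to halve spatial size while the channel count doubles; Table 1 does not say which conv inside a bottleneck block carries the stride. Subsequent blocks keep size and channels.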
+- **Block counts (basic)**: ResNet-18 — {2, 2, 2}; ResNet-34 — {4, 6, 3}.
+- **Block counts (bottleneck)**: ResNet-50 — {4, 6, 3}; ResNet-101 — {4, 23, 3}; ResNet-152 — {8, 36, 3}.
+- **Channel widths (basic)**: 128 → 256 → 512.
+- **Channel widths (bottleneck)**: middle 128/256/512, output 512/1024/2048.
+- **Outputs**: 28×28×{128 | 512}, 14×14×{256 | 1024}, 7×7×{512 | 2048} respectively.
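
The spatial sizes listed for the stem and stages follow from the standard convolution output-size rule. The sketch below traces them for a 224×224 input. The padding values (3 for the 7×7 conv, 1 for the max-pool and the 3×3 convs) are assumptions chosen to reproduce the sizes in Table 1; the paper does not state them, and the helper names are illustrative.

```python
from typing import List


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output side length of a conv or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_spatial_sizes(input_size: int = 224) -> List[int]:
    """Side lengths after conv1, conv2_x, conv3_x, conv4_x, conv5_x."""
    sizes = []
    s = conv_out(input_size, kernel=7, stride=2, padding=3)  # conv1 (assumed pad 3)
    sizes.append(s)
    s = conv_out(s, kernel=3, stride=2, padding=1)           # max-pool (assumed pad 1)
    sizes.append(s)                                          # conv2_x keeps this size
    for _ in range(3):                                       # conv3_1, conv4_1, conv5_1
        s = conv_out(s, kernel=3, stride=2, padding=1)       # stride-2 down-sampling
        sizes.append(s)
    return sizes


print(resnet_spatial_sizes())  # [112, 56, 28, 14, 7]
```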
+
+### Residual block variants
+- **Basic block (ResNet-18/34)**: `[3×3 conv → BN → ReLU → 3×3 conv → BN] + shortcut → ReLU` (Fig. 2). Total 2 weighted layers.
+- **Bottleneck block (ResNet-50/101/152)**: `[1×1 conv → BN → ReLU → 3×3 conv → BN → ReLU → 1×1 conv → BN] + shortcut → ReLU` (Fig. 5 right). Total 3 weighted layers.
+- **Shortcut paths**:
+  - Identity (option A): direct add; for dimension-changing blocks, zero-pad extra channels and use stride-2 sampling.
+  - Projection (option B/C): 1×1 conv with stride 2 (when down-sampling) and matching output channels. Default in deeper bottleneck nets is option B (projection only on dimension changes).
+
+### Head
+- **Inputs**: 7×7×{512 | 2048} feature map.
+- **Operation**: Global average pooling → 1000-d fully-connected softmax.
+- **Outputs**: 1000-class probability vector.
+
+### CIFAR-10 architecture (separate variant)
+- 32×32×3 input → first 3×3 conv with 16 filters → three feature-map sizes {32, 16, 8} with widths {16, 32, 64} and 2n layers each (total 6n+2 weighted layers) → global average pool → 10-way fc + softmax.
+- Down-sampling done by stride-2 convolutions; **option A identity shortcuts everywhere**, so the residual nets have *exactly* the same depth, width, and number of parameters as the plain counterparts.
+
+## Per-depth complexity (from Table 1)
+
+| Depth | Block type | FLOPs | Block layout (conv2..conv5) |
+|-------|------------|-------|------------------------------|
+| 18  | basic       | 1.8×10⁹ | 2, 2, 2, 2 |
+| 34  | basic       | 3.6×10⁹ | 3, 4, 6, 3 |
+| 50  | bottleneck  | 3.8×10⁹ | 3, 4, 6, 3 |
+| 101 | bottleneck  | 7.6×10⁹ | 3, 4, 23, 3 |
+| 152 | bottleneck  | 11.3×10⁹ | 3, 8, 36, 3 |
+
+For reference: VGG-19 = 19.6×10⁹ FLOPs (Fig. 3 caption; §3.3), so ResNet-152 is roughly **8× deeper** at **~57% of the FLOPs**.
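
The nominal depths in the table can be recovered from the block layouts: each basic block contributes 2 weighted layers, each bottleneck 3, plus `conv1` and the final fc layer. The sketch below checks this, together with the `6n + 2` rule for the CIFAR-10 variant. The dictionary and function names are illustrative, and projection shortcuts are not counted as layers.

```python
from typing import Dict, List, Tuple

# Block layouts (conv2_x .. conv5_x) as listed in Table 1.
LAYOUTS: Dict[int, Tuple[str, List[int]]] = {
    18: ("basic", [2, 2, 2, 2]),
    34: ("basic", [3, 4, 6, 3]),
    50: ("bottleneck", [3, 4, 6, 3]),
    101: ("bottleneck", [3, 4, 23, 3]),
    152: ("bottleneck", [3, 8, 36, 3]),
}
LAYERS_PER_BLOCK = {"basic": 2, "bottleneck": 3}


def imagenet_depth(block_type: str, blocks: List[int]) -> int:
    """Weighted layers: all block convs plus conv1 and the fc layer."""
    return LAYERS_PER_BLOCK[block_type] * sum(blocks) + 2


def cifar_depth(n: int) -> int:
    """CIFAR-10 variant: 2n layers at each of 3 map sizes, plus first conv and fc."""
    return 6 * n + 2


for depth, (block_type, blocks) in LAYOUTS.items():
    print(depth, imagenet_depth(block_type, blocks))
print([cifar_depth(n) for n in (3, 5, 7, 9, 18, 200)])
```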
+
+## Key design choices
+- **VGG-style design rules** (§3.3): same output map size ⇒ same #filters; halving spatial size ⇒ doubling #filters.
+- **Down-sampling by stride-2 conv**, not by pooling, in `conv3_1`, `conv4_1`, `conv5_1`.
+- **3×3 convs throughout the basic blocks**; bottleneck blocks use 1×1 convs to reduce and then restore the channel dimension, so the 3×3 conv operates on a smaller input/output width.
+- **No dropout** anywhere (§3.4).
+- **BN after every conv, before activation** (§3.4).
@@ -0,0 +1,42 @@
+# Constraints and Limitations
+
+## Dimension-matching constraints
+- Eqn. (1) (`y = F(x) + x`) is only well-defined when `dim(F(x)) == dim(x)`. When stages change spatial size or channel count, one of two strategies is required:
+  - **Option A (identity + zero-pad)**: down-sample by stride-2 sampling on the shortcut and zero-pad the extra channels — *no learnable parameters*.
+  - **Option B/C (projection)**: 1×1 conv with stride 2 (when down-sampling), introducing parameters `C_in × C_out`.
+- Within a stage (no spatial or channel change), identity shortcuts are always used.
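
To make the Option A strategy concrete, the sketch below applies it to a feature map stored as nested Python lists indexed `[channel][row][col]`. It is a minimal illustration under that assumed layout, not the repository's PyTorch implementation; `option_a_shortcut` is a hypothetical name.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # assumed layout: [channel][row][col]


def option_a_shortcut(x: FeatureMap, out_channels: int, stride: int = 2) -> FeatureMap:
    """Parameter-free shortcut: strided spatial sampling, then zero-padded channels."""
    if out_channels < len(x):
        raise ValueError("Option A can only add channels, not remove them.")
    # Keep every `stride`-th row and column of each existing channel.
    sampled = [[row[::stride] for row in channel[::stride]] for channel in x]
    height, width = len(sampled[0]), len(sampled[0][0])
    # New channels are all zeros, so they carry no signal from the input.
    padding = [
        [[0.0] * width for _ in range(height)]
        for _ in range(out_channels - len(x))
    ]
    return sampled + padding


x = [[[1.0, 2.0, 3.0, 4.0],
      [5.0, 6.0, 7.0, 8.0],
      [9.0, 10.0, 11.0, 12.0],
      [13.0, 14.0, 15.0, 16.0]]]          # 1 channel, 4x4
print(option_a_shortcut(x, out_channels=2))
```

The 4×4 single-channel input becomes a 2×2 map with two channels, and the second channel is entirely zero, so on that channel the block output comes from the residual function alone.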
+
+## Shortcut-design caveats
+- **Single-layer F is rejected.** When `F` has only one weighted layer, Eqn. (1) is similar to a linear layer (`y = W1 x + x`), for which the paper has "not observed advantages"; it uses ≥2 layers in `F` (§3.2).
+- **Option A has a residual-learning blind spot at dimension changes.** Zero-padded extra channels carry "no residual learning" — the paper attributes the small Option B>A gap on ImageNet to this fact.
+- **Option C is not worth its cost.** Although marginally better than B (24.19 vs. 24.52 ResNet-34 top-1), C adds 13 extra projection shortcuts and is "not essential for addressing the degradation problem"; the paper rejects C "to reduce memory/time complexity and model sizes" (§"Identity vs. Projection Shortcuts").
+- **Bottleneck blocks particularly require identity shortcuts.** Replacing the bottleneck identity with a projection "doubles" both time complexity and model size.
+
+## Training-recipe constraints
+- **No dropout** anywhere (§3.4).
+- **BN must come right after each conv and before activation** (§3.4).
+- **MSRA initialization** for all conv weights.
+- **Mini-batch size**: 256 for ImageNet, 128 for CIFAR-10.
+- **LR schedule** depends on a *plateau* signal for ImageNet (LR /10 when error plateaus) but a *fixed* step schedule for CIFAR-10 (32k, 48k iters).
+
+## Optimization caveats specific to extreme depth
+- **Warmup is required for the 110-layer CIFAR ResNet.** LR 0.1 from iter 0 is "slightly too large to start converging"; the recipe is LR 0.01 for ~400 iters until training error <80%, then restore LR 0.1 (footnote 5). LR 0.1 from start *eventually* converges to similar accuracy after several epochs of >90% error, but warmup is the chosen recipe.
+- **ResNet-1202 shows no such difficulty.** The paper trains the 1202-layer model "as described above" and reports "no optimization difficulty"; it does not state whether the warmup was applied.
+- **Overfitting bites at extreme depth on small data.** ResNet-1202 (19.4M params on 50k CIFAR images) trains fine but tests at 7.93% vs. ResNet-110's 6.43%; the paper does not apply maxout/dropout/strong regularization in this work.
+
+## Generalization caveats
+- All ImageNet results use 10-crop testing for Tables 2/3 and an additional fully-convolutional multi-scale test for Table 4. Mixing these protocols across rows is not legitimate.
+- The 3.57% top-5 ensemble result (Table 5) is a 6-model ensemble on the test set — not reproducible from a single model.
+
+## Detection-transfer caveats (Appendix A)
+- BN layer statistics are *frozen* during Faster R-CNN fine-tuning to reduce memory consumption.
+- Per-class RPN with binary logistic classification is used for ImageNet localization; for COCO/PASCAL detection, the standard category-agnostic RPN is used.
+- The COCO+PASCAL "baseline+++" rows in Tables 9/10/11 combine box refinement, global context, multi-scale testing, and ensembling on top of ResNet-101; these gains are not solely attributable to the backbone.
+
+## Unverified hypothesis
+- **A1 (open question)**: The paper's residual-learning argument relies on the hypothesis that stacked nonlinear layers can asymptotically approximate any function (and thus also any residual function). Footnote 2 explicitly notes this remains open.
+
+## Out-of-scope
+- The paper does not formally prove that residual learning improves *optimization landscape* properties (it proposes this as a hypothesis supported empirically by Fig. 7).
+- The paper does not study extremely *wide* networks; only depth scaling.
+- No comparison with second-order optimizers; SGD is the only solver studied.
@@ -0,0 +1,43 @@
+# Heuristics
+
+## H01: Default to identity shortcuts (Option A) for parameter-free residual learning
+- **Rationale**: Identity shortcuts add zero parameters and zero FLOPs beyond an element-wise add, so they cleanly isolate the effect of residual learning from increases in capacity. Empirically, A is within ~0.84 top-1 points of the more expensive C on ResNet-34 (25.03 vs. 24.19, Table 3).
+- **Sensitivity**: low — the residual learning gain is dominated by *having* a shortcut, not by which kind.
+- **Bounds**: Plain identity shortcuts apply wherever input/output dimensions match; at dimension changes, Option A falls back to stride-2 sampling plus zero-padding of the new channels. For deeper bottleneck nets, default to Option B (projection only on dimension changes) to avoid the "no residual learning at the dimension change" blind spot of A.
+- **Code ref**: [src/execution/residual_block.py](../../src/execution/residual_block.py)
+- **Source**: §3.2; §"Identity vs. Projection Shortcuts"; Table 3.
+
+## H02: Place BN right after every conv and before the activation, no dropout
+- **Rationale**: BN keeps forward-signal variance non-zero in deep stacks (rules out vanishing-signal explanations for any residual dynamics). Combining BN with no dropout simplifies the picture and lets the depth ablation be clean.
+- **Sensitivity**: not ablated in the paper. Every plain and residual net is trained with BN, and the plain nets still degrade with depth, so no without-BN comparison is reported. BN ↔ activation order is held fixed at "BN before ReLU" throughout.
+- **Bounds**: Applies to ImageNet and CIFAR-10 training. For Faster R-CNN fine-tuning (Appendix A), BN statistics are *frozen* (BN behaves as an affine transform) to save memory.
+- **Code ref**: [src/execution/residual_block.py](../../src/execution/residual_block.py)
+- **Source**: §3.4 "Implementation"; Appendix A.
+
+## H03: Warm up the LR for the 110-layer CIFAR ResNet
+- **Rationale**: At depth 110, LR 0.1 from iter 0 is "slightly too large to start converging" cleanly. Pre-warming at LR 0.01 for ~400 iterations until training error drops below ~80% lets the optimizer enter a basin where LR 0.1 then trains stably.
+- **Sensitivity**: medium for the 110-layer CIFAR variant (controls whether early training stalls); low for ResNet-1202 (the paper notes no optimization difficulty there).
+- **Bounds**: Trigger only when very deep CIFAR ResNets fail to start converging at the default LR. Footnote 5 notes that with LR 0.1 from the start the net "starts converging (<90% error) after several epochs, but still reaches similar accuracy", so warmup is a stability heuristic, not a fundamental requirement.
+- **Code ref**: [src/execution/training_recipe.py](../../src/execution/training_recipe.py)
+- **Source**: §4.2 paragraph on n=18; footnote 5.
+
+## H04: Use bottleneck blocks (1×1 → 3×3 → 1×1) once depth exceeds ~50 layers
+- **Rationale**: A 3-layer bottleneck block has per-block time complexity similar to that of a 2-layer 3×3 block but lets the 3×3 conv operate on a low-dimensional bottleneck. This keeps the 50-layer ResNet at FLOPs comparable to the 34-layer non-bottleneck net (3.8 vs. 3.6 GFLOPs) and makes the 101- and 152-layer ResNets tractable.
+- **Sensitivity**: high — at large depths, dropping bottlenecks would substantially raise compute and memory.
+- **Bounds**: Use only with identity shortcuts on the high-dimensional ends; replacing those identities with projections doubles complexity and model size.
+- **Code ref**: [src/execution/residual_block.py](../../src/execution/residual_block.py)
+- **Source**: §"Deeper Bottleneck Architectures"; Fig. 5; Table 1.
+
+## H05: Down-sample by stride-2 convolutions, not pooling
+- **Rationale**: Putting the stride on the first convolution of each stage (`conv3_1`, `conv4_1`, `conv5_1`) folds spatial reduction into a learnable layer and matches the VGG-style "halve resolution ⇒ double channels" rule, keeping per-layer time complexity roughly constant across stages.
+- **Sensitivity**: low — a design convention rather than a tuned trick; ResNets are not reported to be sensitive to swapping pooling for strided conv at these stages.
+- **Bounds**: Applies to the residual stages on ImageNet (and the analogous 32×32 → 16×16 → 8×8 progression on CIFAR-10). The 3×3 max-pool stride-2 in `conv1` is the only pooling used.
+- **Code ref**: [src/execution/residual_block.py](../../src/execution/residual_block.py)
+- **Source**: §3.3 design rules; Table 1.
+
+## H06: Match shortcut down-sampling to the residual function's stride
+- **Rationale**: When the residual function does stride-2 down-sampling, the shortcut must do the same — by stride-2 sampling (Option A) or stride-2 1×1 conv (Option B/C). Otherwise the element-wise add fails on shape, or worse, silently mis-aligns features.
+- **Sensitivity**: high — wrong shortcut stride breaks the block.
+- **Bounds**: Only triggers at stage boundaries (`conv3_1`, `conv4_1`, `conv5_1`).
+- **Code ref**: [src/execution/residual_block.py](../../src/execution/residual_block.py)
+- **Source**: §"Residual Network" paragraph in §3.3 ("when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2").
@@ -0,0 +1,47 @@
+# Concrete config for ResNet-34 (option A) on ImageNet 2012.
+# Source: §3.3, §3.4, Table 1, Table 2 of the paper.
+model:
+  name: resnet34
+  block: basic            # two 3x3 convs per residual block
+  shortcut_option: A      # identity + zero-padding for dim changes (parameter-free)
+  stem:
+    conv: {kernel: 7, stride: 2, out_channels: 64}
+    pool: {kernel: 3, stride: 2, type: max}
+  stages:                 # block counts and base widths per stage
+    - {name: conv2_x, blocks: 3, width: 64,  downsample: false}
+    - {name: conv3_x, blocks: 4, width: 128, downsample: true}
+    - {name: conv4_x, blocks: 6, width: 256, downsample: true}
+    - {name: conv5_x, blocks: 3, width: 512, downsample: true}
+  head: {pool: global_avg, fc: 1000, softmax: true}
+  flops: 3.6e9            # Table 1
+  params_match_plain34: true
+
+train:
+  dataset: imagenet2012
+  augmentation:
+    short_side_range: [256, 480]
+    crop: 224
+    horizontal_flip: true
+    color_aug: krizhevsky_2012
+    pixel_mean_subtraction: true
+  optimizer:
+    type: sgd
+    momentum: 0.9
+    weight_decay: 1.0e-4
+  batch_size: 256
+  init: msra              # He et al. 2015, ref [13]
+  lr_schedule:
+    initial: 0.1
+    rule: divide_by_10_on_plateau
+  max_iter: 600000        # "up to 60 x 10^4"
+  bn:
+    placement: after_conv_before_relu
+    momentum: not_specified_in_paper
+  dropout: 0.0            # not used
+
+eval:
+  protocol: 10_crop       # for Tables 2 and 3
+  alt_protocol_table4: fully_convolutional_multi_scale
+  fully_conv_scales: [224, 256, 384, 480, 640]
+  reported_top1: 25.03    # Table 2 (10-crop, validation)
+  reported_top5: 7.76     # Table 3 row "ResNet-34 A" (10-crop, validation)
@@ -0,0 +1,54 @@
+# Model Configuration
+
+All depths share the same five-stage skeleton (Table 1 / Fig. 3).
+
+## Stem (`conv1`)
+- **Value**: 7×7 conv, 64 filters, stride 2 → BN → ReLU → 3×3 max-pool, stride 2.
+- **Rationale**: Not discussed in the paper; fixed across ResNet-{18, 34, 50, 101, 152}.
+- **Source**: Table 1; §3.3 / §3.4.
+
+## ResNet-18 — Block layout
+- **Value**: Basic block (two 3×3 convs); per-stage block counts {2, 2, 2, 2}; widths {64, 128, 256, 512}; FLOPs 1.8×10⁹.
+- **Source**: Table 1.
+
+## ResNet-34 — Block layout
+- **Value**: Basic block; per-stage counts {3, 4, 6, 3}; widths {64, 128, 256, 512}; FLOPs 3.6×10⁹.
+- **Rationale**: Designed to match the FLOPs of plain-34 (3.6×10⁹ ≈ 18% of VGG-19's 19.6×10⁹).
+- **Source**: Table 1; §3.3.
+
+## ResNet-50 — Block layout
+- **Value**: Bottleneck block (1×1 → 3×3 → 1×1); per-stage counts {3, 4, 6, 3}; bottleneck widths {64, 128, 256, 512}, output widths {256, 512, 1024, 2048}; FLOPs 3.8×10⁹.
+- **Rationale**: Replace each 2-layer block in ResNet-34 with a 3-layer bottleneck of comparable per-block time complexity.
+- **Source**: Table 1; §"Deeper Bottleneck Architectures".
+
+## ResNet-101 — Block layout
+- **Value**: Bottleneck; per-stage counts {3, 4, 23, 3}; FLOPs 7.6×10⁹.
+- **Source**: Table 1.
+
+## ResNet-152 — Block layout
+- **Value**: Bottleneck; per-stage counts {3, 8, 36, 3}; FLOPs 11.3×10⁹.
+- **Rationale**: Deepest single model; still 57% of VGG-19's FLOPs.
+- **Source**: Table 1; §4.1.
+
+## Shortcut option
+- **Value**:
+  - Table 2, Fig. 4 (right): Option A (identity + zero-pad) for both 18- and 34-layer ResNets; the CIFAR-10 ResNets (Fig. 6) also use Option A.
+  - 50/101/152-layer ResNets in Tables 3, 4, and 5: Option B (projection only on dimension changes).
+  - Option C (projection on every shortcut) is studied as ablation only and not used in deeper models.
+- **Rationale**: A is parameter-free and isolates residual-learning gain; B is preferred at depth because dimension-change shortcuts are rarer and projections elsewhere are expensive in bottleneck blocks.
+- **Sensitivity**: low (≤0.84 top-1 points between A/B/C on ResNet-34: 25.03 / 24.52 / 24.19).
+- **Source**: §3.3 "Residual Network"; §"Identity vs. Projection Shortcuts"; Table 3.
+
+## Activation
+- **Value**: ReLU after every (conv, BN) pair except the last one in each block, whose ReLU is applied after the residual sum (Fig. 2).
+- **Source**: §3.2 — "We adopt the second nonlinearity after the addition (i.e., σ(y))."
+
+## Head
+- **Value**: Global average pooling → 1000-way fc → softmax (ImageNet); 10-way fc → softmax (CIFAR-10).
+- **Source**: §3.3.
+
+## CIFAR-10 architecture
+- **Value**: First layer 3×3 conv, 16 filters, then `2n` layers each at three feature-map sizes {32×32, 16×16, 8×8} with widths {16, 32, 64}. Total weighted layers = `6n + 2`. Down-sampling by stride-2 conv. Identity shortcuts everywhere (Option A) → ResNets have *exactly* the same depth/width/parameters as plain counterparts.
+- **Studied n**: {3, 5, 7, 9, 18, 200} ⇒ depths {20, 32, 44, 56, 110, 1202}.
+- **Param counts (Table 6)**: 0.27M, 0.46M, 0.66M, 0.85M, 1.7M, 19.4M for ResNets at those depths.
+- **Source**: §4.2; Table 6.
@@ -0,0 +1,114 @@
+# Training Configuration
+
+ImageNet recipe (§3.4 "Implementation"); CIFAR-10 recipe (§4.2).
+
+## ImageNet — Mini-batch size
+- **Value**: 256
+- **Rationale**: Standard ILSVRC-era batch for SGD with momentum on multi-GPU setups.
+- **Search range**: Not specified in paper.
+- **Sensitivity**: low (within typical ImageNet ranges).
+- **Source**: §3.4.
+
+## ImageNet — Initial learning rate
+- **Value**: 0.1
+- **Rationale**: SGD baseline with BN; divided by 10 when error plateaus.
+- **Search range**: Not specified in paper.
+- **Sensitivity**: not reported for ImageNet. The only LR sensitivity the paper describes is on CIFAR-10, where 0.1 from iter 0 is slightly too large for the 110-layer ResNet to start converging (see warmup).
+- **Source**: §3.4.
+
+## ImageNet — LR schedule
+- **Value**: Step (÷10 when the error plateaus; the paper does not say which error is monitored)
+- **Rationale**: Standard schedule for the era; trains for up to 60×10⁴ iterations.
+- **Search range**: Not specified.
+- **Sensitivity**: low.
+- **Source**: §3.4.
+
+## ImageNet — Total iterations
+- **Value**: up to 60×10⁴
+- **Rationale**: Continue training through plateau-triggered LR drops.
+- **Search range**: Not specified.
+- **Sensitivity**: low.
+- **Source**: §3.4.
+
+## ImageNet — Momentum
+- **Value**: 0.9
+- **Rationale**: Standard SGD momentum.
+- **Sensitivity**: low.
+- **Source**: §3.4.
+
+## ImageNet — Weight decay
+- **Value**: 0.0001 (= 1e-4)
+- **Rationale**: Standard L2 regularization.
+- **Sensitivity**: low.
+- **Source**: §3.4.
+
+## ImageNet — Dropout
+- **Value**: not used (rate = 0)
+- **Rationale**: BN replaces the regularization role of dropout in this recipe (following BN paper convention).
+- **Sensitivity**: low (BN-paper convention).
+- **Source**: §3.4 ("We do not use dropout, following the practice in [16].").
+
+## ImageNet — Initialization
+- **Value**: MSRA (He et al., 2015 — ref [13])
+- **Rationale**: Variance-preserving init for ReLU networks; consistent across plain and residual nets.
+- **Source**: §3.4 ("We initialize the weights as in [13]").
+
+## ImageNet — Data augmentation
+- **Value**: Image resized with shorter side ∈ [256, 480], 224×224 random crop with horizontal flip; standard color augmentation per Krizhevsky et al. (ref [21]); per-pixel mean subtraction.
+- **Rationale**: Scale augmentation per VGG (ref [41]); color augmentation per AlexNet.
+- **Sensitivity**: medium — standard ImageNet augmentation suite.
+- **Source**: §3.4.
+
+## ImageNet — Test-time evaluation
+- **Value**: 10-crop testing for Tables 2/3; fully-convolutional multi-scale testing at scales {224, 256, 384, 480, 640} for Table 4.
+- **Rationale**: Match prior comparisons (10-crop); push best single-model with multi-scale fully-convolutional inference.
+- **Sensitivity**: medium for absolute error, low for relative comparisons.
+- **Source**: §3.4 ("In testing").
+
+## CIFAR-10 — Mini-batch size
+- **Value**: 128
+- **Rationale**: Standard CIFAR mini-batch.
+- **Source**: §4.2.
+
+## CIFAR-10 — Initial learning rate
+- **Value**: 0.1 (with warmup at 0.01 for ~400 iters when n=18, i.e., 110 layers)
+- **Rationale**: 0.1 alone is "slightly too large to start converging" at 110 layers; warmup allows clean convergence.
+- **Sensitivity**: medium (only at extreme depth on CIFAR).
+- **Source**: §4.2; footnote 5.
+
+## CIFAR-10 — LR schedule
+- **Value**: ÷10 at iter 32k and ÷10 again at iter 48k
+- **Rationale**: Fixed step schedule (vs. plateau-triggered on ImageNet).
+- **Source**: §4.2.
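
The CIFAR-10 learning-rate recipe, including the optional warmup for the 110-layer net, can be sketched as a single function of the iteration index. This is an illustration under stated assumptions: the warmup is approximated by a fixed iteration count (the paper switches when training error drops below 80%, at about 400 iterations), the drops take effect at iterations 32,000 and 48,000 counted from zero, and `cifar_lr` with its parameters is a hypothetical name.

```python
def cifar_lr(
    iteration: int,
    base_lr: float = 0.1,
    warmup_iters: int = 0,
    warmup_lr: float = 0.01,
) -> float:
    """Learning rate at a given 0-based iteration of the 64k-iteration schedule."""
    if iteration < warmup_iters:
        return warmup_lr        # assumed fixed-length warmup (error-triggered in the paper)
    if iteration < 32_000:
        return base_lr
    if iteration < 48_000:
        return base_lr / 10     # first drop at 32k iterations
    return base_lr / 100        # second drop at 48k iterations


# ResNet-110 style: warm up for roughly 400 iterations, then follow the step schedule.
for it in (0, 399, 400, 31_999, 32_000, 48_000, 63_999):
    print(it, cifar_lr(it, warmup_iters=400))
```

Leaving `warmup_iters` at 0 gives the plain step schedule used for the shallower CIFAR-10 nets.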
+
+## CIFAR-10 — Total iterations
+- **Value**: 64,000
+- **Rationale**: Determined on a 45k/5k train/val split.
+- **Source**: §4.2.
+
+## CIFAR-10 — Momentum
+- **Value**: 0.9
+- **Source**: §4.2.
+
+## CIFAR-10 — Weight decay
+- **Value**: 0.0001
+- **Source**: §4.2.
+
+## CIFAR-10 — Dropout
+- **Value**: not used.
+- **Source**: §4.2.
+
+## CIFAR-10 — Data augmentation
+- **Value**: 4-pixel padding on each side, random 32×32 crop from the padded image (or its horizontal flip); per-pixel mean subtraction. For testing, only the original 32×32 image is evaluated.
+- **Rationale**: Augmentation recipe from Lee et al. (DSN, ref [24]).
+- **Source**: §4.2.
+
+## CIFAR-10 — GPUs
+- **Value**: 2
+- **Source**: §4.2.
+
+## Detection — Faster R-CNN fine-tuning LR
+- **Value**: 0.001 for 240k iterations, then 0.0001 for 80k iterations.
+- **Rationale**: Standard COCO Faster R-CNN schedule; mini-batch 8 (RPN step) / 16 (Fast R-CNN step) on 8 GPUs.
+- **Sensitivity**: low (matches reference Faster R-CNN).
+- **Source**: Appendix A.
@@ -0,0 +1,28 @@
+# Environment
+
+## Software
+- **Python**: Not specified in paper (paper is from 2015 — likely Python 2.7 era).
+- **Framework**: Not specified in paper. The reference implementation [19] is Caffe (Jia et al., 2014); the paper does not commit to a specific framework. Modern reproductions typically use PyTorch ≥1.5 or TensorFlow ≥2.x.
+- **Inference protocol libraries**: Standard Caffe data layers for color/scale augmentation; per-pixel mean subtraction.
+
+## Hardware
+- **CIFAR-10**: 2 GPUs (§4.2). GPU model not specified.
+- **ImageNet**: GPU count not specified in §3.4. Mini-batch size 256 implies a multi-GPU setup; commonly 8 GPUs at the time.
+- **Detection (Faster R-CNN, Appendix A)**: 8 GPUs; mini-batch 8 images per RPN step (1/GPU), 16 images per Fast R-CNN step.
+
+## Key dependencies (inferred — not enumerated in paper)
+- Standard CNN training stack of the era (Caffe + cuDNN 4/5).
+- For modern reproduction: torch / torchvision / numpy.
+
+## Random seeds
+- Not specified in paper. The 110-layer CIFAR ResNet is run 5 times and reported as "best (mean ± std)" (Table 6 caption: 6.43% best, 6.61 ± 0.16%), so the authors do report run-to-run variation for that depth.
+
+## Data
+- **ImageNet 2012**: 1.28M train / 50k val / 100k test images (1000 classes).
+- **CIFAR-10**: 50k train / 10k test images (10 classes), 32×32. 45k/5k train/val split used to determine the 64k iteration budget.
+- **PASCAL VOC 2007 / 2012**: training sets "07+12" (5k VOC07 trainval + 16k VOC12 trainval, for the VOC07 test set) or "07++12" (10k VOC07 trainval+test + 16k VOC12 trainval, for the VOC12 test set).
+- **MS COCO**: 80k train images for training, 40k val images for evaluation (Appendix A).
+
+## Reproduction notes
+- The paper does not release a code drop in the document itself; modern reference implementations live in `torchvision.models.resnet`.
+- BN momentum, the exact MSRA fan mode, and the threshold that triggers the plateau-based learning-rate decay are unspecified; reproductions typically use BN momentum 0.1 (PyTorch default) and a manual step schedule at iters {300k, 600k} or based on epoch count.
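
Because reproductions often schedule by epoch while the paper counts iterations, a quick conversion helps. The sketch below assumes the rounded 1.28M ImageNet training-set size listed under Data, the mini-batch sizes 256 (ImageNet) and 128 (CIFAR-10), and the full 50k CIFAR-10 training set; `iterations_to_epochs` is a hypothetical helper name.

```python
def iterations_to_epochs(iterations: int, batch_size: int, train_images: int) -> float:
    """Number of passes over the training set covered by `iterations` SGD steps."""
    return iterations * batch_size / train_images


# ImageNet: up to 600k iterations at batch 256 over ~1.28M images (rounded count).
imagenet_epochs = iterations_to_epochs(600_000, 256, 1_280_000)
# CIFAR-10: 64k iterations at batch 128, assuming the full 50k training set.
cifar_epochs = iterations_to_epochs(64_000, 128, 50_000)

print(f"ImageNet: ~{imagenet_epochs:.0f} epochs, CIFAR-10: ~{cifar_epochs:.0f} epochs")
# prints "ImageNet: ~120 epochs, CIFAR-10: ~164 epochs"
```

With the 45k training images of the split used to pick the iteration budget, the CIFAR-10 figure rises to roughly 182 epochs.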
@@ -0,0 +1,148 @@
+"""Residual building blocks from He et al. 2015.
+
+Implements the basic 2-layer block (Fig. 2) and the 3-layer bottleneck block
+(Fig. 5 right) with identity (Option A) and projection (Option B/C) shortcuts.
+
+Only the novel residual-block contribution is shown here: stems, stage stacking,
+and heads are deferred to a higher-level model file.
+"""
+
+from typing import Literal, Optional
+
+import torch
+from torch import Tensor, nn
+
+
+ShortcutOption = Literal["A", "B", "C"]
+
+
+def _conv3x3(in_c: int, out_c: int, stride: int = 1) -> nn.Conv2d:
+    return nn.Conv2d(in_c, out_c, kernel_size=3, stride=stride, padding=1, bias=False)
+
+
+def _conv1x1(in_c: int, out_c: int, stride: int = 1) -> nn.Conv2d:
+    return nn.Conv2d(in_c, out_c, kernel_size=1, stride=stride, bias=False)
+
+
+class _IdentityWithZeroPad(nn.Module):
+    """Option A shortcut: stride-2 sample on spatial axes, zero-pad new channels."""
+
+    def __init__(self, in_c: int, out_c: int, stride: int) -> None:
+        super().__init__()
+        if out_c < in_c:
+            raise ValueError("Option A only widens channels; never narrows them.")
+        self.stride = stride
+        self.extra = out_c - in_c
+
+    def forward(self, x: Tensor) -> Tensor:
+        if self.stride > 1:
+            x = x[:, :, :: self.stride, :: self.stride]
+        if self.extra > 0:
+            pad = x.new_zeros(x.size(0), self.extra, x.size(2), x.size(3))
+            x = torch.cat([x, pad], dim=1)
+        return x
+
+
+def _build_shortcut(
+    in_c: int,
+    out_c: int,
+    stride: int,
+    option: ShortcutOption,
+) -> nn.Module:
+    if in_c == out_c and stride == 1 and option != "C":
+        return nn.Identity()
+    if option == "A":
+        return _IdentityWithZeroPad(in_c, out_c, stride)
+    return nn.Sequential(_conv1x1(in_c, out_c, stride), nn.BatchNorm2d(out_c))
+
+
+class BasicBlock(nn.Module):
+    """ResNet basic block — used in ResNet-18 / ResNet-34."""
+
+    expansion: int = 1
+
+    def __init__(
+        self,
+        in_channels: int,
+        channels: int,
+        stride: int = 1,
+        shortcut: ShortcutOption = "A",
+    ) -> None:
+        super().__init__()
+        out_channels = channels * self.expansion
+        self.conv1 = _conv3x3(in_channels, channels, stride=stride)
+        self.bn1 = nn.BatchNorm2d(channels)
+        self.conv2 = _conv3x3(channels, out_channels, stride=1)
+        self.bn2 = nn.BatchNorm2d(out_channels)
+        self.relu = nn.ReLU(inplace=True)
+        self.shortcut = _build_shortcut(in_channels, out_channels, stride, shortcut)
+
+    def forward(self, x: Tensor) -> Tensor:
+        out = self.relu(self.bn1(self.conv1(x)))
+        out = self.bn2(self.conv2(out))
+        out = out + self.shortcut(x)
+        return self.relu(out)
+
+
+class BottleneckBlock(nn.Module):
+    """ResNet bottleneck block — used in ResNet-50 / ResNet-101 / ResNet-152.
+
+    Layout: 1x1 (reduce) -> 3x3 -> 1x1 (restore). The expansion factor 4 means
+    the output of the block is 4 * `channels` channels deep (Fig. 5 right).
+    """
+
+    expansion: int = 4
+
+    def __init__(
+        self,
+        in_channels: int,
+        channels: int,
+        stride: int = 1,
+        shortcut: ShortcutOption = "B",
+    ) -> None:
+        super().__init__()
+        out_channels = channels * self.expansion
+        self.conv1 = _conv1x1(in_channels, channels)
+        self.bn1 = nn.BatchNorm2d(channels)
+        # Stride sits on the 3x3 conv (the "ResNet v1.5" layout used by torchvision);
+        # the original He et al. models apply it in the first 1x1 conv instead.
+        self.conv2 = _conv3x3(channels, channels, stride=stride)
+        self.bn2 = nn.BatchNorm2d(channels)
+        self.conv3 = _conv1x1(channels, out_channels)
+        self.bn3 = nn.BatchNorm2d(out_channels)
+        self.relu = nn.ReLU(inplace=True)
+        self.shortcut = _build_shortcut(in_channels, out_channels, stride, shortcut)
+
+    def forward(self, x: Tensor) -> Tensor:
+        out = self.relu(self.bn1(self.conv1(x)))
+        out = self.relu(self.bn2(self.conv2(out)))
+        out = self.bn3(self.conv3(out))
+        out = out + self.shortcut(x)
+        return self.relu(out)
+
+
+def make_stage(
+    block: type[nn.Module],
+    in_channels: int,
+    channels: int,
+    blocks: int,
+    stride: int,
+    shortcut: ShortcutOption,
+) -> nn.Sequential:
+    """Stack `blocks` residual blocks; first block handles down-sampling."""
+    layers: list[nn.Module] = [
+        block(in_channels, channels, stride=stride, shortcut=shortcut)
+    ]
+    in_c = channels * block.expansion  # type: ignore[attr-defined]
+    for _ in range(1, blocks):
+        layers.append(block(in_c, channels, stride=1, shortcut=shortcut))
+    return nn.Sequential(*layers)
+
+
+# Per-depth stage layouts from Table 1.
+RESNET_LAYOUTS: dict[str, tuple[type[nn.Module], tuple[int, int, int, int]]] = {
+    "resnet18": (BasicBlock, (2, 2, 2, 2)),
+    "resnet34": (BasicBlock, (3, 4, 6, 3)),
+    "resnet50": (BottleneckBlock, (3, 4, 6, 3)),
+    "resnet101": (BottleneckBlock, (3, 4, 23, 3)),
+    "resnet152": (BottleneckBlock, (3, 8, 36, 3)),
+}
@@ -0,0 +1,109 @@
+"""Training recipe for ImageNet and CIFAR-10 ResNets (He et al. 2015, §3.4 & §4.2).
+
+This file captures the *recipe*, not a runnable trainer: optimizer construction,
+LR schedule (including the 110-layer CIFAR warmup), and the BN-after-conv
+convention. Higher-level data loaders, distributed wrappers, and logging are
+intentionally omitted.
+"""
+
+from dataclasses import dataclass
+from typing import Iterable
+
+import torch
+from torch import nn
+from torch.optim import SGD
+
+
+@dataclass(frozen=True)
+class ImageNetRecipe:
+    """ImageNet recipe from §3.4."""
+
+    batch_size: int = 256
+    initial_lr: float = 0.1
+    momentum: float = 0.9
+    weight_decay: float = 1e-4
+    max_iter: int = 600_000          # "up to 60 x 10^4"
+    lr_decay_factor: float = 0.1     # divide by 10 on plateau
+    use_dropout: bool = False        # explicitly off (BN replaces it)
+
+
+@dataclass(frozen=True)
+class CifarRecipe:
+    """CIFAR-10 recipe from §4.2."""
+
+    batch_size: int = 128
+    initial_lr: float = 0.1
+    momentum: float = 0.9
+    weight_decay: float = 1e-4
+    max_iter: int = 64_000
+    lr_drop_iters: tuple[int, int] = (32_000, 48_000)
+    lr_decay_factor: float = 0.1
+    # Warmup applies only to the 110-layer (n=18) ResNet (footnote 5):
+    warmup_lr: float = 0.01
+    warmup_until_train_err: float = 0.80   # restore initial_lr when train err < 80%
+
+
+def build_optimizer(
+    params: Iterable[nn.Parameter],
+    recipe: ImageNetRecipe | CifarRecipe,
+) -> SGD:
+    """SGD with momentum + weight decay, matching the paper."""
+    return SGD(
+        params,
+        lr=recipe.initial_lr,
+        momentum=recipe.momentum,
+        weight_decay=recipe.weight_decay,
+    )
+
+
+def imagenet_lr(it: int, current_lr: float, plateau: bool, recipe: ImageNetRecipe) -> float:
+    """ImageNet LR rule: drop by 10x whenever validation error plateaus.
+
+    The caller supplies the plateau signal — typically derived from a moving
+    average of validation error.
+    """
+    if plateau:
+        return current_lr * recipe.lr_decay_factor
+    return current_lr
+
+
+def cifar_lr(
+    it: int,
+    train_err: float,
+    needs_warmup: bool,
+    recipe: CifarRecipe,
+) -> float:
+    """Fixed step schedule for CIFAR-10 with optional 110-layer warmup.
+
+    Args:
+        it: current iteration (0-indexed).
+        train_err: latest training error in [0, 1].
+        needs_warmup: True only for the 110-layer (n=18) ResNet (§4.2, footnote 5).
+        recipe: CIFAR recipe.
+    """
+    if needs_warmup and train_err >= recipe.warmup_until_train_err:
+        return recipe.warmup_lr
+    lr = recipe.initial_lr
+    for drop_iter in recipe.lr_drop_iters:
+        if it >= drop_iter:
+            lr *= recipe.lr_decay_factor
+    return lr
+
+
+def msra_init(module: nn.Module) -> None:
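+    """MSRA / Kaiming-normal init for conv weights (He et al. 2015, ref [13]).
+
+    `mode="fan_out"` is an assumption (it matches torchvision); the paper does
+    not state the fan mode.
+    """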
+    for m in module.modules():
+        if isinstance(m, nn.Conv2d):
+            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
+        elif isinstance(m, nn.BatchNorm2d):
+            nn.init.constant_(m.weight, 1.0)
+            nn.init.constant_(m.bias, 0.0)
+
+
+def freeze_bn_stats(model: nn.Module) -> None:
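+    """Freeze BN statistics and affine parameters (Faster R-CNN fine-tuning, Appendix A).
+
+    `model.train()` puts BN layers back in training mode, so call this again
+    after every `model.train()` call.
+    """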
+    for m in model.modules():
+        if isinstance(m, nn.BatchNorm2d):
+            m.eval()
+            for p in m.parameters():
+                p.requires_grad = False
@@ -0,0 +1,195 @@
+# Exploration tree for "Deep Residual Learning for Image Recognition" (He et al., 2015).
+#
+# This tree is a *reconstruction* of the paper's reasoning, not a literal session log.
+# Nodes derived directly from the paper's figures, tables, and text are marked
+# `support_level: explicit` with `source_refs`. Nodes that bridge the narrative
+# (e.g., "the authors then chose to study X") are marked `support_level: inferred`.
+
+tree:
+  - id: N01
+    type: question
+    support_level: explicit
+    source_refs: ["§1 introduction", "Fig. 1"]
+    title: "Is learning better networks as easy as stacking more layers?"
+    description: >-
+      Central research question that frames the entire paper. The introduction
+      explicitly poses this question and uses CIFAR-10 plain-{20, 56} curves
+      (Fig. 1) to motivate why the answer is "no, not in the obvious way."
+    children:
+
+      - id: N02
+        type: experiment
+        support_level: explicit
+        source_refs: ["Fig. 1", "Fig. 6 left"]
+        title: "Plain-net depth scan on CIFAR-10 (depths 20–110)"
+        description: >-
+          Train plain CNNs of depth {20, 32, 44, 56, 110} on CIFAR-10 with
+          BN + MSRA init. Observe whether deeper plain nets improve.
+        result: >-
+          Deeper plain nets degrade. Plain-110 fails badly (>60% error,
+          not displayed in Fig. 6). Plain-56 is worse than plain-20.
+
+      - id: N03
+        type: experiment
+        support_level: explicit
+        source_refs: ["Table 2", "Fig. 4 left"]
+        title: "Plain-net depth scan on ImageNet (18 vs. 34)"
+        description: >-
+          Train plain-{18, 34} on ImageNet with the standard recipe (§3.4).
+        result: >-
+          plain-34 = 28.54% top-1, plain-18 = 27.94% top-1 → degradation
+          confirmed on a large dataset, not specific to CIFAR.
+
+      - id: N04
+        type: dead_end
+        support_level: explicit
+        source_refs: ["§4.1 plain networks discussion"]
+        title: "Vanishing-gradient hypothesis"
+        description: >-
+          A natural first explanation for plain-net degradation. The authors
+          rule it out by noting BN keeps forward signals at non-zero variance
+          and that backward gradient norms are healthy.
+        why_failed: >-
+          With BN in place, signals do not vanish; the plain-34 net still
+          reaches competitive accuracy (28.54% top-1 with 10-crop, Table 3),
+          showing SGD is making progress, just slowly.
+
+      - id: N05
+        type: dead_end
+        support_level: explicit
+        source_refs: ["§4.1 (footnote 3)"]
+        title: "More training iterations alone (3×) on plain nets"
+        description: >-
+          The footnote reports that the authors tried training plain nets
+          for 3× more iterations and still observed the degradation problem.
+        why_failed: >-
+          The degradation is not just slow optimization; longer training
+          does not close the gap.
+
+      - id: N06
+        type: insight
+        support_level: explicit
+        source_refs: ["§1", "§3.1"]
+        title: "Reformulate layers to fit a residual mapping F(x) := H(x) - x"
+        description: >-
+          Key creative leap. Identity mapping should be expressible cheaply,
+          so let stacked nonlinear layers learn the *residual* from identity.
+        children:
+
+          - id: N07
+            type: decision
+            support_level: explicit
+            source_refs: ["§3.2", "Fig. 2"]
+            title: "Use parameter-free identity shortcuts (Eqn. 1)"
+            description: >-
+              Decision to add x via an unparameterized skip connection,
+              rather than a gated or projected one (contrasts with highway
+              networks, RW01).
+            rationale: >-
+              Adds no parameters and no FLOPs, so each ResNet exactly matches its
+              plain counterpart's parameter budget for a clean comparison.
+
+          - id: N08
+            type: experiment
+            support_level: explicit
+            source_refs: ["Table 2", "Fig. 4 right"]
+            title: "Plain vs. ResNet at 18/34 layers on ImageNet"
+            description: >-
+              Train ResNet-{18, 34} (option A shortcuts) under the same
+              recipe as plain-{18, 34}.
+            result: >-
+              ResNet-34 = 25.03% top-1 (3.51 pts better than plain-34, 2.85
+              pts better than ResNet-18). Degradation is removed.
+
+          - id: N09
+            type: experiment
+            support_level: explicit
+            source_refs: ["Table 3", "§\"Identity vs. Projection Shortcuts\""]
+            title: "Shortcut option ablation on ResNet-34 (A vs. B vs. C)"
+            description: >-
+              Ablation comparing parameter-free identity (A), projection on
+              dimension changes only (B), and projection on every shortcut (C).
+            result: >-
+              A = 25.03, B = 24.52, C = 24.19 top-1. Differences are small;
+              option C is rejected as not worth the parameter / memory cost.
+
+          - id: N10
+            type: dead_end
+            support_level: explicit
+            source_refs: ["§3.2"]
+            title: "Single-layer residual function F"
+            description: >-
+              Authors note Eqn. (1) with a single-layer F is similar to a
+              linear layer.
+            why_failed: >-
+              "We have not observed advantages." Authors require F to have
+              ≥2 weighted layers.
+
+          - id: N11
+            type: decision
+            support_level: explicit
+            source_refs: ["§\"Deeper Bottleneck Architectures\"", "Fig. 5 right"]
+            title: "Adopt 1×1 → 3×3 → 1×1 bottleneck blocks for ≥50 layers"
+            description: >-
+              Replace the basic 2-layer block with a 3-layer bottleneck of
+              comparable per-block time complexity.
+            rationale: >-
+              Reaches 50/101/152 layers at 3.8 / 7.6 / 11.3 GFLOPs (ResNet-34: 3.6);
+              even ResNet-152 stays below VGG-16/19 (15.3 / 19.6 GFLOPs).
+
+          - id: N12
+            type: experiment
+            support_level: explicit
+            source_refs: ["Table 3", "Table 4", "Table 5"]
+            title: "Depth scan with bottleneck blocks (50 / 101 / 152 layers)"
+            description: >-
+              Train ResNet-50 / 101 / 152 with option B shortcuts under the
+              same ImageNet recipe.
+            result: >-
+              Top-1 error drops monotonically (22.85 → 21.75 → 21.43);
+              6-model ensemble achieves 3.57% top-5 on the test set,
+              winning ILSVRC 2015.
+            also_depends_on: [N09, N11]
+
+          - id: N13
+            type: experiment
+            support_level: explicit
+            source_refs: ["Table 6", "Fig. 6", "§4.2"]
+            title: "CIFAR-10 depth scan (20 → 1202 layers)"
+            description: >-
+              Train ResNet-{20, 32, 44, 56, 110, 1202} on CIFAR-10. ResNet-110
+              uses LR warmup (LR 0.01 for ~400 iters until train err <80%,
+              then restore LR 0.1).
+            result: >-
+              Test error decreases through depth 110 (6.43% best,
+              6.61 ± 0.16% mean ± std over 5 runs). ResNet-1202 trains
+              successfully (training error <0.1%) but overfits to 7.93% test error.
+            children:
+
+              - id: N14
+                type: decision
+                support_level: explicit
+                source_refs: ["§4.2 paragraph on n=18", "footnote 5"]
+                title: "Warm up LR for the 110-layer CIFAR ResNet"
+                description: >-
+                  At depth 110, LR 0.1 from iter 0 fails to start converging
+                  cleanly. Pre-warm at LR 0.01 for ~400 iters (until train
+                  error <80%), then restore LR 0.1.
+                rationale: >-
+                  Gets the optimizer into a basin where the standard LR
+                  schedule then trains stably. LR 0.1 from start eventually
+                  reaches similar accuracy after several epochs of >90%
+                  error, but warmup is the chosen recipe.
+
+      - id: N15
+        type: experiment
+        support_level: explicit
+        source_refs: ["Table 7", "Table 8", "Appendix A"]
+        title: "Detection transfer with ResNet-101 (PASCAL VOC, COCO)"
+        description: >-
+          Swap VGG-16 for ResNet-101 in baseline Faster R-CNN; fine-tune
+          on PASCAL VOC and COCO.
+        result: >-
+          PASCAL VOC07 mAP: 73.2 → 76.4. COCO mAP@.5: 41.5 → 48.4. COCO
+          mAP@[.5, .95]: 21.2 → 27.2 (28% relative improvement).
+        also_depends_on: [N12]