# PARA Website Notes

## One-Line Pitch

**Predicting actions in pixel space rather than coordinate space gives you spatial robustness for free and makes video models natural policy backbones.**

## Core Contributions

### Contribution 1: Pixel-Aligned Action Prediction is Inherently Robust to Spatial/Viewpoint Distribution Shift

Standard policy architectures (e.g., ACT) regress actions from a global CLS token — forcing the model to implicitly solve correspondence, geometry, and control in an unstructured output space. PARA decomposes this: predict *where* in the image (2D heatmap), then *how high* (height bins along the ray). Because localization is equivariant to object translation and camera viewpoint changes in the image, PARA generalizes to unseen positions and viewpoints without needing to see them in training.
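The where/how-high decomposition can be sketched as a small decoding step. This is a minimal sketch under assumed conventions (pinhole intrinsics `K`, world-from-camera rotation `R`, argmax decoding, bin-center heights); the paper's actual decoding may differ:

```python
import numpy as np

def recover_3d_target(heatmap, height_logits, K, R, cam_pos, bin_edges):
    """Decode PARA-style pixel-aligned predictions into a 3D point.

    heatmap:       (H, W) spatial distribution over the image ("where")
    height_logits: (B,) scores over height bins along the pixel ray ("how high")
    K:             (3, 3) camera intrinsics
    R:             (3, 3) world-from-camera rotation
    cam_pos:       (3,) camera position in the world frame
    bin_edges:     (B + 1,) world-frame height bin boundaries
    """
    # "Where": take the peak pixel of the heatmap.
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)

    # "How high": take the most likely height bin, use its center.
    b = int(np.argmax(height_logits))
    z = 0.5 * (bin_edges[b] + bin_edges[b + 1])

    # Back-project the pixel to a world-frame ray, then intersect the ray
    # with the horizontal plane at height z.
    ray = R @ np.linalg.inv(K) @ np.array([u, v, 1.0])
    t = (z - cam_pos[2]) / ray[2]
    return cam_pos + t * ray
```

Because the heatmap peak moves with the object in the image, an object translation or viewpoint shift changes `(u, v)` directly rather than forcing the network to re-learn an absolute regression target.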

### Contribution 2: Pixel Alignment Makes Video Models Directly Useful as Action Backbones

Video diffusion models already predict future pixel states. PARA reads off actions in that same pixel-aligned space, requiring far fewer robot demonstrations than global regression. A two-stage recipe (pretrain video model, then jointly fine-tune with PARA heads) achieves 90% success — dramatically outperforming joint-from-scratch (55%) and proving that co-adaptation of video features and PARA heads is essential (frozen backbone → 0%).

---

## Key Results for Website

### Section 1: OOD Object Position Generalization (PARA vs ACT)

**Headline number:** PARA 54% vs ACT 1% on left→right spatial extrapolation (trained on left half of workspace, tested on right half).

Focus on the 6 experiments with the largest PARA-ACT deltas:

| # | Experiment | OOD Axis | PARA | ACT | Delta |
|---|---|---|---|---|---|
| 1 | Left → Right position extrapolation | Object position | **54%** | 1% | **+53%** |
| 2 | Near → Far position extrapolation | Object position | **46%** | 7% | **+39%** |
| 3 | Default → All viewpoints (zero-shot) | Camera viewpoint | **61%** | 24% | **+37%** |
| 4 | Left → Right viewpoint hemisphere | Camera viewpoint | **40%** | 10% | **+30%** |
| 5 | N=32 corner scaling | Data efficiency | **54%** | 33% | **+21%** |
| 6 | Distractor robustness | Visual clutter | **28%** | 10% | **+18%** |

**Key takeaway:** PARA dominates extrapolation (+39–53%) and viewpoint robustness (+30–37%). ACT catches up only with dense coverage (N=64: 68% vs 71%).

**Failure mode contrast:** PARA fails on gripper timing (grasps but drops). ACT fails on reaching entirely wrong locations — it memorizes absolute positions.

**Training setup:** Both models use DINOv2 ViT-S/16 backbone, teleport servo evaluation, zero rotation, clean scene, 10 min training. Object position dataset: 16x16 grid (256 demos) across 39cm x 60cm workspace. Viewpoint dataset: 8x8 viewpoint grid (theta_max=25 deg) x 10 demos per viewpoint.
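The object-position grid can be reproduced in a few lines of numpy. Only the 16x16 / 39cm x 60cm numbers come from the notes; the workspace origin, axis orientation, and which axis the left→right split cuts are assumptions:

```python
import numpy as np

# Assumed workspace bounds in meters; only the 0.39 m x 0.60 m extent
# comes from the notes, the origin is arbitrary.
X_MIN, X_MAX = 0.0, 0.39
Y_MIN, Y_MAX = 0.0, 0.60

def make_position_grid(n=16):
    """Evenly spaced n x n grid of object positions -> (n*n, 2) array."""
    xs = np.linspace(X_MIN, X_MAX, n)
    ys = np.linspace(Y_MIN, Y_MAX, n)
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    return np.stack([gx.ravel(), gy.ravel()], axis=1)

grid = make_position_grid()           # 256 demo positions
left_half = grid[grid[:, 1] < 0.30]   # e.g. a left->right split's 128 train positions
```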

### Section 2: Viewpoint Generalization (Per-Theta Breakdown)

**Headline number:** PARA 61% vs ACT 24% zero-shot across all viewpoints. PARA holds ~62% through theta=18 deg. ACT collapses to 0% after theta=14 deg.

Both models trained at default camera viewpoint (theta=0 deg) with 64 diverse object positions. Evaluated across full 8x8 viewpoint grid (theta up to 25 deg).

| Theta | 0 deg (train) | 3.6 deg | 7.1 deg | 10.7 deg | 14.3 deg | 17.9 deg | 21.4 deg | 25 deg |
|---|---|---|---|---|---|---|---|---|
| PARA | 88% | 79% | 62% | 63% | 62% | 62% | 33% | 38% |
| ACT | 67% | 54% | 42% | 17% | 12% | 0% | 0% | 0% |

**Key takeaway:** This should be a **line chart** on the website: PARA as a plateau at ~62% that degrades gracefully, ACT as a cliff that drops to 0% at 18 deg. Visually memorable.
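A minimal matplotlib sketch for that chart, using the numbers from the table above (the output filename and styling are placeholders):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering for script use
import matplotlib.pyplot as plt

# Per-theta success rates copied from the breakdown table.
theta = [0, 3.6, 7.1, 10.7, 14.3, 17.9, 21.4, 25]
para = [88, 79, 62, 63, 62, 62, 33, 38]
act = [67, 54, 42, 17, 12, 0, 0, 0]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(theta, para, marker="o", label="PARA")
ax.plot(theta, act, marker="s", label="ACT")
ax.axvline(0, color="gray", linestyle=":", linewidth=1)  # training viewpoint
ax.set_xlabel("Camera viewpoint theta (deg)")
ax.set_ylabel("Success rate (%)")
ax.set_ylim(0, 100)
ax.legend()
fig.tight_layout()
fig.savefig("per_theta_success.png", dpi=200)
```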

**Viewpoint hemisphere (Exp 4):** Train on left hemisphere viewpoints (phi=0-135 deg, 32 viewpoints x 10 demos), test on right hemisphere (phi=180-315 deg). PARA 40% vs ACT 10%.

### Section 3: Video Model as Policy Backbone

**Headline number:** Two-stage (video pretrain → joint PARA fine-tune) achieves 90% success in 7K total steps.

| Model | Steps | Success Rate |
|---|---|---|
| Joint from scratch | 10K | 55% |
| Frozen UNet + PARA only | 4K + 12K | 0% |
| **Two-stage: video 4K → joint 3K** | **7K** | **90%** |

**Key takeaway:** PARA heads can ride on video diffusion features, but joint co-adaptation is essential. Video pretraining + PARA gives both better performance and faster convergence.
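The two-stage schedule can be pinned down as a tiny step scheduler. This is a sketch: the loss names are placeholders, and keeping the video loss active during the joint stage is an assumption; only the 4K/3K boundaries come from the table above:

```python
def training_stage(step, video_steps=4_000, joint_steps=3_000):
    """Active losses at a global step of the two-stage recipe.

    Stage 1 (0 <= step < 4K):   video diffusion loss only (pretrain).
    Stage 2 (4K <= step < 7K):  video loss + PARA heads (joint co-adaptation).
    """
    if step < video_steps:
        return {"video_loss": True, "para_loss": False}
    if step < video_steps + joint_steps:
        return {"video_loss": True, "para_loss": True}
    raise ValueError("schedule ends at %d steps" % (video_steps + joint_steps))
```

The frozen-backbone baseline corresponds to never letting `para_loss` gradients reach the UNet, which the 0% row suggests is fatal.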

*TODO: Add Video+Global Regression baseline (data efficiency sweep over N={4,16,64,256} demos) — experiment in progress. `episode_global_*` videos already exist in vid_model media.*

---

## Videos to Include on Website

All backbone media paths relative to: `.agents/reports/backbones/media/`
All vid_model media paths relative to: `.agents/reports/vid_model/media/`

### Must-Have: Side-by-Side Comparisons

Each experiment below has a **sanity pair** (both succeed in-distribution, proves baseline isn't broken) and a **test pair** (PARA succeeds OOD, ACT fails).

#### 1. Left → Right Position Extrapolation (Exp 1, +53%)
**Sanity (in-distribution — both succeed):**
- `para_lr_sanity_ep000_success.mp4` — PARA at far-left position (in train)
- `act_lr_sanity_ep000_success.mp4` — ACT at same position (in train)

**Test (OOD — PARA succeeds, ACT fails):**
- `para_lr_test_ep000_success.mp4` — PARA at far-right position (never seen)
- `act_lr_test_ep000_fail.mp4` — ACT at same far-right position (reaches wrong location)

**Distribution plot:** `exp3_leftright_distribution.png` — green=128 train (left half), blue=20 test (right half)

*Why this is the money shot:* Same test position, one succeeds one fails, and ACT's failure mode (reaching to memorized location) visually explains WHY pixel alignment helps.

#### 2. Near → Far Position Extrapolation (Exp 2, +39%)
**Sanity:**
- `para_nf_sanity2_ep000_success.mp4` — PARA at near position (in train)
- `act_nf_sanity2_ep000_success.mp4` — ACT at same near position (in train)

**Test:**
- `para_nf_test_ep000_success.mp4` — PARA at far position (never seen)
- `act_nf_test_ep000_fail.mp4` — ACT at same far position

**Distribution plot:** `exp3_train_test_distribution.png` — green=128 train (near half), blue=20 test (far half)

#### 3. Zero-Shot Viewpoint Generalization (Exp 3, +37%) — HERO VIDEOS
**Rollout grid videos (most compelling single asset):**
- `vp_default_to_all_para_rollout_grid.mp4` — 5x5 grid of PARA rollouts across viewpoints. Green checks = success, red X = failure. Rows = theta, columns = phi.
- `vp_default_to_all_act_rollout_grid.mp4` — Same grid for ACT. Mostly red X's at non-default viewpoints.

**Polar plot:** `vp_default_to_all_polar_overview.png` — single green train dot at theta=0 deg, blue test dots across full grid, with sample frames.

*Why this is the hero:* Two grids of simultaneous rollouts. PARA is mostly green. ACT is mostly red. Instant visual argument. No explanation needed.

#### 4. Viewpoint Hemisphere (Exp 4, +30%)
**Rollout grids:**
- `vp_lr_para_rollout_grid.mp4` — PARA rollouts on right hemisphere (unseen)
- `vp_lr_act_rollout_grid.mp4` — ACT rollouts on right hemisphere

**Polar plot:** `vp_lr_polar_overview.png` — green=left hemi train, blue=right hemi test
**Frame comparison:** `vp_leftright_hemi_distribution.png` — train vs test frame examples

#### 5. Distractor Robustness (Exp 6, +18%)
- `para_distractor_contrast_ep000_success.mp4` — PARA succeeds with distractors
- `act_distractor_contrast_ep000_fail.mp4` — ACT fails with distractors

**Distribution:** `distractor_robustness_distribution.png` — clean train vs cluttered test

#### 6. Two-Stage Video Policy Rollouts (90% success)
- `episode_v3_twostage_0.mp4` — success
- `episode_v3_twostage_17.mp4` — success (short/clean completion)
- `episode_v3_twostage_7.mp4` — failure (longest episode, likely timeout)

*TODO: Once global regression baseline is complete, add side-by-side:*
- `episode_global_0.mp4`, `episode_global_1.mp4`, `episode_global_7.mp4`, `episode_global_14.mp4`

### Nice-to-Have

7. **Viewpoint inner/outer hemisphere:**
   - `vp_io_para_rollout_grid.mp4` + `vp_io_act_rollout_grid.mp4`
   - `vp_io_polar_overview.png`

8. **Heatmap visualizations (interpretability teaser):**
   - `gt_vs_gen_heatmaps_200_70866a2c3eed1c43bdd6.mp4`
   - `gt_vs_gen_heatmaps_2000_ac08e6dce12d696dc874.mp4`

9. **Video model alignment:**
   - `ood_alignment.mp4` — temporal alignment between generated frames and PARA predictions

10. **Training data reference:**
    - `train_demo_0.mp4`, `train_demo_128.mp4`, `train_demo_255.mp4`

11. **Data scaling progression:**
    - `exp4_n8_distribution.png`, `exp4_n16_distribution.png`, `exp4_n32_distribution.png`, `exp4_n64_distribution.png`

### Skip (not website-worthy)
- Individual 20-episode evals (`episode_v3_20ep_*`) — too many, redundant
- `debug_rollout.mp4` — debug artifact
- DROID sample videos — no results yet
- `ood_0000.mp4`, `ood_0010.mp4` — early debugging
- Older experiment videos (`para_exp3_*`, `act_exp3_*`) — superseded by new sanity/test naming

---

## Figures / Images to Include

### Publication-Quality Distribution Plots (backbone experiments)
- `exp3_leftright_distribution.png` — left→right train/test split overlay on scene
- `exp3_train_test_distribution.png` — near→far split
- `exp4_n32_distribution.png` — N=32 data scaling visualization
- `distractor_robustness_distribution.png` — clean vs cluttered scene comparison

### Viewpoint Polar Plots
- `vp_default_to_all_polar_overview.png` — zero-shot: single train viewpoint vs full test grid
- `vp_lr_polar_overview.png` — hemisphere: left train vs right test
- `vp_io_polar_overview.png` — inner/outer hemisphere split

### Viewpoint Frame Comparisons
- `vp_leftright_hemi_distribution.png` — train vs test frames at different viewpoints

### Video Model
- `frame_0.png`, `frame_2.png` — GT vs SVD-generated video frames

---

## Website Structure

### 1. Hero
- One-line pitch
- Method diagram / Figure 1 (TODO: not yet created)
- Summary results table (6 experiments)

### 2. Method
- Brief PARA formulation: heatmap → height bins → 3D recovery
- Heatmap visualization video (`gt_vs_gen_heatmaps_*.mp4`)

### 3. Result 1 — Spatial Extrapolation
- Distribution plots showing train/test splits
- Side-by-side sanity+test videos for Exp 1 and Exp 2

### 4. Result 2 — Viewpoint Robustness
- Polar plots
- **Rollout grid videos** — PARA grid (mostly green) vs ACT grid (mostly red), playing simultaneously
- Per-theta line chart

### 5. Result 3 — Video as Policy Backbone
- Two-stage training diagram
- Success rate table
- Rollout videos (PARA heads vs global regression, once available)
- Data efficiency sweep chart (TODO: in progress)

### 6. Bonus: Distractor Robustness + Interpretability
- Distractor side-by-side videos
- Heatmap visualization

---

## Open Items

- [ ] Video+Global Regression data efficiency sweep (in progress) — `episode_global_*` videos exist, need full results
- [ ] Method diagram / Figure 1 — not yet created
- [ ] Multi-task results — currently single task only, needed before paper submission
- [ ] Generate per-theta line chart from viewpoint breakdown table
- [ ] Verify `episode_global_*` results and add to Section 3
