# PARA Project Video — Storyboard

## Design Principles

1. **Start with high-level input/output, not technical details.** DUSt3R pitches itself as "images in → geometry out," not as a "point map regression model." PARA should likewise be introduced as "demonstrations in → a policy that generalizes to new viewpoints, positions, and environments."
2. **All visualizations in LIBERO first.** Controlled environment, easy to generate comparisons, consistent rendering.
3. **Show the paradox before the solution.** Features are multiview-consistent but policies aren't. Why?
4. **No narration assumed** — text overlays + visuals should tell the story. Can add voiceover later.

---

## Sequence

### Act 1: The Promise and the Problem (~20s)

**Shot 1: High-level intro (3s)**
Text: *"A robot policy that generalizes to new viewpoints, object positions, and environments"*
Visual: Quick montage — robot succeeding across different setups (PARA working in various conditions)

**Shot 2: Features are multiview-consistent (5-7s)**
- LIBERO scene rendered from 4-6 different camera viewpoints
- Overlay DINO PCA feature visualization on each view
- Same object → same colors in PCA across all viewpoints
- Text: *"Modern image features are multiview-consistent"*
- Key: show the feature map smoothly rotating/transitioning across viewpoints, colors staying consistent on the bowl/plate/robot

**Shot 3: Policy works in-distribution (3s)**
- LIBERO: ACT policy succeeding at default viewpoint + default object position
- Text: *"Standard policies work in the training setup"*

**Shot 4: Policy breaks OOD (5-7s)**
- Same LIBERO task, camera shifted ~15-20°
- ACT policy failing: reaching to the wrong location or freezing
- Same task, object shifted to unseen position
- ACT failing again
- Text: *"But break when anything changes"*
- Show that the DINO features are STILL consistent (split screen: features look fine, but the robot fails)

**Shot 5: The question (2s)**
Text: *"The features generalize. Why doesn't the policy?"*

### Act 2: The Insight (~15s)

**Shot 6: The bottleneck (5s)**
- Animated diagram: spatial feature map → CLS token (compression) → MLP → (x,y,z) coordinates
- Highlight the compression step — rich spatial features squeezed into one vector
- Text: *"Standard policies aggregate spatial features and regress global coordinates"*

**Shot 7: PARA approach — high-level input/output (5s)**
- Same feature map → per-pixel heatmap (where should the gripper be?) → camera geometry → 3D position
- Show the heatmap lighting up at the target pixel
- Text: *"PARA predicts actions in pixel space — local, dense, equivariant"*
- Keep it visual, no equations

**Shot 8: Feature map comparison (5s)**
- Split screen in LIBERO, both at an OOD viewpoint:
  - Left (PARA): feature PCA + heatmap prediction, still correctly locating the target
  - Right (ACT): feature PCA looks fine, but the CLS→MLP output points to the wrong location
- Text: *"Same features. Different action prediction. Different outcome."*

### Act 3: Results (~30s)

**Shot 9: LIBERO spatial extrapolation (5s)**
- Quick cut: distribution plot (train left, test right)
- Rollout grid: PARA mostly green, ACT mostly red
- Text: *"Train left, test right. 54% vs 1%."*

**Shot 10: LIBERO viewpoint robustness (5s)**
- Per-theta chart animating: PARA holds, ACT drops
- Rollout grid comparison
- Text: *"Zero-shot viewpoint transfer. 61% vs 24%."*

**Shot 11: Real robot results (10s)**
- Video wall: all three tasks, PARA vs ACT side-by-side
- Pick-and-place, fold towel, wipe table
- Text: *"Real robot. 20 demonstrations. 97% vs 9%."*

**Shot 12: New environment (5s)**
- Side-by-side: PARA succeeding vs ACT failing in a completely new room
- Text: *"New environment. Never seen. 94% vs 0%."*

**Shot 13: Video backbone (5s)**
- Rollout grids: PARA vs global regression on same video features
- Text: *"Video model + PARA: 92%. Video model + global regression: 0%."*

### Closing (3s)

**Shot 14: Title card**
- PARA: Pixel-Aligned Robot Actions
- Project page URL
- Paper link (when ready)

---

## Visualizations to Generate in LIBERO

### Priority 1: Feature consistency across viewpoints
- Render LIBERO task 0 scene from 6-8 viewpoints (using the viewpoint grid we already have)
- Extract DINO ViT-S/16 features at each viewpoint
- Compute PCA across all views jointly (so colors are consistent)
- Overlay PCA visualization on each rendered frame
- Output: a smooth video panning across viewpoints showing features stay consistent

### Priority 2: Policy comparison at OOD viewpoint
- Pick one clear OOD viewpoint (e.g., theta=15°, where PARA succeeds and ACT fails)
- Record rollout of PARA succeeding
- Record rollout of ACT failing
- Side-by-side video with success/failure labels
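
The side-by-side composition above can be sketched as follows. This is a minimal illustration, assuming each rollout is a list of equally sized RGB `uint8` frames; the function names are hypothetical, and a colored bar (green for success, red for failure) stands in for the text labels, which are easier to add in the video editor.

```python
import numpy as np

GREEN, RED = (0, 200, 0), (220, 0, 0)  # RGB colors for the status bar


def add_status_bar(frame, success, bar_px=12):
    """Paint a colored bar along the top edge: green for success, red for failure."""
    out = frame.copy()
    out[:bar_px] = GREEN if success else RED
    return out


def side_by_side(left_frames, right_frames, left_success, right_success):
    """Pair two rollouts frame by frame; the shorter one holds its last frame."""
    n = max(len(left_frames), len(right_frames))
    combined = []
    for i in range(n):
        left = left_frames[min(i, len(left_frames) - 1)]
        right = right_frames[min(i, len(right_frames) - 1)]
        combined.append(np.hstack([add_status_bar(left, left_success),
                                   add_status_bar(right, right_success)]))
    return combined


# Example with blank stand-in frames (a 5-frame PARA rollout, a 3-frame ACT rollout)
para = [np.zeros((128, 128, 3), dtype=np.uint8) for _ in range(5)]
act = [np.zeros((128, 128, 3), dtype=np.uint8) for _ in range(3)]
frames = side_by_side(para, act, left_success=True, right_success=False)
```

Holding the last frame of the shorter rollout keeps both panels on screen for the full clip, which matters when ACT freezes or terminates early.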

### Priority 3: Feature map PCA comparison — PARA vs ACT
- At the same OOD viewpoint, during inference:
  - PARA: show the spatial feature map PCA + the heatmap prediction overlaid on the image
  - ACT: show the spatial feature map PCA + indicate where the CLS regression output projects to in the image
- What this should convey: the features look similar, but PARA uses them locally (heatmap at the right pixel) while ACT uses them globally (regression to the wrong 3D point)
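
To mark where ACT's regressed 3D target lands in the image, the point can be projected with the render camera's pinhole model. A minimal sketch, assuming a 3x3 intrinsics matrix `K` and a 4x4 world-to-camera transform `T_cam_world` in the OpenCV axis convention (x right, y down, z forward). The function name, variable names, and camera values are hypothetical; simulator cameras often use OpenGL-style axes instead, so an axis flip may be needed before this step.

```python
import numpy as np


def project_point(p_world, K, T_cam_world):
    """Project a 3D world point to pixel coordinates (u, v) with a pinhole camera.

    Assumes OpenCV-style camera axes (x right, y down, z forward).
    Returns None if the point lies behind the camera.
    """
    p_h = np.append(np.asarray(p_world, dtype=np.float64), 1.0)  # homogeneous (4,)
    p_cam = (T_cam_world @ p_h)[:3]                              # point in camera frame
    if p_cam[2] <= 0:
        return None
    uvw = K @ p_cam
    return uvw[0] / uvw[2], uvw[1] / uvw[2]


# Example with made-up camera parameters (not LIBERO's actual values)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_world = np.eye(4)
u, v = project_point([0.1, -0.05, 1.0], K, T_cam_world)
```

The returned pixel can then be drawn as a marker on the ACT panel, next to PARA's heatmap peak on the other panel.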

### Priority 4: In-distribution success (both work)
- Default viewpoint, default object position
- Both PARA and ACT succeeding
- Quick clip to establish "both work when conditions match training"

### Priority 5 (stretch): Cross-embodiment
- If LIBERO supports different robot models, show PARA generalizing across embodiments
- Or show point tracks transferring across embodiments
- This would be a bonus result, not required for the video

---

## Technical Notes

### DINO PCA Visualization
```python
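import cv2
import numpy as np
import torch
from sklearn.decomposition import PCA

# Extract last-layer tokens. Assumes `image` is a preprocessed (1, 3, H, W) tensor
# with H and W divisible by the patch size (16 for ViT-S/16).
with torch.no_grad():
    tokens = dino_model.get_intermediate_layers(image, n=1)[0]  # (1, 1 + N_patches, C)

# DINO (v1) returns the CLS token first; drop it to keep only patch tokens.
# DINOv2 omits CLS by default, so skip the [1:] slice there.
features = tokens[0, 1:].cpu().numpy()  # (N_patches, C)

# Project patch features onto the top 3 principal components
pca = PCA(n_components=3)
pca_features = pca.fit_transform(features)
pca_rgb = pca_features.reshape(H_patches, W_patches, 3)

# Normalize to [0, 1] for visualization
pca_rgb = (pca_rgb - pca_rgb.min()) / (pca_rgb.max() - pca_rgb.min())

# Upsample to image resolution (cv2.resize takes (width, height))
pca_rgb = cv2.resize(pca_rgb.astype(np.float32), (W, H))

# Blend with the original image; assumes image_normalized is (H, W, 3) float in [0, 1]
overlay = 0.5 * image_normalized + 0.5 * pca_rgb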
```

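For consistent colors across viewpoints: fit PCA once on the patch features of all viewpoints concatenated, then transform each view individually. Normalize with a min/max shared across views as well; per-view normalization would shift the colors between viewpoints.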
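
A minimal sketch of the joint fit, assuming each viewpoint's patch features are already available as an `(N_patches, C)` NumPy array. The function name `joint_pca_rgb` is hypothetical, and the random arrays only stand in for real DINO features.

```python
import numpy as np
from sklearn.decomposition import PCA


def joint_pca_rgb(features_per_view, grid_hw):
    """Fit one PCA over all views and return a list of (H_p, W_p, 3) maps in [0, 1].

    features_per_view: list of (N_patches, C) arrays, one per viewpoint.
    grid_hw: (H_patches, W_patches) with H_patches * W_patches == N_patches.
    """
    h, w = grid_hw
    stacked = np.concatenate(features_per_view, axis=0)
    pca = PCA(n_components=3).fit(stacked)  # single basis shared by every view

    projected = [pca.transform(f) for f in features_per_view]

    # Shared per-component min/max so the color scale matches across views
    all_proj = np.concatenate(projected, axis=0)
    lo, hi = all_proj.min(axis=0), all_proj.max(axis=0)
    return [((p - lo) / (hi - lo + 1e-8)).reshape(h, w, 3) for p in projected]


# Example: 6 views, a 14x14 patch grid, 384-dim features (the ViT-S embedding width)
rng = np.random.default_rng(0)
views = [rng.normal(size=(14 * 14, 384)).astype(np.float32) for _ in range(6)]
maps = joint_pca_rgb(views, (14, 14))
```

Each returned map can be upsampled and blended exactly like the single-view case. Note that this sketch normalizes each PCA component separately, which usually gives more saturated colors than a single global min/max.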

### LIBERO Viewpoint Rendering
We already have the viewpoint grid infrastructure from the OOD experiments. Use:
```bash
python eval.py --model_type para --checkpoint CKPT \
    --cam_theta THETA --cam_phi PHI \
    --save_video --save_features  # --save_features saves intermediate features (add the flag to eval.py if it is missing)
```

### Heatmap Visualization
Extract the volume head output before the argmax, take the max over the height-bin axis to get a 2D heatmap, and overlay it on the image with a colormap (e.g., jet or magma).

---

## Video Production Notes

- **Resolution:** 1920x1080 or 1280x720
- **Frame rate:** 30fps for smooth playback
- **Video codec:** H.264 for web compatibility
- **Duration target:** 60-90 seconds total
- **Music:** Optional, subtle background. Not required.
- **Text:** Clean sans-serif (Source Sans or similar), white on dark or dark on light depending on background
- **Transitions:** Simple cuts or cross-fades. No flashy transitions.
- **Autoplay on website:** First 5-10 seconds should work as a silent, autoplay hero loop
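
One way to hit the resolution, frame rate, and codec targets above is to encode rendered frames with ffmpeg. A minimal sketch, assuming frames are saved as numbered PNGs and ffmpeg is built with libx264; the function names, frame pattern, and output filename are hypothetical.

```python
import subprocess


def build_ffmpeg_cmd(frame_pattern, out_path, fps=30, width=1920, height=1080, crf=18):
    """Return an ffmpeg argument list that encodes numbered frames to H.264 MP4."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),          # input frame rate
        "-i", frame_pattern,
        "-vf", f"scale={width}:{height}",
        "-c:v", "libx264",               # H.264 video
        "-pix_fmt", "yuv420p",           # widest browser/player compatibility
        "-crf", str(crf),                # quality: lower is better, larger file
        "-movflags", "+faststart",       # lets playback start before full download
        "-an",                           # no audio track (silent autoplay loop)
        out_path,
    ]


def encode(frame_pattern, out_path, **kwargs):
    """Run the encode; requires ffmpeg on PATH."""
    subprocess.run(build_ffmpeg_cmd(frame_pattern, out_path, **kwargs), check=True)


# Example: build (but do not run) the command for a 1080p, 30 fps export
cmd = build_ffmpeg_cmd("frames/%05d.png", "para_video.mp4")
print(" ".join(cmd))
```

Drop `-an` for the full cut if background music is added; it is only needed for the silent hero loop.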
