# PARA Keynote Presentation Notes

## Talk Structure (12-15 minutes)

Lead with the problem, make the insight feel obvious, then overwhelm with results. Videos are the main evidence, not tables.

---

## Slide-by-Slide Breakdown

### Slide 1: Title
**PARA: Pixel-Aligned Robot Actions**
*Are More Data-Efficient and Robust to New Viewpoints and Environments*

- Hero image: `para_overview.png` (few demos → robust policies)

---

### Slide 2: The Promise
**"Modern vision features are really good"**

- DINO/SigLIP feature visualizations — same object matched across viewpoints
- These features are multiview-consistent, spatially aware, and pretrained on web-scale image data
- Set expectation: policies built on these should be robust... right?

---

### Slide 3: The Reality (ATTENTION GRABBER)
**"But robot policies still break"**

- Side-by-side video: ACT failing when camera moves + ACT failing when object shifts
- Videos: `oodview_act_baseline_8x.mp4`, `act_newenv_8x.mp4`
- Big text: **"Same task. Same model. Just moved the camera."**

---

### Slide 4: Why?
**"The problem isn't the features — it's the action head"**

- Diagram: Image → DINO features (spatial, rich) → CLS token → MLP → (x, y, z)
- Highlight the bottleneck: global aggregation + coordinate regression
- Sparse supervision: 1 label per image
- Must learn the camera × scene × robot mapping from limited demos
- Figure: left panel of `para_overview_comparison_trainingdata.png`

---

### Slide 5: The Wrong Fix
**"Why not go 3D?"**

Quick slide — three strikes:
1. Needs depth/multi-view hardware
2. Doesn't help with out-of-distribution (OOD) object positions (several works have shown this)
3. Trains from scratch, losing pretrained 2D features

*"The 2D features are already multiview-consistent. We don't need to reconstruct 3D — we need to use 2D features correctly."*

---

### Slide 6: The Intuition (KEY SLIDE — spend time here)
**"It's just keypoint detection"**

Split screen:
- Left: "How would you detect a keypoint from few examples?" → obviously a spatial heatmap
- Right: "How does the field predict robot actions?" → pool features + regress coordinates

Punchline: *"Robot action prediction IS localization. We've been doing it wrong."*
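
A toy sketch of the split-screen argument, usable as a backup slide. It assumes a single-channel 16 × 16 map standing in for image features, and it uses global average pooling as a stand-in for CLS-token aggregation. That stand-in is an assumption: real attention pooling with position embeddings is not strictly shift-invariant, so treat this as intuition only. The names `global_pool` and `heatmap_argmax` are illustrative and not taken from the PARA code.

```python
import numpy as np

def global_pool(feature_map):
    """Average over all spatial positions (stand-in for global aggregation)."""
    return float(feature_map.mean())

def heatmap_argmax(heatmap):
    """Return the (u, v) pixel of the heatmap peak; u is the column, v is the row."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(u), int(v)

# One bright "object" at column 3, row 4.
features = np.zeros((16, 16))
features[4, 3] = 1.0

# Move the object 5 pixels right and 2 pixels down.
moved = np.roll(features, shift=(2, 5), axis=(0, 1))

# The pooled value is identical before and after the move ...
print(global_pool(features) == global_pool(moved))
# ... while the heatmap peak moves with the object.
print(heatmap_argmax(features), heatmap_argmax(moved))
```

In this toy setting the pooled value carries no trace of where the object went, while the spatial readout tracks it pixel for pixel.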

---

### Slide 7: PARA Method
**"Predict where in the image, lift with geometry"**

- Figure: `para_method_overview.png`
- Step 1: Per-pixel heatmap volume (where in image + how high along ray)
- Step 2: Argmax → (u, v, height)
- Step 3: Camera geometry → 3D position
- Step 4: Gripper/rotation from features at predicted pixel
- No depth sensor, no 3D reconstruction, single RGB camera
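
Steps 1 to 3 above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the PARA implementation: a pinhole camera with known intrinsics `K` and world-to-camera extrinsics `R`, `t` (so that `x_cam = R @ x_world + t`), a world frame whose z axis points up, and a heatmap volume laid out as `(num_heights, H, W)`. The function names, the volume layout, and the example camera are all hypothetical. Step 4 (gripper and rotation) is omitted.

```python
import numpy as np

def argmax_heatmap_volume(volume, height_bins):
    """Steps 1-2: pick the peak of a (num_heights, H, W) volume as (u, v, height)."""
    k, v, u = np.unravel_index(np.argmax(volume), volume.shape)
    return int(u), int(v), float(height_bins[k])

def lift_to_world(u, v, height, K, R, t):
    """Step 3: intersect the camera ray through pixel (u, v) with the plane z = height."""
    cam_center = -R.T @ t  # camera position in world coordinates
    ray_dir = R.T @ np.linalg.solve(K, np.array([u, v, 1.0]))  # ray direction in world coordinates
    if abs(ray_dir[2]) < 1e-9:
        raise ValueError("ray is parallel to the height plane")
    scale = (height - cam_center[2]) / ray_dir[2]
    if scale <= 0:
        raise ValueError("plane intersection lies behind the camera")
    return cam_center + scale * ray_dir

# Hypothetical camera 2 m above the origin, looking straight down.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])
t = np.array([0.0, 0.0, 2.0])

# Fake network output: a single peak at u=420, v=240 in the third height bin.
volume = np.zeros((4, 480, 640))
height_bins = np.linspace(0.0, 0.3, 4)
volume[2, 240, 420] = 1.0

u, v, height = argmax_heatmap_volume(volume, height_bins)
position = lift_to_world(u, v, height, K, R, t)
print(u, v, height, position)
```

The only geometry needed is a ray-plane intersection, which is why a single calibrated RGB camera is enough in this sketch.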

---

### Slide 8: Why Height, Not Depth
**"Height is camera-invariant"**

- Figure: `height_illustration.png`
- Depth changes with every camera move. Height doesn't.
- *"This is why PARA generalizes across viewpoints without seeing them."*
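
A small numerical illustration of the claim, assuming height means the world z coordinate of the target point and depth means its z coordinate in the camera frame. The `look_at` helper and both camera poses are hypothetical, chosen only to make the contrast visible.

```python
import numpy as np

def look_at(cam_pos, target, up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation R and translation t for a camera at cam_pos facing target."""
    cam_pos, target, up = (np.asarray(a, dtype=float) for a in (cam_pos, target, up))
    forward = target - cam_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])  # rows are the camera x, y, z axes in world coordinates
    return R, -R @ cam_pos

def depth_in_camera(point, R, t):
    """Depth is the z coordinate of the point in the camera frame."""
    return float((R @ point + t)[2])

# One target point in the world, observed from two different viewpoints.
point = np.array([0.4, 0.0, 0.1])
R_a, t_a = look_at([1.0, 0.0, 0.5], point)
R_b, t_b = look_at([0.0, 1.5, 1.0], point)

depth_a = depth_in_camera(point, R_a, t_a)
depth_b = depth_in_camera(point, R_b, t_b)
height = float(point[2])  # world-frame quantity, independent of either camera

print(depth_a, depth_b, height)
```

The depth label differs between the two cameras, so a depth-predicting head would need a different target for each viewpoint. The height target is the same number in both.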

---

### Slide 9: Dense Supervision
**"20 demos, millions of training signals"**

- CLS regression: 20 demos → 20 coordinate labels
- PARA: 20 demos × 200K pixels → 4M pixel-level signals
- *"Every pixel in every training image teaches the model something."*
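
One way to picture how a single label becomes a dense training signal, sketched with NumPy. The Gaussian target and the softmax cross-entropy over pixels are assumptions borrowed from common keypoint-detection practice. These notes do not state the loss PARA uses, and the 448 × 448 resolution in the count is hypothetical.

```python
import numpy as np

def gaussian_target(height, width, u, v, sigma=2.0):
    """Dense target: a normalized Gaussian bump centered on the labeled pixel (u, v)."""
    ys, xs = np.mgrid[0:height, 0:width]
    bump = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
    return bump / bump.sum()

def heatmap_cross_entropy(logits, target):
    """Softmax cross-entropy over all pixels; every pixel's logit enters the loss."""
    flat = logits.reshape(-1) - logits.max()  # subtract the max for numerical stability
    log_probs = flat - np.log(np.exp(flat).sum())
    return float(-(target.reshape(-1) * log_probs).sum())

target = gaussian_target(64, 64, u=40, v=20)
uniform_loss = heatmap_cross_entropy(np.zeros((64, 64)), target)
peaked_loss = heatmap_cross_entropy(10.0 * target / target.max(), target)
print(uniform_loss, peaked_loss)

# Counting argument, following the slide's simplified count of one image per demo.
num_demos = 20
regression_labels = num_demos            # one coordinate label per image
pixel_signals = num_demos * 448 * 448    # hypothetical 448 x 448 input, 200,704 pixels each
print(regression_labels, pixel_signals)
```

A prediction that puts mass on the wrong pixels is penalized everywhere in the image, whereas coordinate regression only sees the error in a single output vector.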

---

### Slide 10: Real Robot Results (HERO SLIDE — let videos play)
**"97% vs 9% with 20 demonstrations"**

Video wall — all three tasks side by side:
- Pick-and-place: PARA 97% vs ACT 9% (`ours_20episodes_pickplace_8x.mp4` vs `act_baseline_exp_20ep_pickplace_8x.mp4`)
- Fold towel: PARA 97% vs ACT 11% (`ours_towelfolding_8x.mp4` vs `foldtowel_act_8x.mp4`)
- Wipe table: PARA 95% vs ACT 0% (`ours_towelwipe_8x.mp4` vs `act_wipetowel_8x.mp4`)

Don't over-narrate. The contrast speaks for itself.

---

### Slide 11: New Environment (SECOND HERO)
**"94% vs 0% in a completely new environment"**

- Side-by-side video: `ours_basemodel_newenv_8x.mp4` vs `act_newenv_8x.mp4`
- Same model, zero fine-tuning, never seen this room
- Also: new viewpoint zero-shot 52% vs 0%; new viewpoint + 5-episode fine-tune 87% vs 4%

---

### Slide 12: Spatial Extrapolation (LIBERO)
**"Train left, test right — PARA follows the object"**

- Distribution plot: `exp3_leftright_distribution.png`
- Rollout grids: `objpos_lr_para_rollout_grid.mp4` vs `objpos_lr_act_rollout_grid.mp4`
- PARA 54% vs ACT 1%
- *"ACT memorizes positions. PARA follows the object in the image."*

---

### Slide 13: Viewpoint Robustness (LIBERO)
**"PARA holds steady. ACT collapses."**

- Per-theta line chart: PARA plateau at ~62%, ACT cliff to 0%
- Rollout grids: `vp_default_to_all_para_rollout_grid.mp4` vs ACT version
- Overall: PARA 61% vs ACT 24%; ACT hits 0% past 14 degrees
- This chart is memorable: hold on it long enough for the cliff to register

---

### Slide 14: Video Models as Policy Backbones
**"Pixel alignment makes video models directly useful"**

- Video diffusion predicts future pixels → PARA reads off actions in the same space
- Rollout grids: `rollout_grid_para.mp4` vs `rollout_grid_global.mp4`
- PARA 92% vs global regression 0%
- Same backbone, same training — only the head differs
- *"Global regression cannot extract actions from video features. PARA can."*

---

### Slide 15: Summary
**"One change to the action head"**

| Setting | PARA | Baseline | Delta (pp) |
|---|---|---|---|
| Real robot pick-and-place (20 demos) | 97% | 9% | +88 |
| New environment (zero-shot) | 94% | 0% | +94 |
| Spatial extrapolation (LIBERO) | 54% | 1% | +53 |
| Viewpoint zero-shot (LIBERO) | 61% | 24% | +37 |
| Video backbone | 92% | 0% | +92 |

*"Predicting actions in pixel space rather than coordinate space gives you spatial robustness for free and makes video models natural policy backbones."*

No new backbone. No depth sensor. No 3D reconstruction. Just the right question.

---

### Slide 16: Future Work (optional)
- Second embodiment (new arm coming)
- Large-scale pretraining (DROID)
- Longer-horizon multi-step tasks
- Point track pretraining from human demonstrations

---

## Design Notes

- **Videos over tables.** Every results slide should have a playing video. Tables go in the paper.
- **Dark background** for video slides so robot footage pops.
- **Minimal text** — most slides: one headline + one visual.
- **Slides 10-11** (real robot results) take the most time. Let the audience watch.
- **Slide 6** (keypoint intuition) is the intellectual climax. Make sure the audience gets the "aha."
- All video files are in `/data/cameron/para/.agents/reports/project_site/media/`
- All Figma figures are in `/data/cameron/para/paper/figs/figma/`