# Advisor Meeting Notes — April 9, 2026

## 1. Video Model + PARA (New Result)

- Previously tried Cosmos Policy as the video backbone — didn't train well
- Switched to SVD (Stable Video Diffusion) fine-tuned on our task data
- **Two-stage training recipe:** pretrain video UNet (4K steps) → jointly fine-tune UNet + PARA heads (3K steps)
- **Result: 90% success (18/20)** vs 55% joint-from-scratch vs 0% frozen backbone
- Key finding: PARA heads + video features need joint co-adaptation (frozen UNet → 0%)
- Also ran global regression baseline on same video features: **0/20 (0%)**
  - Same UNet features, avg pool + MLP instead of spatial heatmap → complete failure
  - Confirms PARA's spatial inductive bias is critical for extracting actions from video features
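
The contrast between the two readouts can be illustrated with a toy sketch. It is not the PARA head or the SVD UNet: the single-channel grid and the names `make_map`, `global_pool`, and `heatmap_argmax` are hypothetical, chosen only for illustration. Real UNet features have many channels and may encode some position implicitly, so this shows the intuition rather than explaining the 0/20 result.

```python
# Toy illustration (not the PARA implementation): a globally pooled feature
# discards where an activation sits, while a heatmap readout keeps it.

def make_map(height, width, peak_row, peak_col):
    """Single-channel feature map with one activated cell (hypothetical)."""
    grid = [[0.0] * width for _ in range(height)]
    grid[peak_row][peak_col] = 1.0
    return grid

def global_pool(grid):
    """Average over all spatial positions, as an avg-pool readout would."""
    cells = [value for row in grid for value in row]
    return sum(cells) / len(cells)

def heatmap_argmax(grid):
    """Return (row, col) of the strongest response in the map."""
    positions = (
        (r, c) for r in range(len(grid)) for c in range(len(grid[0]))
    )
    return max(positions, key=lambda rc: grid[rc[0]][rc[1]])

# Same activation strength, placed on opposite sides of the image.
left = make_map(4, 6, peak_row=1, peak_col=0)
right = make_map(4, 6, peak_row=1, peak_col=5)

pooled_left, pooled_right = global_pool(left), global_pool(right)
loc_left, loc_right = heatmap_argmax(left), heatmap_argmax(right)

print("pooled:", pooled_left, pooled_right)   # identical scalars
print("heatmap:", loc_left, loc_right)        # distinct pixel locations
```

In this toy case the pooled value is the same for both maps, so any MLP on top of it cannot tell the two target positions apart, whereas the heatmap readout returns a different pixel for each.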

## 2. LIBERO OOD Generalization Experiments (Complete)

Six key experiments comparing PARA vs ACT, all on LIBERO spatial task 0:

| Experiment | PARA | ACT | Delta (pp) |
|---|---|---|---|
| Left → Right position extrapolation | 54% | 1% | +53 |
| Near → Far position extrapolation | 46% | 7% | +39 |
| Zero-shot viewpoint (default → all 64) | 61% | 24% | +37 |
| Viewpoint hemisphere (left → right) | 40% | 10% | +30 |
| N=32 data scaling | 54% | 33% | +21 |
| Distractor robustness | 28% | 10% | +18 |

- Viewpoint per-theta breakdown: PARA holds ~62% through 18°, ACT collapses to 0% after 14°
- Failure modes: PARA fails on gripper timing (drops objects); ACT fails by reaching to the wrong locations
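
As a quick sanity check on the table, the Delta column can be recomputed as a difference in percentage points (PARA minus ACT). This is a minimal sketch; the numbers are copied from the table above, and the `results` and `deltas` names are just illustrative.

```python
# Success rates in percent, copied from the table: (PARA, ACT).
results = {
    "Left → Right position extrapolation": (54, 1),
    "Near → Far position extrapolation": (46, 7),
    "Zero-shot viewpoint (default → all 64)": (61, 24),
    "Viewpoint hemisphere (left → right)": (40, 10),
    "N=32 data scaling": (54, 33),
    "Distractor robustness": (28, 10),
}

# Difference in percentage points, not a relative percent change.
deltas = {name: para - act for name, (para, act) in results.items()}

# Print experiments from largest to smallest gap.
for name, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {delta:+d} pp")
```

The gaps are absolute differences between two success rates, so they are better read as percentage points than as relative improvements.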

## 3. Real Robot Results (SO-100, existing)

- 3 tasks with 20 demos each: Pick-and-Place 97%, Fold Towel 97%, Wipe Table 95% (vs ACT 9%, 11%, 0%)
- OOD (PARA vs ACT): zero-shot viewpoint 52% vs 0%, new environment 94% vs 0%
- Also compared against Motion Tracks baseline

## 4. Next Steps

- **Building second robot arm (Panda)** — target: working within ~1 week
  - Camera calibration pipeline exists (ArUco-based), needs verification
  - Data collection and PARA deployment on Panda for multi-embodiment results
- **Paper writing in progress** — rewriting draft with current results, method renamed from KeyGrip to PARA
- **Considering additional baselines:** Diffusion Policy in LIBERO to show the OOD failure isn't ACT-specific but a property of coordinate regression
- **Point track pretraining (exploratory):** could collect ~100 hand demos from robot's camera viewpoint as cheap co-training data; larger-scale pretraining on DROID dataset (55% downloaded) as future direction
- **Figures:** Figma figures for method overview, results, and height-vs-depth illustration are ready; still need LIBERO results figure