# PARA Project Status — April 9, 2026

**Pixel-Aligned Robot Actions** — predicting actions in pixel space rather than coordinate space gives you spatial robustness for free and makes video models natural policy backbones.

**Headline numbers:** 50% vs 4% spatial extrapolation | 61% vs 24% viewpoint generalization | 92% vs 0% video backbone

![PARA Overview](https://omidlab.net/para_website/media/PARA_Overview_comparison_trainingdata.png)

| Experiment | OOD Axis | PARA | ACT | Delta (pp) |
|---|---|---|---|---|
| **(Sim)** Left → Right extrapolation | Object position | **54%** | 1% | **+53** |
| **(Sim)** Near → Far extrapolation | Object position | **46%** | 7% | **+39** |
| **(Sim)** Default → All viewpoints | Camera viewpoint | **61%** | 24% | **+37** |
| **(Real)** Pick and Place | Data efficiency | **97%** | 9% | **+88** |
| **(Real)** Fold Towel | Data efficiency | **97%** | 11% | **+86** |
| **(Real)** Wipe Table | Data efficiency | **95%** | 0% | **+95** |
| **(Real)** New Viewpoint | Camera viewpoint | **52%** | 0% | **+52** |
| **(Real)** New Viewpoint + 5ep F.T. | Camera viewpoint | **87%** | 4% | **+83** |
| **(Real)** New Environment | Visual transfer | **94%** | 0% | **+94** |
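
The Delta column is an absolute gap in percentage points, not a relative change. The sketch below recomputes it for the three sim rows and checks one reading of the headline spatial-extrapolation figure: it assumes (this is not stated in the notes) that 50% vs 4% is the unweighted mean of the two position-extrapolation rows. The dictionary keys are illustrative names only.

```python
# Success rates (%) copied from the summary table: (PARA, ACT).
sim_results = {
    "left_to_right": (54, 1),
    "near_to_far": (46, 7),
    "default_to_all_viewpoints": (61, 24),
}


def delta_pp(para, act):
    """Absolute gap in percentage points (PARA minus ACT)."""
    return para - act


deltas = {name: delta_pp(p, a) for name, (p, a) in sim_results.items()}

# Assumption: the headline spatial number averages the two position splits.
position_rows = [sim_results["left_to_right"], sim_results["near_to_far"]]
para_spatial = sum(p for p, _ in position_rows) / len(position_rows)
act_spatial = sum(a for _, a in position_rows) / len(position_rows)

print(deltas)
print(para_spatial, act_spatial)
```

Under that assumption the averages come out to 50 and 4, which matches the headline line.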

---

<details>
<summary><h2>Method: How PARA Works</h2></summary>

PARA decomposes end-effector action prediction into pixel localization and height estimation — two steps that are naturally equivariant to spatial and viewpoint changes.

![PARA Method Pipeline](https://omidlab.net/para_website/media/PARA_Method_Overview_ray.png)

**Three steps:**
1. **Where in the image and where along the ray?** Predict logits along each ray to get a dense 3D heatmap volume. Argmax gives (u, v, z).
2. **3D recovery.** Given (u, v, z), recover the 3D end-effector position via camera pose and intrinsics.
3. **Auxiliary states.** Gripper open/close and rotation predicted by indexing features at the predicted pixel.
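
The three steps can be sketched in NumPy as follows. This is a minimal illustration, not the project's implementation: it assumes the third heatmap axis indexes metric distance along each pixel's ray, a pinhole intrinsics matrix `K`, and a 4x4 camera-to-world pose. The names `decode_position`, `gripper_logit`, `ray_bins`, and the linear gripper head are all hypothetical.

```python
import numpy as np


def decode_position(logits, ray_bins, K, cam_to_world):
    """Argmax an (H, W, D) logit volume and lift the result to a world point."""
    v, u, d = np.unravel_index(np.argmax(logits), logits.shape)
    t = ray_bins[d]  # assumed: metric distance along the pixel's ray
    # Back-project the pixel to a unit ray in the camera frame.
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray_cam /= np.linalg.norm(ray_cam)
    # Rotate the ray into the world frame and march t along it.
    R, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return (int(u), int(v)), float(t), origin + t * (R @ ray_cam)


def gripper_logit(features, pixel, w, b):
    """Index a (C, H, W) feature map at the predicted pixel, then apply a
    hypothetical linear head to get an open/close logit."""
    u, v = pixel
    return float(features[:, v, u] @ w + b)


# Toy example: the peak sits at the principal point, in the fourth ray bin.
H, W, D = 48, 64, 8
ray_bins = np.linspace(0.2, 0.9, D)
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)
cam_to_world[:3, 3] = [0.1, 0.0, 0.3]

logits = np.zeros((H, W, D))
logits[24, 32, 3] = 5.0
pixel, t, xyz = decode_position(logits, ray_bins, K, cam_to_world)

features = np.zeros((4, H, W))
features[:, 24, 32] = 1.0
grip = gripper_logit(features, pixel, w=np.full(4, 0.5), b=-1.0)
print(pixel, t, xyz, grip)
```

If the model instead predicts height above the table or depth along the optical axis, only the line that turns `t` into a 3D point changes; the argmax and the feature indexing stay the same.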

**Key insight:** Because localization happens in pixel space, it is naturally equivariant to object translation and camera viewpoint changes. The model doesn't need to see every position or viewpoint — it just needs to find the object in the image.

</details>

---

<details open>
<summary><h2>Real-Robot Results (SO-100 Arm)</h2></summary>

The data-efficiency runs train and evaluate in the same viewpoint and environment, with 20 demos per task.

<details open>
<summary><h3>Multi-Task Data Efficiency</h3></summary>

**Pick and Place — PARA 97% vs ACT 9%**

<table><tr>
<td align="center"><video src="https://omidlab.net/para_website/media/ours_20episodes_pickplace_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>PARA — <strong>97%</strong></td>
<td align="center"><video src="https://omidlab.net/para_website/media/act_baseline_exp_20ep_pickplace_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>ACT — <strong>9%</strong></td>
</tr></table>

**Fold Towel — PARA 97% vs ACT 11%**

<table><tr>
<td align="center"><video src="https://omidlab.net/para_website/media/ours_towelfolding_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>PARA — <strong>97%</strong></td>
<td align="center"><video src="https://omidlab.net/para_website/media/foldtowel_act_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>ACT — <strong>11%</strong></td>
</tr></table>

**Wipe Table — PARA 95% vs ACT 0%**

<table><tr>
<td align="center"><video src="https://omidlab.net/para_website/media/ours_towelwipe_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>PARA — <strong>95%</strong></td>
<td align="center"><video src="https://omidlab.net/para_website/media/act_wipetowel_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>ACT — <strong>0%</strong></td>
</tr></table>

</details>

<details open>
<summary><h3>Robustness to Viewpoint and Environment</h3></summary>

Pick-and-place task. Trained at one viewpoint with 20 demos.

**New Viewpoint (zero-shot) — PARA 52% vs ACT 0%**

<table><tr>
<td align="center"><video src="https://omidlab.net/para_website/media/ours_newviewpoint_15ep_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>PARA — <strong>52%</strong></td>
<td align="center"><video src="https://omidlab.net/para_website/media/oodview_act_baseline_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>ACT — <strong>0%</strong></td>
</tr></table>

**New Viewpoint + 5ep Fine-Tune — PARA 87% vs ACT 4%**

<table><tr>
<td align="center"><video src="https://omidlab.net/para_website/media/ood_ft_ours_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>PARA — <strong>87%</strong></td>
<td align="center"><video src="https://omidlab.net/para_website/media/ood_ft_act_baseline_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>ACT — <strong>4%</strong></td>
</tr></table>

**New Environment — PARA 94% vs ACT 0%**

<table><tr>
<td align="center"><video src="https://omidlab.net/para_website/media/ours_basemodel_newenv_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>PARA — <strong>94%</strong></td>
<td align="center"><video src="https://omidlab.net/para_website/media/act_newenv_8x.mp4" muted autoplay loop playsinline width="100%"></video><br>ACT — <strong>0%</strong></td>
</tr></table>

</details>

</details>

---

<details open>
<summary><h2>Spatial Extrapolation (LIBERO Sim)</h2></summary>

Trained on one half of the workspace, tested on the other. PARA generalizes to unseen positions; ACT memorizes training locations.

<details open>
<summary><h3>Left → Right Position Extrapolation — PARA 54% vs ACT 1%</h3></summary>

128 demos, trained on left half, tested on right half.

![Left-Right Distribution](https://omidlab.net/para_website/media/exp3_leftright_distribution.png)

<table><tr>
<td align="center"><video src="https://omidlab.net/para_website/media/objpos_lr_para_rollout_grid.mp4" muted autoplay loop playsinline width="100%"></video><br>PARA — <strong>54%</strong></td>
<td align="center"><video src="https://omidlab.net/para_website/media/objpos_lr_act_rollout_grid.mp4" muted autoplay loop playsinline width="100%"></video><br>ACT — <strong>1%</strong></td>
</tr></table>

</details>

<details>
<summary><h3>Near → Far Position Extrapolation — PARA 46% vs ACT 7%</h3></summary>

Trained on near half (closer to robot), tested on far half.

![Near-Far Distribution](https://omidlab.net/para_website/media/exp3_train_test_distribution.png)

<table><tr>
<td align="center"><video src="https://omidlab.net/para_website/media/objpos_nf_para_rollout_grid.mp4" muted autoplay loop playsinline width="100%"></video><br>PARA — <strong>46%</strong></td>
<td align="center"><video src="https://omidlab.net/para_website/media/objpos_nf_act_rollout_grid.mp4" muted autoplay loop playsinline width="100%"></video><br>ACT — <strong>7%</strong></td>
</tr></table>

</details>

</details>

---

<details open>
<summary><h2>Viewpoint Robustness (LIBERO Sim)</h2></summary>

Trained at the default camera. Evaluated across 64 viewpoints up to 25° off-axis.

**Per-theta breakdown (PARA degrades gradually, ACT collapses):**

| Theta | 0° (train) | 3.6° | 7.1° | 10.7° | 14.3° | 17.9° | 21.4° | 25° |
|---|---|---|---|---|---|---|---|---|
| **PARA** | 88% | 79% | 62% | 63% | 62% | 62% | 33% | 38% |
| **ACT** | 67% | 54% | 42% | 17% | 12% | 0% | 0% | 0% |

PARA stays at roughly 62% from 7.1° through 17.9°, then drops to 33% and 38% at 21.4° and 25°. ACT falls to 0% from 17.9° onward.
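
As a quick sanity check on the table, the sketch below averages each row and compares it with the aggregate viewpoint numbers. It assumes every theta bin carries equal weight, which the notes do not state. It also checks that the listed thetas match eight evenly spaced values between 0° and 25°.

```python
# Per-theta success rates (%) copied from the table above.
thetas = [0.0, 3.6, 7.1, 10.7, 14.3, 17.9, 21.4, 25.0]
para = [88, 79, 62, 63, 62, 62, 33, 38]
act = [67, 54, 42, 17, 12, 0, 0, 0]


def mean(values):
    return sum(values) / len(values)


para_mean = mean(para)
act_mean = mean(act)

# Eight evenly spaced off-axis angles from 0 to 25 degrees, rounded to 0.1.
even_thetas = [round(25 * i / 7, 1) for i in range(8)]

# First angle at which ACT records no successes.
act_first_zero = next(t for t, a in zip(thetas, act) if a == 0)

print(para_mean, act_mean, even_thetas == thetas, act_first_zero)
```

The unweighted means are 60.875 and 24.0, consistent with the 61% vs 24% aggregate. That agreement fits an equal number of viewpoints per theta bin, though it does not prove it.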

<details open>
<summary><h3>Zero-Shot: Default → All Viewpoints — PARA 61% vs ACT 24%</h3></summary>

![Viewpoint Polar Overview](https://omidlab.net/para_website/media/vp_default_to_all_polar_overview.png)

<table><tr>
<td align="center"><video src="https://omidlab.net/para_website/media/vp_default_to_all_para_rollout_grid.mp4" muted autoplay loop playsinline width="100%"></video><br>PARA — <strong>61%</strong></td>
<td align="center"><video src="https://omidlab.net/para_website/media/vp_default_to_all_act_rollout_grid.mp4" muted autoplay loop playsinline width="100%"></video><br>ACT — <strong>24%</strong></td>
</tr></table>

</details>

</details>

---

<details open>
<summary><h2>Video Models as Policy Backbones</h2></summary>

Video diffusion models predict future pixels. PARA reads off actions in that same pixel-aligned space. Global regression cannot.

**Previously tried Cosmos Policy** — didn't train well. Switched to SVD (Stable Video Diffusion).

| Model | Training Steps | Success Rate |
|---|---|---|
| Joint from scratch | 10K | 55% |
| Frozen UNet + PARA only | 4K + 12K | 0% |
| **Two-stage: video 4K → joint 3K (PARA)** | **7K** | **90%** |
| Two-stage: video 4K → joint 2K (Global Regression) | 6K | **0%** |

<details open>
<summary><h3>PARA Heads vs Global Regression — Rollout Grids</h3></summary>

Same video backbone, same two-stage training. Only difference: PARA spatial heatmap vs avg pool + MLP.
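
One intuition for the gap, shown on a toy feature map rather than on the actual SVD features: a per-pixel heatmap head moves its argmax when the object moves, while average pooling discards where the activation was. The function names, the 1x1-style scoring weights, and the single-blob feature map are all illustrative assumptions.

```python
import numpy as np


def heatmap_head(features, w):
    """Score every pixel with a 1x1-conv-style weight vector, return argmax (u, v)."""
    scores = np.tensordot(w, features, axes=([0], [0]))  # (C,H,W) -> (H,W)
    v, u = np.unravel_index(np.argmax(scores), scores.shape)
    return int(u), int(v)


def global_pool(features):
    """Average over space: (C, H, W) -> (C,), the input a pooled MLP head sees."""
    return features.mean(axis=(1, 2))


C, H, W = 4, 16, 16
features = np.zeros((C, H, W))
features[:, 5, 3] = 1.0  # toy "object" at row 5, column 3

# Move the object by 2 rows and 6 columns.
shifted = np.roll(features, shift=(2, 6), axis=(1, 2))

w = np.ones(C)
before = heatmap_head(features, w)
after = heatmap_head(shifted, w)
pooled_identical = np.allclose(global_pool(features), global_pool(shifted))
print(before, after, pooled_identical)
```

In this toy case the heatmap prediction follows the object while the pooled vector is unchanged. Real backbone features are not perfectly shift-invariant after pooling, so this is an illustration of the failure mode, not a proof of it.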

<table><tr>
<td align="center"><video src="https://omidlab.net/para_website/media/rollout_grid_para.mp4" muted autoplay loop playsinline width="100%"></video><br>SVD + PARA — <strong>92%</strong></td>
<td align="center"><video src="https://omidlab.net/para_website/media/rollout_grid_global.mp4" muted autoplay loop playsinline width="100%"></video><br>SVD + Global Regression — <strong>0%</strong></td>
</tr></table>

</details>

</details>

---

<details open>
<summary><h2>TODOs</h2></summary>

- **Second Embodiment** — Panda rant 😮‍💨, new arm coming soon (within 1 week)
- **Video Model in Real-World Task** — currently evaluated only in LIBERO sim.
- **Paper writing** — draft being rewritten with all results (renamed KeyGrip → PARA), Figma figures ready
- **Additional baselines** — considering Diffusion Policy in LIBERO
- **Point track pretraining (exploratory)** — collect ~100 hand demos from robot camera viewpoint as cheap co-training
- **DROID pretraining** — Missing Posed-Droid dataset.

![New robot arm](https://omidlab.net/meeting_notes/2026-04-09-para-status/media/new_robot_arx_comparison_2links.png)

</details>
