# PARA Project Status: April 9, 2026

**Pixel-Aligned Robot Actions**: predicting actions in pixel space rather than coordinate space gives you spatial robustness for free and makes video models natural policy backbones.

**Headline numbers:** 50% vs 4% spatial extrapolation | 61% vs 24% viewpoint generalization | 92% vs 0% video backbone

![PARA Overview](https://omidlab.net/para_website/media/PARA_Overview_comparison_trainingdata.png)

| Experiment | OOD Axis | PARA | ACT | Delta |
|---|---|---|---|---|
| **(Sim)** Left → Right extrapolation | Object position | **54%** | 1% | **+53%** |
| **(Sim)** Near → Far extrapolation | Object position | **46%** | 7% | **+39%** |
| **(Sim)** Default → All viewpoints | Camera viewpoint | **61%** | 24% | **+37%** |
| **(Real)** Pick and Place | Data efficiency | **97%** | 9% | **+88%** |
| **(Real)** Fold Towel | Data efficiency | **97%** | 11% | **+86%** |
| **(Real)** Wipe Table | Data efficiency | **95%** | 0% | **+95%** |
| **(Real)** New Viewpoint | Camera viewpoint | **52%** | 0% | **+52%** |
| **(Real)** New Viewpoint + 5ep F.T. | Camera viewpoint | **87%** | 4% | **+83%** |
| **(Real)** New Environment | Visual transfer | **94%** | 0% | **+94%** |

---

## Method: How PARA Works

PARA decomposes end-effector action prediction into pixel localization and height estimation, two steps that are naturally equivariant to spatial and viewpoint changes.

![PARA Method Pipeline](https://omidlab.net/para_website/media/PARA_Method_Overview_ray.png)

**Three steps:**

1. **Where in the image and where along the ray?** Predict logits along each ray to get a dense 3D heatmap volume. Argmax gives (u, v, z).
2. **3D recovery.** Given (u, v, z), recover the 3D end-effector position via camera pose and intrinsics.
3. **Auxiliary states.** Gripper open/close and rotation are predicted by indexing features at the predicted pixel.

**Key insight:** Because localization happens in pixel space, it is naturally equivariant to object translation and camera viewpoint changes. The model doesn't need to see every position or viewpoint; it just needs to find the object in the image.
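The three steps above can be sketched in NumPy roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it assumes the heatmap volume is indexed as `(Z, H, W)`, that each z bin maps to a world-frame height through a hypothetical `heights` array, that the world z-axis points up, and that the camera pose is available as a camera-to-world rotation `R_wc` plus a camera center `cam_center`. All function names are invented for illustration.

```python
import numpy as np

def decode_argmax(volume):
    """Return (u, v, k): pixel column, pixel row, and height-bin index of the peak.

    Assumes `volume` holds logits with shape (Z, H, W).
    """
    k, v, u = np.unravel_index(np.argmax(volume), volume.shape)
    return int(u), int(v), int(k)

def backproject_to_height(u, v, z_world, K, R_wc, cam_center):
    """Intersect the camera ray through pixel (u, v) with the plane z = z_world."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray in camera frame
    ray_world = R_wc @ ray_cam                           # rotate into world frame
    if abs(ray_world[2]) < 1e-9:
        raise ValueError("ray is parallel to the height plane")
    t = (z_world - cam_center[2]) / ray_world[2]
    return cam_center + t * ray_world

def index_features(features, u, v):
    """Pick the feature vector at the predicted pixel; `features` is (C, H, W)."""
    return features[:, v, u]

# Toy setup (assumed values): camera 1 m above the origin, looking straight down.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
R_wc = np.diag([1.0, -1.0, -1.0])      # camera z-axis points along world -z
cam_center = np.array([0.0, 0.0, 1.0])
heights = np.linspace(0.0, 0.3, 4)     # hypothetical height bins in meters

volume = np.zeros((4, 100, 100))
volume[0, 50, 75] = 5.0                # peak at height bin 0, row 50, column 75

u, v, k = decode_argmax(volume)
position = backproject_to_height(u, v, heights[k], K, R_wc, cam_center)
print(u, v, k, position)
```

In a real model, the vector returned by `index_features` would feed small heads for gripper state and rotation; those heads are omitted here because the source does not describe them.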

---

## Real-Robot Results (SO-100 Arm)

Same viewpoint and environment. 20 demos per task.

### Multi-Task Data Efficiency

**Pick and Place — PARA 97% vs ACT 9%**

PARA — 97%

ACT — 9%

**Fold Towel — PARA 97% vs ACT 11%**

PARA — 97%

ACT — 11%

**Wipe Table — PARA 95% vs ACT 0%**

PARA — 95%

ACT — 0%

### Robustness to Viewpoint and Environment

Pick-and-place task. Trained at one viewpoint with 20 demos.

**New Viewpoint (zero-shot): PARA 52% vs ACT 0%**

PARA — 52%

ACT — 0%

**New Viewpoint + 5ep Fine-Tune — PARA 87% vs ACT 4%**

PARA — 87%

ACT — 4%

**New Environment — PARA 94% vs ACT 0%**

PARA — 94%

ACT — 0%

---

## Spatial Extrapolation (LIBERO Sim)

Trained on one half of the workspace, tested on the other. PARA generalizes to unseen positions; ACT memorizes training locations.

### Left → Right Position Extrapolation: PARA 54% vs ACT 1%

128 demos, trained on left half, tested on right half.

![Left-Right Distribution](https://omidlab.net/para_website/media/exp3_leftright_distribution.png)

PARA — 54%

ACT — 1%

### Near → Far Position Extrapolation: PARA 46% vs ACT 7%

Trained on near half (closer to robot), tested on far half.

![Near-Far Distribution](https://omidlab.net/para_website/media/exp3_train_test_distribution.png)

PARA — 46%

ACT — 7%

---

## Viewpoint Robustness (LIBERO Sim)

Trained at the default camera. Evaluated across 64 viewpoints up to 25° off-axis.

**Per-theta breakdown: PARA degrades gradually, ACT collapses:**

| Theta | 0° (train) | 3.6° | 7.1° | 10.7° | 14.3° | 17.9° | 21.4° | 25° |
|---|---|---|---|---|---|---|---|---|
| **PARA** | 88% | 79% | 62% | 63% | 62% | 62% | 33% | 38% |
| **ACT** | 67% | 54% | 42% | 17% | 12% | 0% | 0% | 0% |

PARA holds at about 62% from 7.1° through 17.9°, then drops to 33% and 38% at the two widest angles. ACT falls to 12% at 14.3° and to 0% from 17.9° onward.
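As a quick consistency check, the per-theta rows can be averaged and compared against the 61% vs 24% aggregate reported for this experiment. The sketch below assumes every theta bin is weighted equally; the source does not state how the aggregate was computed, so treat the match as suggestive rather than as the official procedure.

```python
# Per-theta success rates (%) copied from the breakdown table.
para = [88, 79, 62, 63, 62, 62, 33, 38]
act = [67, 54, 42, 17, 12, 0, 0, 0]

def mean(values):
    """Unweighted mean; assumes each theta bin has the same number of rollouts."""
    return sum(values) / len(values)

para_mean = mean(para)
act_mean = mean(act)
print(f"PARA {para_mean:.1f}%  ACT {act_mean:.1f}%")
```

Under that equal-weighting assumption the means round to 61% and 24%, which agrees with the aggregate numbers.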

### Zero-Shot: Default → All Viewpoints, PARA 61% vs ACT 24%

![Viewpoint Polar Overview](https://omidlab.net/para_website/media/vp_default_to_all_polar_overview.png)

PARA — 61%

ACT — 24%

---

## Video Models as Policy Backbones

Video diffusion models predict future pixels. PARA reads off actions in that same pixel-aligned space. Global regression cannot.

**Previously tried Cosmos Policy:** didn't train well. Switched to SVD (Stable Video Diffusion).

| Model | Training Steps | Success Rate |
|---|---|---|
| Joint from scratch | 10K | 55% |
| Frozen UNet + PARA only | 4K + 12K | 0% |
| **Two-stage: video 4K → joint 3K (PARA)** | **7K** | **90%** |
| Two-stage: video 4K → joint 2K (Global Regression) | 6K | **0%** |

### PARA Heads vs Global Regression: Rollout Grids

Same video backbone, same two-stage training. Only difference: PARA spatial heatmap vs avg pool + MLP.

SVD + PARA — 92%

SVD + Global Regression — 0%
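The structural difference between the two heads can be illustrated with a toy NumPy example. This is only a sketch of the general idea, with hypothetical weights and a hand-made feature map; it does not reproduce the SVD experiments. A real regressor operating on richer features can still encode some position information. The point is narrower: a per-pixel score followed by argmax moves with the object, while global average pooling discards where the activation occurred.

```python
import numpy as np

def heatmap_head(features, w):
    """Per-pixel linear scoring (a 1x1 convolution), then argmax -> (row, col)."""
    scores = np.tensordot(w, features, axes=([0], [0]))  # shape (H, W)
    return tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))

def pooled_features(features):
    """Global average pooling: one vector per image, spatial layout discarded."""
    return features.mean(axis=(1, 2))

# Toy feature map (C=2, H=8, W=8) with a single "object" activation.
features = np.zeros((2, 8, 8))
features[:, 2, 3] = [1.0, 2.0]
w = np.array([0.5, 1.0])  # hypothetical head weights

# Move the object 4 columns to the right.
shifted = np.roll(features, shift=4, axis=2)

loc_a = heatmap_head(features, w)
loc_b = heatmap_head(shifted, w)
same_pooled = np.allclose(pooled_features(features), pooled_features(shifted))
print(loc_a, loc_b, same_pooled)
```

The heatmap prediction shifts by the same 4 columns as the object, whereas the pooled vectors for the two images are identical, so any MLP applied after pooling receives the same input in both cases.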

---

## TODOs

- **Second Embodiment**: Panda rant 😮‍💨, new arm coming soon (within 1 week)
- **Video Model in Real-World Task**: only on LIBERO for now
- **Paper writing**: draft being rewritten with all results (renamed KeyGrip → PARA), Figma figures ready
- **Additional baselines**: considering Diffusion Policy in LIBERO
- **Point track pretraining (exploratory)**: collect ~100 hand demos from robot camera viewpoint as cheap co-training
- **DROID pretraining**: missing Posed-Droid dataset

![New robot arm](https://omidlab.net/meeting_notes/2026-04-09-para-status/media/new_robot_arx_comparison_2links.png)