Pixel-Aligned Robot Actions
Predicting actions in pixel space rather than coordinate space gives you spatial robustness for free—and makes video models natural policy backbones.
| Experiment | OOD Axis | PARA | ACT | Delta |
|---|---|---|---|---|
| Left → Right extrapolation | Object position | 54% | 1% | +53% |
| Near → Far extrapolation | Object position | 46% | 7% | +39% |
| Default → All viewpoints | Camera viewpoint | 61% | 24% | +37% |
| Left → Right hemisphere | Camera viewpoint | 40% | 10% | +30% |
| N=32 corner scaling | Data efficiency | 54% | 33% | +21% |
| Distractor robustness | Visual clutter | 28% | 10% | +18% |
Zero-shot viewpoint generalization: trained at the default camera only, evaluated across 64 viewpoints. Green = success, red = failure.
Method
PARA decomposes end-effector action prediction into steps that are naturally equivariant to spatial and viewpoint changes.
Standard policies (e.g., ACT) regress actions from a global CLS token, forcing the model to implicitly solve correspondence, geometry, and control in an unstructured output space.
PARA instead predicts in three pixel-aligned steps: (1) localize the target as a pixel in the image, (2) estimate height at that pixel, and (3) predict gripper open/close and rotation by indexing the feature map at the predicted pixel location.
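The three-step decomposition can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's code: the helper names (`soft_argmax`, `para_action`), tensor shapes, and the linear gripper head are all my assumptions.

```python
import numpy as np

def soft_argmax(heatmap_logits):
    """Spatial softmax over an (H, W) logit map, returning the
    probability-weighted expected pixel location (row, col)."""
    h, w = heatmap_logits.shape
    p = np.exp(heatmap_logits - heatmap_logits.max())
    p /= p.sum()
    rows, cols = np.mgrid[0:h, 0:w]
    return np.array([(p * rows).sum(), (p * cols).sum()])

def para_action(heatmap_logits, height_map, feature_map, w_grip):
    """Decode one action in pixel-aligned fashion:
    1) localize the end-effector target as a pixel,
    2) read the height estimate at that pixel,
    3) predict gripper state from the feature vector at that pixel
       (here a hypothetical linear head `w_grip`)."""
    px = soft_argmax(heatmap_logits)
    r, c = np.round(px).astype(int)
    z = height_map[r, c]                # per-pixel height estimate
    grip = feature_map[r, c] @ w_grip   # indexed feature -> gripper logit
    return px, z, grip
```

Because every output is read off at an image location, translating the object (or the camera) translates the prediction with it, which is the source of the spatial equivariance claimed above.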
Heatmap evolution: diffuse, uncertain heatmaps early in training; sharp, peaked predictions after convergence.
Result 1: Object Position Extrapolation
Trained on one half of the workspace, tested on the other. PARA generalizes to unseen positions; ACT memorizes training locations.
54% vs 1% — 128 demos, trained left, tested right
Green = train (left half), Blue = test (right half)
Far-right position — never seen, still succeeds
Same position — ACT reaches to memorized training location
46% vs 7% — trained near, tested far
Green = train (near half), Blue = test (far half)
Result 2: Zero-Shot Viewpoint Generalization
Trained at the default camera. Evaluated across 64 viewpoints up to 25° off-axis. PARA holds steady; ACT collapses.
PARA holds ~62% through 18° then gracefully degrades. ACT drops from 67% to 0% by 18°.
61% vs 24%
Green = single train viewpoint
Blue = 64 test viewpoints
40% vs 10% — trained left hemisphere, tested right
Green = left hemi train
Blue = right hemi test
Train vs test viewpoint frames
Result 3: Video Models as Policy Backbones
Video diffusion models predict future pixel states. PARA reads off actions in that same pixel-aligned space, requiring far fewer demonstrations.
Two-stage recipe: (1) pretrain video diffusion UNet for 4K steps, (2) jointly fine-tune UNet + PARA heads for 3K steps.
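The recipe can be written down as a small schedule generator. A minimal sketch: the function and module names are mine; only the step counts come from the text and table.

```python
def two_stage_schedule(video_steps=4_000, joint_steps=3_000):
    """Yield (step, stage, trainable_modules) for the two-stage recipe:
    first pretrain the video diffusion UNet alone, then jointly
    fine-tune the UNet together with the PARA action heads."""
    # Stage 1: video diffusion pretraining, UNet only.
    for step in range(video_steps):
        yield step, "video_pretrain", ("unet",)
    # Stage 2: joint fine-tuning of UNet + PARA heads.
    for step in range(video_steps, video_steps + joint_steps):
        yield step, "joint_finetune", ("unet", "para_heads")
```

Keeping the UNet trainable in stage 2 matters: per the table below, freezing it and training PARA heads alone reaches 0%.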
| Model | Steps | Success |
|---|---|---|
| Joint from scratch | 10K | 55% |
| Frozen UNet + PARA only | 4K + 12K | 0% |
| Two-stage: video 4K → joint 3K (PARA) | 7K | 92% |
| Two-stage: video 4K → joint 2K (Global Regression) | 6K | 0% |
Generated video rollout frames (Frame 0, Frame 2)
4×3 grid of evaluation rollouts. Same video backbone, same two-stage training. Only difference: PARA spatial heatmap vs avg pool + MLP.
11/12 success — PARA reads actions in pixel space
0/12 success — global head cannot learn precise positions
Temporal alignment between generated video frames and PARA heatmap predictions
Additional Experiments
28% vs 10% — trained clean, tested with visual clutter
Clean training → cluttered test
54% vs 33%
N=32 training demo distribution
Sample demonstrations from the LIBERO simulation benchmark (Demo 0, Demo 128, Demo 255).
Details
Both PARA and ACT use a DINOv2 ViT-S/16 backbone. All experiments use teleport servo evaluation, zero rotation, clean scenes, and ~10 minutes of training. Object position dataset: 16×16 grid (256 demos) across a 39 cm × 60 cm workspace. Viewpoint dataset: 8×8 viewpoint grid (θmax=25°) × 10 demos per viewpoint.
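For concreteness, the object-position grid implied by those numbers can be reconstructed as follows. Aligning the grid's corners to the workspace edges is my assumption; only the 16×16 count and 39 cm × 60 cm extent come from the text.

```python
import numpy as np

# 16×16 object positions over a 39 cm × 60 cm workspace = 256 demos.
xs = np.linspace(0.0, 0.39, 16)  # metres along the 39 cm axis
ys = np.linspace(0.0, 0.60, 16)  # metres along the 60 cm axis
grid = np.stack(np.meshgrid(xs, ys, indexing="ij"), axis=-1)

# Spacing between adjacent demo positions.
dx, dy = xs[1] - xs[0], ys[1] - ys[0]  # ~2.6 cm and ~4.0 cm
```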