Pixel-Aligned Robot Actions Are More Data-Efficient and Robust to New Viewpoints and Environments
We reformulate robot actions as a per-pixel regression problem, enabling dramatically improved data efficiency and generalization compared to global coordinate regression, thanks to dense supervision and shift equivariance.
| Experiment | OOD Axis | PARA | ACT | Delta |
|---|---|---|---|---|
| (Sim) Left → Right extrapolation | Object position | 54% | 1% | +53% |
| (Sim) Near → Far extrapolation | Object position | 46% | 7% | +39% |
| (Sim) Default → All viewpoints | Camera viewpoint | 61% | 24% | +37% |
| (Real) Pick and Place | Data efficiency | 97% | 9% | +88% |
| (Real) Fold Towel | Data efficiency | 97% | 11% | +86% |
| (Real) Wipe Table | Data efficiency | 95% | 0% | +95% |
| (Real) New Viewpoint | Camera viewpoint | 52% | 0% | +52% |
| (Real) New Viewpoint + 5ep F.T. | Camera viewpoint | 87% | 4% | +83% |
| (Real) New Environment | Visual transfer | 94% | 0% | +94% |
Method
PARA decomposes end-effector action prediction into pixel localization and height estimation—two steps that are naturally equivariant to spatial and viewpoint changes.
Standard policies (e.g., ACT) regress actions from a global CLS token, forcing the model to implicitly solve correspondence, geometry, and control in an unstructured output space.
PARA instead makes each part explicit: pixel localization and height estimation give the end-effector position, and gripper open/close and rotation are predicted by indexing the feature map at the predicted pixel location.
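The decoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the per-pixel height map, and the linear gripper and rotation heads (`w_grip`, `w_rot`) are all assumptions made for the example.

```python
import numpy as np

def decode_pixel_aligned_action(heatmap_logits, height_map, features, w_grip, w_rot):
    """Decode one action from dense per-pixel predictions (illustrative sketch).

    heatmap_logits: (H, W) score for the end-effector being at each pixel
    height_map:     (H, W) per-pixel height estimate (assumed representation)
    features:       (H, W, C) dense feature map
    w_grip:         (C,)   hypothetical linear gripper head
    w_rot:          (C, R) hypothetical linear rotation head
    """
    _, width = heatmap_logits.shape

    # Pixel localization: take the highest-scoring pixel.
    flat_idx = int(np.argmax(heatmap_logits))
    v, u = divmod(flat_idx, width)  # v = row, u = column

    # Height estimation: read the height predicted at that pixel.
    height = float(height_map[v, u])

    # Gripper and rotation: index the feature map at the predicted pixel,
    # then apply small heads to that single feature vector.
    f = features[v, u]
    grip_prob = 1.0 / (1.0 + np.exp(-(f @ w_grip)))
    rotation = f @ w_rot

    return {
        "pixel": (u, v),
        "height": height,
        "gripper_open": bool(grip_prob > 0.5),
        "rotation": rotation,
    }
```

Because every output is read at the predicted pixel, moving the object in the image moves the readout location with it. None of the heads has to encode absolute position.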
Real-Robot Experiments
Same viewpoint and environment. 20 demos per task. PARA vs ACT on a real SO-100 arm.
Success rates (PARA vs ACT): Pick and Place 97% vs 9%; Fold Towel 97% vs 11%; Wipe Table 95% vs 0%.
Pick-and-place task. Trained at one viewpoint with 20 demos. Tested zero-shot at a new viewpoint, after 5-episode fine-tuning, and in a completely new environment.
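Success rates (PARA vs ACT): new viewpoint, zero-shot 52% vs 0%; new viewpoint after 5-episode fine-tuning 87% vs 4%; new environment 94% vs 0%.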
Result 1
Trained on one half of the workspace, tested on the other. PARA generalizes to unseen positions; ACT memorizes training locations.
54% vs 1% (PARA vs ACT): 128 demos, trained left, tested right
46% vs 7% (PARA vs ACT): trained near, tested far
Result 2
Trained at the default camera. Evaluated across 64 viewpoints up to 25° off-axis. PARA holds steady; ACT collapses.
PARA holds ~62% through 18° then gracefully degrades. ACT drops from 67% to 0% by 18°.
61% vs 24% (PARA vs ACT) across all viewpoints
Result 3
Video diffusion models predict future video. How can the video model's features be used as a robot policy? PARA's pixel-aligned regression head keeps robot actions closely aligned with the video rollout, whereas a global regression head (CNN) is less aligned and misses grasps.
Evaluation rollouts. Same video backbone, same two-stage training. Only difference: PARA spatial heatmap vs avg pool + MLP.
11/12 success — PARA reads actions in pixel space
0/12 success — global head cannot learn precise positions
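A toy example may help show why a spatial heatmap head and a pooled global head behave so differently. The sketch below is not the experiment above: the feature map, weights, and function names are made up for illustration, and a real backbone can carry some positional information through pooling. It shows only that a per-pixel readout moves with the input, while global average pooling discards location.

```python
import numpy as np

def heatmap_head(features, w):
    """Per-pixel linear scoring (equivalent to a 1x1 convolution), then argmax."""
    scores = features @ w  # shape (H, W)
    v, u = np.unravel_index(np.argmax(scores), scores.shape)
    return int(u), int(v)

def global_pool(features):
    """Global average pooling: collapses all spatial positions into one vector."""
    return features.mean(axis=(0, 1))

H, W, C = 16, 16, 3
features = np.zeros((H, W, C))
features[4, 6] = [1.0, 0.5, -0.2]  # a single "object" response at row 4, col 6
w = np.array([1.0, 1.0, 0.0])      # hypothetical scoring weights

# Shift the object by 3 rows and 5 columns.
shifted = np.roll(features, shift=(3, 5), axis=(0, 1))

pix_before = heatmap_head(features, w)  # (u, v) of the peak before the shift
pix_after = heatmap_head(shifted, w)    # the peak moves by the same offset

# The pooled vectors are identical, so any MLP applied to them
# produces the same output for both object positions.
pooled_same = np.allclose(global_pool(features), global_pool(shifted))
```

In this toy setting the heatmap prediction shifts by exactly the same offset as the object. The pooled vector is unchanged, so no head stacked on it can recover where the object went.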
Result 4
Pretrain on circle-overlay data (robot hidden, orange circle marks the end-effector), then fine-tune on real robot demos. PARA’s pixel-aligned head transfers pretraining effectively; ACT’s global head does not.