PARA

Pixel-Aligned Robot Actions

Are More Data-Efficient and Robust to New Viewpoints and Environments

We reformulate robot action prediction as a per-pixel regression problem. Compared with global coordinate regression, this yields markedly better data efficiency and generalization, thanks to dense supervision and shift equivariance.

Spatial extrapolation: 50% vs 4%
Viewpoint generalization: 61% vs 24%
Video backbone + PARA: 92% vs 0%
PARA project overview — pipeline and OOD robustness
Experiment | OOD Axis | PARA | ACT | Delta
(Sim) Left → Right extrapolation | Object position | 54% | 1% | +53%
(Sim) Near → Far extrapolation | Object position | 46% | 7% | +39%
(Sim) Default → All viewpoints | Camera viewpoint | 61% | 24% | +37%
(Real) Pick and Place | Data efficiency | 97% | 9% | +88%
(Real) Fold Towel | Data efficiency | 97% | 11% | +86%
(Real) Wipe Table | Data efficiency | 95% | 0% | +95%
(Real) New Viewpoint | Camera viewpoint | 52% | 0% | +52%
(Real) New Viewpoint + 5ep F.T. | Camera viewpoint | 87% | 4% | +83%
(Real) New Environment | Visual transfer | 94% | 0% | +94%

How PARA Works

PARA decomposes end-effector action prediction into pixel localization and height estimation—two steps that are naturally equivariant to spatial and viewpoint changes.

PARA pipeline: RGB → DINO → features → conv → ray logits → unproject
Height prediction is view-invariant

Standard policies (e.g., ACT) regress actions from a global CLS token, forcing the model to implicitly solve correspondence, geometry, and control in an unstructured output space.

PARA decomposes this into two steps:

  1. Where in the image and where along the ray? Predict logits along each ray of the image to get a dense 3D heatmap volume in the camera frame. Argmax gives (u, v, z).
  2. 3D recovery. Given (u, v, z), recover the 3D end-effector position via camera pose and intrinsics.
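The two steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the bins along each ray are indexed by candidate world-frame heights (the text only says "where along the ray"), and the names `heights`, `R_wc`, and `t_wc` are hypothetical. Recovery is done by intersecting the camera ray through the chosen pixel with the horizontal plane at the chosen height.

```python
import numpy as np

def argmax_uvz(logits, heights):
    """logits: (H, W, D) ray logits; heights: (D,) candidate height per bin (assumed).
    Returns pixel column u, pixel row v, and the selected height z."""
    v, u, k = np.unravel_index(np.argmax(logits), logits.shape)
    return int(u), int(v), float(heights[k])

def unproject_to_height(u, v, z, K, R_wc, t_wc):
    """Intersect the camera ray through pixel (u, v) with the world plane Z = z.
    K: 3x3 intrinsics. R_wc: camera-to-world rotation. t_wc: camera center in world."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction, camera frame
    d_world = R_wc @ d_cam                            # ray direction, world frame
    if abs(d_world[2]) < 1e-9:
        raise ValueError("ray is parallel to the height plane")
    s = (z - t_wc[2]) / d_world[2]                    # distance along the ray
    return t_wc + s * d_world
```

For example, with a camera looking straight down from one meter above the origin, a pixel offset to the right of the principal point maps to a point offset along world X at the selected height.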

Gripper open/close and rotation are predicted by indexing the feature map at the predicted pixel location.
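Reading out gripper state and rotation at the predicted pixel might look like the sketch below. The linear heads `W_grip` and `W_rot`, the zero threshold on the gripper logit, and the assumption that the feature map shares the image resolution are all hypothetical; the page does not specify the head architecture or the rotation parameterization.

```python
import numpy as np

def read_out_at_pixel(features, u, v, W_grip, W_rot):
    """features: (H, W, C) feature map; (u, v) = (column, row) of the predicted pixel.
    W_grip: (C,) weights for a gripper open/close logit (hypothetical linear head).
    W_rot: (C, R) weights producing R rotation outputs (hypothetical linear head)."""
    f = features[v, u]                 # (C,) feature vector at the predicted pixel
    grip_logit = float(f @ W_grip)     # scalar logit; positive means "open" here
    gripper_open = grip_logit > 0.0
    rotation = f @ W_rot               # (R,) rotation outputs
    return gripper_open, rotation
```

If the feature map is coarser than the image, (u, v) would first need to be scaled down to feature-map coordinates.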

Key insight: Because localization happens in pixel space, it is naturally equivariant to object translation and camera viewpoint changes. The model doesn’t need to see every position or viewpoint—it just needs to find the object in the image.
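The equivariance claim can be checked on a toy example. The sketch below stands in for the learned localizer with a fixed template correlation (an assumption made purely for illustration, not PARA's network): when the object shifts in the image, the heatmap peak shifts by the same amount, with no retraining.

```python
import numpy as np

def heatmap_argmax(image, template):
    """Score each pixel by correlation with a small template; return (u, v) of the peak."""
    H, W = image.shape
    h, w = template.shape
    scores = np.full((H, W), -np.inf)
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            scores[r, c] = np.sum(image[r:r + h, c:c + w] * template)
    v, u = np.unravel_index(np.argmax(scores), scores.shape)
    return int(u), int(v)

# Toy scene: a 2x2 "object" on an empty background.
image = np.zeros((16, 16))
image[3:5, 4:6] = 1.0
template = np.ones((2, 2))

# Move the object 5 rows down and 7 columns right.
shifted = np.roll(image, shift=(5, 7), axis=(0, 1))

u0, v0 = heatmap_argmax(image, template)
u1, v1 = heatmap_argmax(shifted, template)
print("peak moved by", (u1 - u0, v1 - v0))
```

A global regression head has no such built-in property: it has to learn the mapping from each object position to output coordinates from data.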

Real-Robot Results

Multi-Task Data Efficiency

Same viewpoint and environment. 20 demos per task. PARA vs ACT on a real SO-100 arm.

Pick and Place: PARA 97%, ACT 9%
Fold Towel: PARA 97%, ACT 11%
Wipe Table: PARA 95%, ACT 0%

Robustness to Viewpoint and Environment

Pick-and-place task. Trained at one viewpoint with 20 demos. Tested zero-shot at a new viewpoint, after 5-episode fine-tuning, and in a completely new environment.

New Viewpoint (zero-shot): PARA 52%, ACT 0%
New Viewpoint + 5ep Fine-Tune: PARA 87%, ACT 4%
New Environment: PARA 94%, ACT 0%

Spatial Extrapolation

Trained on one half of the workspace, tested on the other. PARA generalizes to unseen positions; ACT memorizes training locations.

OOD generalization analysis — spatial + viewpoint

Left → Right Position Extrapolation

54% vs 1% (128 demos; trained left, tested right)

Left-right distribution
PARA — 54%
ACT — 1%

Near → Far Position Extrapolation

46% vs 7% (trained near, tested far)

Near-far distribution
PARA — 46%
ACT — 7%

Viewpoint Robustness

Trained at the default camera. Evaluated across 64 viewpoints up to 25° off-axis. PARA holds steady; ACT collapses.

Per-Theta Breakdown

PARA holds ~62% through 18° then gracefully degrades. ACT drops from 67% to 0% by 18°.

Zero-Shot: Default → All Viewpoints

61% vs 24%

Polar overview
PARA — 61%
ACT — 24%
Why does this work? PARA predicts actions in pixel space. When the camera moves, the object moves in the image—but the model just follows it. ACT must learn a mapping from every viewpoint to absolute coordinates, which doesn’t generalize.

Video Models as Policy Backbones

Video diffusion models predict future video frames. How can their features be used as a robot policy? A PARA pixel-aligned regression head keeps the predicted robot actions closely aligned with the video rollout, whereas a global regression head (CNN) is less well aligned and misses grasps.

Video backbone pipeline — SVD → PARA vs global regression

PARA Heads vs Global Regression

Evaluation rollouts. Same video backbone, same two-stage training. Only difference: PARA spatial heatmap vs avg pool + MLP.
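The structural difference between the two heads can be sketched as follows. This is a toy comparison under assumptions: the weights `W1`, `W2`, and `w` are hypothetical, the shift is circular, and real backbone features carry some positional information that this example ignores. It shows only that average pooling discards where a feature occurred, while a per-pixel heatmap keeps it.

```python
import numpy as np

def global_head(features, W1, W2):
    """Average-pool (H, W, C) features to (C,), then a two-layer MLP to a 3D position."""
    pooled = features.mean(axis=(0, 1))
    hidden = np.maximum(pooled @ W1, 0.0)   # ReLU
    return hidden @ W2

def heatmap_head(features, w):
    """1x1 convolution (per-pixel dot product) to logits, then argmax over pixels."""
    logits = features @ w                   # (H, W)
    v, u = np.unravel_index(np.argmax(logits), logits.shape)
    return int(u), int(v)

rng = np.random.default_rng(0)
features = 0.1 * rng.normal(size=(8, 8, 4))
features[2, 3] = 10.0                       # a strong "object" feature at row 2, col 3
shifted = np.roll(features, shift=(3, 2), axis=(0, 1))

W1, W2, w = rng.normal(size=(4, 6)), rng.normal(size=(6, 3)), np.ones(4)
same_output = np.allclose(global_head(features, W1, W2), global_head(shifted, W1, W2))
print("global head output unchanged by shift:", same_output)
print("heatmap peak:", heatmap_head(features, w), "->", heatmap_head(shifted, w))
```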

SVD + PARA — 92%

11/12 success — PARA reads actions in pixel space

SVD + Global Regression — 0%

0/12 success — global head cannot learn precise positions

Point-Track Pretraining

Pretrain on circle-overlay data (robot hidden, orange circle marks EEF), then fine-tune on real robot demos. PARA’s pixel-aligned head transfers pretraining effectively; ACT’s global head does not.
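One way such circle-overlay frames could be produced is sketched below, assuming a pinhole camera and a known end-effector position in the camera frame. The function names, the radius, and the exact orange value are hypothetical, and the sketch covers only the circle drawing, not how the robot is hidden.

```python
import numpy as np

ORANGE = np.array([255, 165, 0], dtype=np.uint8)  # assumed RGB value

def project_point(p_cam, K):
    """Pinhole projection of a 3D point in the camera frame to pixel (u, v)."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return u, v

def draw_circle_overlay(image, u, v, radius=6):
    """Return a copy of an (H, W, 3) uint8 image with a filled orange disk at (u, v)."""
    H, W, _ = image.shape
    rows, cols = np.mgrid[0:H, 0:W]
    mask = (cols - u) ** 2 + (rows - v) ** 2 <= radius ** 2
    out = image.copy()
    out[mask] = ORANGE
    return out
```

The projected pixel (u, v) can also serve as the supervision target for the pixel-aligned head, so the overlay and the label come from the same projection.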

Point-track pretraining — circle overlay training data + fine-tuning results