
Pixel-Aligned Robot Actions

Predicting actions in pixel space rather than coordinate space gives you spatial robustness for free—and makes video models natural policy backbones.

Spatial extrapolation: 54% vs 1%
Viewpoint generalization: 61% vs 24%
Video backbone + PARA: 92%
Experiment                    OOD Axis           PARA   ACT   Delta
Left → Right extrapolation    Object position    54%    1%    +53%
Near → Far extrapolation      Object position    46%    7%    +39%
Default → All viewpoints      Camera viewpoint   61%    24%   +37%
Left → Right hemisphere       Camera viewpoint   40%    10%   +30%
N=32 corner scaling           Data efficiency    54%    33%   +21%
Distractor robustness         Visual clutter     28%    10%   +18%

Zero-shot viewpoint generalization: trained at default camera only, evaluated across 64 viewpoints.
Green = success, Red = failure.

PARA — 61%
ACT — 24%

How PARA Works

PARA decomposes end-effector action prediction into pixel localization and height estimation—two steps that are naturally equivariant to spatial and viewpoint changes.

Standard policies (e.g., ACT) regress actions from a global CLS token, forcing the model to implicitly solve correspondence, geometry, and control in an unstructured output space.

PARA decomposes this into three steps:

  1. Where in the image? Predict a dense 2D heatmap over pixels per timestep. Argmax gives (u, v).
  2. How high? Per-pixel logits over height bins along the camera ray. Argmax gives world-frame Z.
  3. 3D recovery. Given (u, v) and height, recover the 3D end-effector position via camera intrinsics.

Gripper open/close and rotation are predicted by indexing the feature map at the predicted pixel location.
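The 3D recovery in step 3 reduces to intersecting a camera ray with a horizontal plane. A minimal numpy sketch (illustrative only, not the released code; `K`, `R`, `t` are an assumed pinhole intrinsics matrix and world-to-camera pose):

```python
import numpy as np

# Sketch of step 3 (illustrative assumptions, not the released code):
# K = pinhole intrinsics, (R, t) = world-to-camera extrinsics.

def recover_3d(u, v, z_world, K, R, t):
    """Intersect the camera ray through pixel (u, v) with the plane z = z_world."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction, camera frame
    c_world = -R.T @ t                                # camera center in world frame
    d_world = R.T @ d_cam                             # ray direction in world frame
    s = (z_world - c_world[2]) / d_world[2]           # ray parameter hitting the plane
    return c_world + s * d_world

# Round trip: project a known point, then recover it from (u, v) + height.
K = np.array([[400.0, 0, 224], [0, 400.0, 224], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])
p = np.array([0.10, 0.20, 0.30])
x_cam = R @ p + t
u, v = (K @ x_cam)[:2] / x_cam[2]
assert np.allclose(recover_3d(u, v, p[2], K, R, t), p)
```

The round trip at the end checks the geometry: projecting a known 3D point and then back-projecting its pixel at the known height returns the original point exactly.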

Image → Heatmap → (u,v) + Height → 3D
DINOv2 ViT-S/16 backbone · 1×1 conv heads · 448×448 input
12 timesteps · 32 height bins · cross-entropy loss
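The heads in the spec above can be sketched in plain numpy, since a 1×1 conv over a D-channel feature map is just a per-pixel matrix multiply. Shapes follow the stated setup (ViT-S/16 at 448×448, 12 timesteps, 32 height bins); the weight names are illustrative, not the released code:

```python
import numpy as np

# Sketch of the PARA output heads (shapes from the spec above; weights
# are illustrative random stand-ins, not the released code).

rng = np.random.default_rng(0)
D, H, W = 384, 28, 28      # ViT-S/16 features at 448x448 input (448 / 16 = 28)
T, B = 12, 32              # timesteps and height bins

feats = rng.standard_normal((D, H, W))
w_heat = 0.01 * rng.standard_normal((T, D))        # heatmap head (1x1 conv)
w_height = 0.01 * rng.standard_normal((T * B, D))  # height head (1x1 conv)

heatmaps = np.einsum('td,dhw->thw', w_heat, feats)                      # (T, H, W)
height = np.einsum('cd,dhw->chw', w_height, feats).reshape(T, B, H, W)  # (T, B, H, W)

# Decode timestep 0: spatial argmax -> (u, v); height-bin argmax at that pixel.
v, u = np.unravel_index(heatmaps[0].argmax(), (H, W))
z_bin = height[0, :, v, u].argmax()
```

At training time the spec above applies a cross-entropy loss over the H×W pixel logits and over the B height bins; at test time the two argmaxes give (u, v) and the height.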
Key insight: Because localization happens in pixel space, it is naturally equivariant to object translation and camera viewpoint changes. The model doesn’t need to see every position or viewpoint—it just needs to find the object in the image.

Heatmap Predictions: Early vs Late Training

200 steps

Diffuse, uncertain heatmaps early in training

2000 steps

Sharp, peaked predictions after convergence

Spatial Extrapolation

Trained on one half of the workspace, tested on the other. PARA generalizes to unseen positions; ACT memorizes training locations.

Left → Right Position Extrapolation

54% vs 1% — 128 demos, trained left, tested right

Left-right distribution

Green = train (left half), Blue = test (right half)

Failure mode contrast: PARA reaches the correct object but sometimes fails the grasp (gripper timing). ACT reaches to memorized training positions—completely wrong locations.
PARA: Success

Far-right position — never seen, still succeeds

ACT: Failure

Same position — ACT reaches to memorized training location

Near → Far Position Extrapolation

46% vs 7% — trained near, tested far

Near-far distribution

Green = train (near half), Blue = test (far half)

PARA: Success
ACT: Failure

Viewpoint Robustness

Trained at the default camera. Evaluated across 64 viewpoints up to 25° off-axis. PARA holds steady; ACT collapses.

Per-Theta Breakdown

PARA holds ~62% through 18° and then degrades gracefully; ACT drops from 67% to 0% by 18°.

Zero-Shot: Default → All Viewpoints

61% vs 24%

Polar overview

Green = single train viewpoint
Blue = 64 test viewpoints

PARA — 61%
ACT — 24%

Viewpoint Hemisphere: Left → Right

40% vs 10% — trained left hemisphere, tested right

Hemisphere polar

Green = left hemi train
Blue = right hemi test

Frame comparison

Train vs test viewpoint frames

PARA
ACT
Why does this work? PARA predicts actions in pixel space. When the camera moves, the object moves in the image—but the model just follows it. ACT must learn a mapping from every viewpoint to absolute coordinates, which doesn’t generalize.
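A toy illustration of this equivariance argument (not from the paper; brute-force template matching stands in for the learned heatmap): when the object shifts in the image, a pixel-space localizer's prediction shifts with it, with no retraining.

```python
import numpy as np

# Toy equivariance demo (illustrative, not from the paper): a pixel-space
# localizer's prediction moves with the object. Brute-force template
# matching stands in for the learned heatmap.

def localize(img, template):
    """Return the (u, v) with the highest template-match score."""
    th, tw = template.shape
    best, best_uv = -np.inf, None
    for v in range(img.shape[0] - th + 1):
        for u in range(img.shape[1] - tw + 1):
            score = float((img[v:v + th, u:u + tw] * template).sum())
            if score > best:
                best, best_uv = score, (u, v)
    return best_uv

obj = np.ones((3, 3))
img = np.zeros((32, 32))
img[10:13, 5:8] = 1.0                    # object at (u=5, v=10)
moved = np.roll(img, 9, axis=1)          # object (or camera) shifts 9 px right

assert localize(img, obj) == (5, 10)
assert localize(moved, obj) == (14, 10)  # prediction shifts by the same 9 px
```

A global coordinate regressor has no such guarantee: the mapping from a shifted image to absolute coordinates is a new function it never saw during training.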

Video Models as Policy Backbones

Video diffusion models predict future pixel states. PARA reads off actions in that same pixel-aligned space, requiring far fewer demonstrations.

Two-stage recipe: (1) pretrain video diffusion UNet for 4K steps, (2) jointly fine-tune UNet + PARA heads for 3K steps.

Model                                                 Steps      Success
Joint from scratch                                    10K        55%
Frozen UNet + PARA only                               4K + 12K   0%
Two-stage: video 4K → joint 3K (PARA)                 7K         92%
Two-stage: video 4K → joint 2K (Global Regression)    6K         0%
PARA’s spatial inductive bias is critical. Global regression (same video features, avg pool + MLP) scores 0% vs PARA’s 92%. Freezing the backbone also fails—joint co-adaptation of UNet features and PARA heads is essential. Video pretraining provides the right initialization.
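The pooling gap is easy to reproduce in miniature (an illustrative numpy example, not the paper's ablation code): average pooling erases exactly the spatial information a position-dependent task needs, while a spatial argmax preserves it.

```python
import numpy as np

# Illustrative example (not the paper's ablation code): average pooling
# erases position; a spatial argmax keeps it.

feat_left = np.zeros((8, 8))
feat_left[4, 1] = 1.0    # object activation on the left
feat_right = np.zeros((8, 8))
feat_right[4, 6] = 1.0   # identical activation on the right

# A global head sees identical pooled features for both positions...
assert feat_left.mean() == feat_right.mean()

# ...while a heatmap argmax still tells them apart.
uv_left = np.unravel_index(feat_left.argmax(), feat_left.shape)
uv_right = np.unravel_index(feat_right.argmax(), feat_right.shape)
assert uv_left == (4, 1) and uv_right == (4, 6)
```

No MLP on top of the pooled vector can recover what pooling already discarded, which is consistent with the 0% score of the global-regression head above.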

Generated vs Ground-Truth Frames

Side-by-side comparison at frames 0 and 2.

PARA Heads vs Global Regression — Rollout Grids

4×3 grid of evaluation rollouts. Same video backbone, same two-stage training. Only difference: PARA spatial heatmap vs avg pool + MLP.

SVD + PARA — 92%

11/12 success — PARA reads actions in pixel space

SVD + Global Regression — 0%

0/12 success — global head cannot learn precise positions

Video ↔ Action Alignment

Temporal alignment between generated video frames and PARA heatmap predictions

More Results

Distractor Robustness

28% vs 10% — trained clean, tested with visual clutter

Clean vs cluttered

Clean training → cluttered test

PARA: Success
ACT: Failure

Viewpoint: Inner/Outer Hemisphere

Inner/outer polar
PARA
ACT

Data Efficiency: N=32 Corner Scaling

54% vs 33%

N=32 distribution

N=32 training demo distribution

Training Data Samples

Sample demonstrations from the LIBERO simulation benchmark.

Demo 0

Demo 128

Demo 255

Experimental Setup

Both PARA and ACT use a DINOv2 ViT-S/16 backbone. All experiments use teleport servo evaluation, zero rotation, clean scenes, and ~10 minutes of training. Object position dataset: 16×16 grid (256 demos) across a 39 cm × 60 cm workspace. Viewpoint dataset: 8×8 viewpoint grid (θmax=25°) × 10 demos per viewpoint.