This report compares PARA (pixel-aligned heatmap) with ACT (CLS-token regression) on out-of-distribution generalization across object positions, camera viewpoints, and visual distractors. Both models use a DINOv2 ViT-S/16 backbone and are evaluated with teleport servo, zero rotation, a clean scene, and 10 minutes of training.
Summary Results
| Experiment | OOD Axis | PARA | ACT | Delta |
|---|---|---|---|---|
| Left → Right position extrapolation | Object position | 54% | 1% | +53% |
| Near → Far position extrapolation | Object position | 46% | 7% | +39% |
| Default → All viewpoints (zero-shot) | Camera viewpoint | 61% | 24% | +37% |
| Distractor robustness | Visual clutter | 60% | 40% | +20% |
Stage indicators in eval videos: ✔ PLACE = full success (green), ↑ GRASP = bowl lifted but not placed (yellow), ✘ MISS = didn't grasp (red).
Training Data
Object position dataset: 16×16 grid (256 trajectories) across a 39cm × 60cm workspace. Viewpoint dataset: 8×8 viewpoint grid (θ_max=25°) with random positions. Natural-start servo replay, clean scene, 448×448 RGB.
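The position grid above can be sketched in a few lines. This is an illustrative reconstruction, not the experiment's actual data-generation code; the function name, workspace origin, and the left/right split threshold are assumptions:

```python
import numpy as np

def position_grid(nx=16, ny=16, width=0.39, depth=0.60):
    """16x16 grid of (x, y) object placements over a 0.39 m x 0.60 m workspace."""
    xs = np.linspace(0.0, width, nx)
    ys = np.linspace(0.0, depth, ny)
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    return np.stack([gx.ravel(), gy.ravel()], axis=1)  # (256, 2) positions

grid = position_grid()
# e.g. the "left half" train split used in experiment 1 (assumed split rule)
left_half = grid[grid[:, 0] <= 0.39 / 2]  # 128 of the 256 positions
```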
1. Left → Right Position Extrapolation (PARA 54% vs ACT 1%)
Train on the left half of the position grid (128 demos), test on the right half. PARA's pixel-aligned heatmap follows the object into unseen positions; ACT's absolute-coordinate regression cannot produce targets outside the coordinate range seen in training.
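A toy 1D sketch of why the two formulations behave differently (illustrative only; these are not the PARA/ACT models): a template-correlation "heatmap" is translation-equivariant, so its argmax tracks the object at positions never seen in training, whereas a regressor fit only on left-half coordinates has no mechanism for emitting unseen ones.

```python
import numpy as np

def render(p, n=64):
    """1D 'image' with a 3-pixel object blob centered at index p (toy setup)."""
    img = np.zeros(n)
    img[p - 1 : p + 2] = [0.5, 1.0, 0.5]
    return img

template = render(32)[31:34]  # the blob itself, used as a matching template

def heatmap_localize(img):
    # Translation-equivariant: correlate with the template, take the argmax.
    scores = np.correlate(img, template, mode="same")
    return int(np.argmax(scores))

# The heatmap localizer finds the object even at far-right positions (50, 60)
# that a left-half-trained coordinate regressor would never have observed.
for p in [5, 20, 50, 60]:
    assert heatmap_localize(render(p)) == p
```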

Green = 128 training positions (left half). Blue = 20 test positions (right half).
PARA — 5×5 OOD test positions. Green = placed, yellow = grasped, red = missed.
ACT — same positions. Almost all red (miss).
2. Near → Far Position Extrapolation (PARA 46% vs ACT 7%)
Train on the near half of the grid (closer to the robot), test on the far half. This tests extrapolation along the depth axis.

Green = 128 training positions (near half). Blue = 20 test positions (far half).
PARA — 5×5 OOD test positions across the far half.
ACT — same positions. Mostly misses; some grasps, but almost no placements.
3. Zero-Shot Viewpoint Generalization (PARA 61% vs ACT 24%)
Both models are trained at the default camera viewpoint (θ=0°) with 64 diverse object positions, then evaluated zero-shot across the full 8×8 viewpoint grid with random positions. PARA holds ~62% from θ=7.1° through θ=17.9°; ACT collapses to 0% at θ=17.9° and beyond.
| θ | 0° (train) | 3.6° | 7.1° | 10.7° | 14.3° | 17.9° | 21.4° | 25° |
|---|---|---|---|---|---|---|---|---|
| PARA | 88% | 79% | 62% | 63% | 62% | 62% | 33% | 38% |
| ACT | 67% | 54% | 42% | 17% | 12% | 0% | 0% | 0% |

Polar plot: single green train dot at θ=0°, blue test dots across full 8×8 grid.
PARA — 5×5 viewpoint eval grid. Rows = θ, columns = φ.
ACT — same viewpoint grid. Mostly failures at non-default viewpoints.
4. Distractor Robustness (PARA 60% vs ACT 40%)
Models trained on clean scenes (N=64 positions), tested at default LIBERO position with distractor objects and furniture present.

Left: clean training scenes. Right: cluttered test scenes with distractors and furniture.
PARA — default position with distractors + furniture
ACT — same scene with distractors
Analysis
PARA's pixel-aligned formulation provides three key advantages:

1. Position extrapolation: PARA's heatmap follows the object wherever it appears (54% vs 1% left→right, 46% vs 7% near→far). ACT memorizes absolute coordinates and cannot extrapolate.
2. Viewpoint robustness: PARA predicts in image space and then recovers 3D via camera geometry. When the camera moves, the heatmap shifts with the visual appearance (61% vs 24% zero-shot; ACT drops to 0% beyond θ=14.3°).
3. Distractor robustness: PARA's local pixel predictions are less disrupted by irrelevant visual clutter (60% vs 40%).

ACT's advantage: with full coverage (all positions/viewpoints in training), ACT matches or exceeds PARA; global regression is effective when training covers the test distribution.
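The "predict in image space, then recover 3D via camera geometry" step can be sketched as a soft-argmax over the heatmap followed by pinhole backprojection. This is a minimal sketch under assumed intrinsics and a known depth; the pipeline's actual 3D recovery may differ:

```python
import numpy as np

def soft_argmax(heatmap):
    """Expected (u, v) pixel under the softmax distribution of the heatmap."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())
    p /= p.sum()
    vs, us = np.mgrid[0:h, 0:w]
    return float((p * us).sum()), float((p * vs).sum())

def backproject(u, v, depth, K):
    """Pinhole model: pixel + depth -> 3D point in the camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

# Illustrative intrinsics for a 448x448 image (assumed values).
K = np.array([[400.0, 0.0, 224.0],
              [0.0, 400.0, 224.0],
              [0.0, 0.0, 1.0]])
heat = np.full((448, 448), -10.0)
heat[100, 300] = 10.0                 # sharp peak at pixel (u=300, v=100)
u, v = soft_argmax(heat)
point = backproject(u, v, depth=0.8, K=K)
```

Because the prediction lives in pixel space, a camera shift that moves the object's image moves the heatmap peak with it, and the geometry step converts the new pixel back to a consistent 3D target.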