OOD Generalization — PARA vs ACT

backbones 2026-04-05

Compared PARA (pixel-aligned heatmap) vs ACT (CLS-token regression) on out-of-distribution generalization across object positions, camera viewpoints, and visual distractors. Both models use a DINOv2 ViT-S/16 backbone, teleport servo evaluation, zero rotation, clean scene, 10 min training.

Summary Results

Key experiments — largest PARA vs ACT deltas
| Experiment | OOD Axis | PARA | ACT | Delta |
|---|---|---|---|---|
| Left → Right position extrapolation | Object position | 54% | 1% | +53% |
| Near → Far position extrapolation | Object position | 46% | 7% | +39% |
| Default → All viewpoints (zero-shot) | Camera viewpoint | 61% | 24% | +37% |
| Distractor robustness | Visual clutter | 60% | 40% | +20% |

Stage indicators in eval videos: ✔ PLACE = full success (green), ↑ GRASP = bowl lifted but not placed (yellow), ✘ MISS = didn't grasp (red).

Training Data

Object position dataset: 16×16 grid (256 trajectories) across a 39cm × 60cm workspace. Viewpoint dataset: 8×8 viewpoint grid (θ_max=25°) with random positions. Natural-start servo replay, clean scene, 448×448 RGB.
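The position grid is simple to reconstruct; a sketch assuming a workspace-corner origin and meter units (both assumptions, the notes only state the grid size and workspace extent):

```python
import numpy as np

# Hypothetical reconstruction of the position grid: a 16x16 grid of object
# positions over the 39 cm x 60 cm workspace. Origin and axis conventions
# are assumptions, not taken from the notes.
xs = np.linspace(0.0, 0.39, 16)  # workspace x extent, meters
ys = np.linspace(0.0, 0.60, 16)  # workspace y extent, meters
grid = [(x, y) for x in xs for y in ys]

assert len(grid) == 256  # one demo trajectory per grid cell
```

The left/right and near/far splits in the experiments below then just partition this grid along one axis.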

1. Left → Right Position Extrapolation (PARA 54% vs ACT 1%)

Train on left half of position grid (128 demos), test on right half. PARA's pixel-aligned heatmap follows the object. ACT's absolute coordinate regression cannot extrapolate.
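A toy numpy sketch of the failure mode (not the real PARA/ACT heads: one-hot "images" and ridge regression stand in for the learned models). A regressor trained only on left-half positions learns near-zero weights for right-half pixels, so its prediction collapses on OOD inputs, while an argmax over per-pixel scores follows the object anywhere:

```python
import numpy as np

# Toy stand-ins: the scene is a one-hot image with a single bright object
# pixel; "ACT" is ridge regression from pixels to (x, y); "PARA" is an
# argmax over per-pixel scores (translation-equivariant by construction).
H, W = 16, 16

def scene(x, y):
    img = np.zeros((H, W))
    img[y, x] = 1.0  # bright object pixel
    return img.ravel()

# Training data: object anywhere in the left half (x < 8).
train = [(x, y) for x in range(8) for y in range(H)]
X = np.stack([scene(x, y) for x, y in train])
Y = np.array([[x, y] for x, y in train], dtype=float)

# "ACT-like" head: ridge regression to absolute coordinates.
lam = 1e-3
Wmat = np.linalg.solve(X.T @ X + lam * np.eye(H * W), X.T @ Y)

# "PARA-like" head: per-pixel score + argmax.
def para_predict(flat):
    idx = np.argmax(flat)
    return idx % W, idx // W

# OOD test: object on the right half, never seen in training.
x_test, y_test = 13, 5
flat = scene(x_test, y_test)

act_pred = flat @ Wmat          # ~(0, 0): unseen pixels got zero weight
para_pred = para_predict(flat)  # (13, 5): the peak follows the object
```

This is a caricature of the real networks, but it captures the structural difference: the regression head has no mechanism tying its output to image coordinates, while the heatmap head inherits translation behavior from its per-pixel parameterization.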

Green = 128 training positions (left half). Blue = 20 test positions (right half).

PARA — 5×5 OOD test positions. Green = placed, yellow = grasped, red = missed.

ACT — same positions. Almost all red (miss).

2. Near → Far Position Extrapolation (PARA 46% vs ACT 7%)

Train on near half (closer to robot), test on far half. Tests depth-axis extrapolation.

Green = 128 training positions (near half). Blue = 20 test positions (far half).

PARA — 5×5 OOD test positions across the far half.

ACT — same positions. Mostly misses, some grasps but almost no placements.

3. Zero-Shot Viewpoint Generalization (PARA 61% vs ACT 24%)

Both models trained at the default camera viewpoint (θ=0°) with 64 diverse object positions. Evaluated across the full 8×8 viewpoint grid with random positions. PARA maintains ~62% through θ=17.9°; ACT collapses to 0% beyond θ=14.3°.

Per-θ breakdown (averaged over all φ)
| θ | 0° (train) | 3.6° | 7.1° | 10.7° | 14.3° | 17.9° | 21.4° | 25° |
|---|---|---|---|---|---|---|---|---|
| PARA | 88% | 79% | 62% | 63% | 62% | 62% | 33% | 38% |
| ACT | 67% | 54% | 42% | 17% | 12% | 0% | 0% | 0% |

Polar plot: single green train dot at θ=0°, blue test dots across full 8×8 grid.

PARA — 5×5 viewpoint eval grid. Rows = θ, columns = φ.

ACT — same viewpoint grid. Mostly failures at non-default viewpoints.

4. Distractor Robustness (PARA 60% vs ACT 40%)

Models trained on clean scenes (N=64 positions), tested at default LIBERO position with distractor objects and furniture present.

Left: clean training scenes. Right: cluttered test scenes with distractors and furniture.

PARA — default position with distractors + furniture

ACT — same scene with distractors

Analysis

PARA's pixel-aligned formulation provides three key advantages:

1. Position extrapolation: PARA's heatmap follows the object wherever it appears (54% vs 1% left→right, 46% vs 7% near→far). ACT memorizes absolute coordinates and cannot extrapolate.
2. Viewpoint robustness: PARA predicts in image space, then recovers 3D via camera geometry. When the camera moves, the heatmap shifts with the visual appearance (61% vs 24% zero-shot; ACT drops to 0% at θ > 14.3°).
3. Distractor robustness: PARA's local pixel predictions are less disrupted by irrelevant visual clutter (60% vs 40%).

ACT's advantage: with full coverage (all positions/viewpoints in training), ACT matches or exceeds PARA — global regression is effective when training covers the test distribution.
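The "predict in image space, then recover 3D via camera geometry" step can be sketched as a standard pinhole back-projection, assuming a hypothetical intrinsics matrix K and a known depth (neither the actual calibration nor the depth source is given in these notes):

```python
import numpy as np

# Hypothetical pinhole intrinsics for a 448x448 image. fx, fy, cx, cy are
# assumptions for illustration, not the project's actual calibration.
K = np.array([[300.0,   0.0, 224.0],
              [  0.0, 300.0, 224.0],
              [  0.0,   0.0,   1.0]])

def backproject(u, v, depth, K):
    """Back-project pixel (u, v) at a known depth to a 3D point in the
    camera frame: X = depth * K^-1 @ [u, v, 1]^T."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return depth * ray

# A heatmap peak at the principal point, 0.6 m away, maps to a point
# straight along the optical axis.
p_cam = backproject(224.0, 224.0, 0.6, K)  # -> [0.0, 0.0, 0.6]
```

Because the heatmap peak moves with the object's pixel when the camera rotates, this back-projection stays valid at new viewpoints, whereas a regressor trained at θ=0° keeps emitting coordinates tied to the training viewpoint.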