OOD Generalization — PARA vs ACT

Compared PARA (pixel-aligned heatmap) vs ACT (CLS-token regression) on out-of-distribution generalization across object positions, camera viewpoints, and visual distractors. Both models use DINOv2 ViT-S/16 backbone, teleport servo evaluation, zero rotation, clean scene, 10 min training.

Summary Results

Key experiments — largest PARA vs ACT deltas
Experiment	OOD Axis	PARA	ACT	Delta
Left → Right position extrapolation	Object position	54%	1%	+53%
Near → Far position extrapolation	Object position	46%	7%	+39%
Default → All viewpoints (zero-shot)	Camera viewpoint	61%	24%	+37%
Distractor robustness	Visual clutter	60%	40%	+20%

Stage indicators in eval videos: ✔ PLACE = full success (green), ↑ GRASP = bowl lifted but not placed (yellow), ✘ MISS = didn't grasp (red).

Training Data

Object position dataset: 16×16 grid (256 trajectories) across a 39cm × 60cm workspace. Viewpoint dataset: 8×8 viewpoint grid (θ_max=25°) with random positions. Natural-start servo replay, clean scene, 448×448 RGB.

1. Left → Right Position Extrapolation (PARA 54% vs ACT 1%)

Train on left half of position grid (128 demos), test on right half. PARA's pixel-aligned heatmap follows the object. ACT's absolute coordinate regression cannot extrapolate.

Green = 128 training positions (left half). Blue = 20 test positions (right half).

PARA — 5×5 OOD test positions. Green = placed, yellow = grasped, red = missed.

ACT — same positions. Almost all red (miss).

2. Near → Far Position Extrapolation (PARA 46% vs ACT 7%)

Train on near half (closer to robot), test on far half. Tests depth-axis extrapolation.

Green = 128 training positions (near half). Blue = 20 test positions (far half).

PARA — 5×5 OOD test positions across the far half.

ACT — same positions. Mostly misses, some grasps but almost no placements.

3. Zero-Shot Viewpoint Generalization (PARA 61% vs ACT 24%)

Both models trained at the default camera viewpoint (θ=0°) with 64 diverse object positions. Evaluated across the full 8×8 viewpoint grid with random positions. PARA maintains ~62% through θ=17.9°. ACT collapses to 0% after θ=14°.

Per-θ breakdown (averaged over all φ)
θ	0° (train)	3.6°	7.1°	10.7°	14.3°	17.9°	21.4°	25°
PARA	88%	79%	62%	63%	62%	62%	33%	38%
ACT	67%	54%	42%	17%	12%	0%	0%	0%

Polar plot: single green train dot at θ=0°, blue test dots across full 8×8 grid.

PARA — 5×5 viewpoint eval grid. Rows = θ, columns = φ.

ACT — same viewpoint grid. Mostly failures at non-default viewpoints.

4. Distractor Robustness (PARA 60% vs ACT 40%)

Models trained on clean scenes (N=64 positions), tested at default LIBERO position with distractor objects and furniture present.

Left: clean training scenes. Right: cluttered test scenes with distractors and furniture.

PARA — default position with distractors + furniture

ACT — same scene with distractors

Analysis

PARA's pixel-aligned formulation provides three key advantages: 1. Position extrapolation: PARA's heatmap follows the object wherever it appears (54% vs 1% left→right, 46% vs 7% near→far). ACT memorizes absolute coordinates and cannot extrapolate. 2. Viewpoint robustness: PARA predicts in image space then recovers 3D via camera geometry. When the camera moves, the heatmap shifts with the visual appearance (61% vs 24% zero-shot, ACT drops to 0% at θ>14°). 3. Distractor robustness: PARA's local pixel predictions are less disrupted by irrelevant visual clutter (60% vs 40%). ACT's advantage: with full coverage (all positions/viewpoints in training), ACT matches or exceeds PARA — global regression is effective when training covers the test distribution.

5. Point Track Pretraining (Circle → Robot Transfer)

Can PARA be pretrained from 2D point tracking alone — no robot, no 3D, no gripper — and transfer to real robot scenes? We pretrain PARA with N_HEIGHT_BINS=1 (2D heatmap only) on scenes where the robot is completely hidden and replaced with a colored circle at the EEF position. The model learns to predict where the circle will move. Then we test: does this pretrained model produce meaningful heatmaps when shown a real robot scene it has never seen?

Training Data: Circle Overlay (No Robot)

All robot geoms hidden. An orange circle is drawn at the projected EEF pixel. 256 demos across the full position grid. The model only sees: table + objects + moving circle.

Sample training frame: orange circle marks EEF position. No robot visible at all.

Cross-Domain Heatmap Transfer

The circle-pretrained model evaluated on both training scenes (circle overlay) and unseen robot scenes (zero-shot). Top 3 rows: circle scenes (training distribution). Bottom 3 rows: real robot scenes (never seen during training). Each row shows: input + start keypoint (green dot) → predicted heatmaps for t+1 through t+4.

Circle-pretrained models evaluated on circle scenes (top half) and robot scenes (bottom half, zero-shot). PARA (heatmap rows): sharp spatial predictions on circles, diffuse but meaningful on robots. ACT (pixel dots rows): predicts trajectory points on circles, less coherent on robots. Each row shows input + 4 future timestep predictions.

Fine-tuning Results: 2D Pretraining (Heatmap Only)

After pretraining on 256 circle-overlay demos with 2D heatmap objective only (no height bins, no gripper), fine-tune on real robot demos. The pretrained backbone provides a 4× improvement over training from scratch in the low-data regime.

2D pretraining (heatmap only) vs scratch across demo counts
Demos	PARA 2D-pt	PARA scratch	ACT 2D-pt	ACT scratch
1	10%	0%	0%	0%
3	20%	7%	0%	0%
5	33%	7%	0%	0%
10	39%	10%	18%	6%
20	31%	29%	24%	44%
256	83% (upper bound)		—

PARA 2D-pretrained dominates in the very low-data regime (1-5 demos): 10-33% success while everything else gets 0-7%. The 2D heatmap pretraining objective transfers directly to the 2D component of the volume head, so the model only needs to learn height prediction from robot data.

Fine-tuning Results: 3D Pretraining (Full Model)

Can we do better by pretraining with the full 3D objective (same architecture, same heads, same losses — only the visual input differs)? We pretrain PARA and ACT on circle-overlay data with full 3D supervision (volume + height + gripper for PARA, position + gripper for ACT), then fine-tune on real robot demos. This gives direct weight transfer — every layer is initialized from pretraining.

Full comparison: 3D pretraining vs 2D pretraining vs scratch (binary success = full place)
Demos	PARA 3D-pt	PARA 2D-pt	PARA scratch	ACT 3D-pt	ACT 2D-pt	ACT scratch
1	0%	10%	0%	0%	0%	0%
3	0%	20%	7%	0%	0%	0%
5	21%	33%	7%	0%	0%	0%
10	66%	39%	10%	21%	18%	6%
20	64%	31%	29%	72%	24%	44%
256	83% (upper bound)			—

Multistage scoring: place=100%, grasp=50%, miss=0% — 2D vs 3D pretraining
Demos	PARA 2D-pt			PARA 3D-pt			ACT 2D-pt			ACT 3D-pt
	P	G	Wtd	P	G	Wtd	P	G	Wtd	P	G	Wtd
1	3	18	12%	0	0	0%	0	0	0%	0	0	0%
3	5	3	6%	0	3	2%	1	7	4%	0	0	0%
5	15	41	36%	21	42	42%	2	3	4%	0	0	0%
10	3	62	34%	66	3	68%	24	31	40%	21	16	29%
20	39	27	52%	64	1	64%	50	32	66%	72	9	76%

Two pretraining regimes emerge:
1. Very low data (1-3 demos): 2D pretraining wins. PARA 2D-pt grasps with just 1 demo (18 grasps, 12% weighted) while 3D-pt gets 0%. The simpler heatmap-only objective transfers more easily — the model only needs to learn height prediction from the few robot demos.
2. Moderate data (5-20 demos): 3D pretraining dominates. PARA 3D-pt reaches 68% weighted at 10 demos vs 34% for 2D-pt. ACT 3D-pt reaches 76% weighted at 20 demos — the best overall result.

The crossover is at 5 demos (PARA 3D-pt 42% vs 2D-pt 36% weighted).

Critical insight — 3D pretraining converts grasps into placements: At 10 demos, PARA 2D-pt grasps 65% of the time (3 places + 62 grasps) but almost never completes placement (3% binary). PARA 3D-pt grasps at a similar rate (69%) but converts nearly all grasps to full placements (66% binary). The 3D pretraining teaches the height/placement skills that 2D pretraining leaves to be learned from scratch. This is the key benefit: both pretraining methods learn the 2D spatial reasoning needed to reach the object, but only 3D pretraining provides the depth and gripper skills needed to complete the manipulation.