(a) Architecture Comparison
Global Regression (ACT): Image → DINO → CLS token → MLP → (x, y, z)
PARA (ours): Image → DINO → Heatmap volume → Argmax + height → 3D Point
CLS collapses spatial structure. PARA preserves it.
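The "Heatmap volume → Argmax + height" step of the PARA path can be illustrated with a minimal sketch. The figure does not specify the volume layout or how height is read out, so this sketch assumes a volume of shape (height bins, rows, columns) and a joint argmax over all three axes. The function name `decode_heatmap_volume` and the `height_bins` argument are hypothetical. Lifting the result to a metric 3D point would additionally need camera calibration, which is omitted here.

```python
import numpy as np

def decode_heatmap_volume(volume, height_bins):
    """Return (u, v, height) for the highest-scoring cell.

    ASSUMPTION: volume has shape (Z, H, W) = (height bins, rows, cols);
    the layout is not given in the figure.
    """
    volume = np.asarray(volume)
    # Flat argmax over the whole volume, then recover the 3 indices.
    z_idx, v, u = np.unravel_index(np.argmax(volume), volume.shape)
    # Pixel column, pixel row, and the height value of the winning bin.
    return int(u), int(v), float(height_bins[z_idx])

# Toy volume with a single peak at height bin 2, row 5, column 3.
volume = np.zeros((4, 8, 8))
volume[2, 5, 3] = 1.0
heights = np.linspace(0.0, 0.3, 4)  # hypothetical bin heights
print(decode_heatmap_volume(volume, heights))
```

Unlike a regression from a single CLS token, the prediction here is an index into a spatial grid, so the output stays tied to an image location.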
(b) Where PARA Helps
◎ OOD Generalization: Shifted camera & object positions
▶ Video Backbone: SVD video diffusion → robot actions
⚑ Cross-Embodiment: Arm-deleted point track pretraining
(c) Headline Results
97% vs 9%: Real Robot · 20 demos · 3 tasks
90% vs 0%: SVD Video Backbone
TBD: Point Track Pretraining