[Figure: overview in three panels.]
(a) Architecture Comparison. Global Regression (ACT): Image → DINO → CLS token → MLP → (x, y, z). PARA (ours): Image → DINO → Heatmap volume → Argmax + height → 3D Point. CLS collapses spatial structure; PARA preserves it.
(b) Where PARA Helps. OOD Generalization: shifted camera and object positions. Video Backbone: SVD video diffusion → robot actions. Cross-Embodiment: arm-deleted point track pretraining.
(c) Headline Results. Real Robot (20 demos, 3 tasks): 97% vs 9%. SVD Video Backbone: 90% vs 0%. Point Track Pretraining: TBD.