Single-scene closed-loop — 3 model variants

All three models start from the same init state (libero_spatial task 0, demo 0). Each ran 1 episode with max_steps=600 and the flags --teleport --zero_rotation. All three hit max_steps without success (0% SR).

| Model | val_px_err | SR | pred jerk (px) | Video |
|---|---|---|---|---|
| (A) 2D AR | 8.0 (3 ep) | 0% | 1.89 | ▶ play A |
| (B) Voxel + abs xyz | 12.4 (2 ep pilot) | 0% | 0.25 | ▶ play B |
| (C) Voxel + EEF-rel xyz | 13.3 (2 ep pilot) | 0% | 0.06 | ▶ play C |
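
For a quick comparison of the pred jerk column, the ratio of model A's value to each voxel model's value can be computed directly. This is a minimal sketch: the numbers are copied from the table, while the dictionary and the helper name `jerk_ratio_vs` are illustrative and not part of the evaluation code.

```python
# pred jerk (px) per model, copied from the results table.
pred_jerk_px = {"A": 1.89, "B": 0.25, "C": 0.06}


def jerk_ratio_vs(baseline, table):
    """Return baseline_jerk / model_jerk for every non-baseline model."""
    base = table[baseline]
    return {name: base / value for name, value in table.items() if name != baseline}


ratios = jerk_ratio_vs("A", pred_jerk_px)
for name, ratio in ratios.items():
    # Prints "A/B: 7.6x" and "A/C: 31.5x".
    print(f"A/{name}: {ratio:.1f}x")
```

A ratio above 1 means the model's predictions change less from step to step than model A's.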

Video overlay: the white dot is the current EEF projected into the image, the green crosshair is the predicted next-EEF cell, and the label shows the step index and gripper sign.

Takeaways

All three runs confirm the under-training pathology: the predicted target stays in the current 8-px grid cell, action deltas are 1-2 mm/step, and nothing happens before max_steps. The voxel models are more frozen than the 2D AR model, with pred jerk about 7.6× lower for B and 31.5× lower for C. This is consistent with their being trained for 2 epochs versus 3 and, for C, with the rel-PE unit-mismatch bug I flagged.
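
One way to quantify "stays in the current grid cell" is the fraction of steps where the predicted next-EEF pixel falls in the same 8-px cell as the current EEF pixel. The sketch below assumes per-step (x, y) pixel coordinates are available as plain lists. The function names and the sample coordinates are hypothetical and synthetic, not logged values from these runs.

```python
CELL_PX = 8  # grid cell size in pixels, as stated in the takeaway


def to_cell(xy, cell_px=CELL_PX):
    """Map a pixel coordinate (x, y) to its integer grid cell index."""
    x, y = xy
    return (int(x // cell_px), int(y // cell_px))


def frozen_fraction(eef_px, pred_px, cell_px=CELL_PX):
    """Fraction of steps whose predicted cell equals the current EEF cell."""
    if len(eef_px) != len(pred_px):
        raise ValueError("eef_px and pred_px must have the same length")
    if not eef_px:
        return 0.0
    same = sum(
        to_cell(cur, cell_px) == to_cell(pred, cell_px)
        for cur, pred in zip(eef_px, pred_px)
    )
    return same / len(eef_px)


# Synthetic example data (NOT from the actual rollouts).
eef_px = [(100.0, 60.0), (100.5, 60.2), (101.0, 60.1), (103.9, 63.0)]
pred_px = [(101.0, 61.0), (102.0, 62.0), (104.5, 60.0), (112.0, 64.0)]

print(frozen_fraction(eef_px, pred_px))  # 0.5 for this synthetic data
```

A value near 1.0 over a full episode would match the frozen behaviour described above, and a lower value would indicate that the predicted target does leave the current cell.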