SVD Video Policy + PARA Joint Training — OOD Object Positions

vid_model 2026-04-02 15:55 dd855fd

Joint training of SVD video diffusion model + PARA action heads on LIBERO ood_objpos_task0. SVD generates video (7 frames at 576x320) while PARA heads predict pixel-aligned actions from UNet intermediate features (up_block_1 + up_block_2). Separate learning rates: UNet 1e-6 (preserves video quality), PARA heads 1e-4. Eval with teleport + zero rotation + clean scene. Best result: 18/20 (90%) with two-stage training (4K video pretrain → 3K joint) on ood_objpos_v3. Joint-from-scratch: 11/20 (55%) at 10K steps. Frozen UNet: 0/20 (0%). Global regression baseline (same features, avg pool + MLP): 0/20 — PARA's spatial inductive bias is critical for data-efficient video-to-action.

Training Data

Dataset: /data/libero/ood_objpos_task0/libero_spatial/task_0 — 256 demos, 32 frames each. Clean scene (no distractors/furniture). Frame stride 1 (already subsampled at stride 3 from original). Video: 7 frames at 576x320. PARA: 448x448 with N_HEIGHT_BINS=32, 64x64 heatmap grid. EMA-normalized losses: volume CE + gripper CE + diffusion MSE.
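The EMA-normalized loss combination can be sketched as follows. This is a minimal illustration of one common scheme (divide each term by an exponential moving average of its own magnitude so volume CE, gripper CE, and diffusion MSE contribute on comparable scales); the class name and exact normalization in the training code are assumptions.

```python
import torch

class EMALossNormalizer:
    """Track an EMA of each loss term's magnitude and divide by it, so that
    terms with very different scales (e.g. CE vs. MSE) contribute comparably.
    Illustrative sketch; the actual training code may normalize differently."""

    def __init__(self, names, beta=0.99, eps=1e-8):
        self.beta, self.eps = beta, eps
        self.ema = {n: None for n in names}

    def __call__(self, losses):
        total = 0.0
        for name, loss in losses.items():
            val = loss.detach().abs()          # EMA tracks magnitude, no grad
            if self.ema[name] is None:
                self.ema[name] = val           # initialize on first step
            else:
                self.ema[name] = self.beta * self.ema[name] + (1 - self.beta) * val
            total = total + loss / (self.ema[name] + self.eps)
        return total
```

Gradients still flow through the raw loss in the numerator; only the normalizer is detached.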

Example training trajectory — ood_objpos_task0 demo 0 (clean scene, stride 3)

Example training trajectory — ood_objpos_task0 demo 10 (different object positions)

Video-PARA Alignment Verification

GT frames with GT keypoint (cyan) on left, SVD generated frames with GT keypoint overlay on right. Across multiple samples and timesteps, the keypoint tracks the EEF correctly in both GT and generated video.

Alignment check: GT (left) vs SVD generated (right) with GT keypoint overlay — 5 samples × 4 timesteps

Test Setup

Closed-loop eval in the LIBERO simulator: teleport mode (servo to the predicted 3D target), zero rotation, clean scene (no distractors). 5 episodes initially (20 for the later evals), max 600 steps per episode. Camera: agentview, with proper intrinsics from get_camera_intrinsic_matrix. Initial model: checkpoint-2000 from joint training (separate LRs: UNet 1e-6, PARA 1e-4).
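The teleport controller servos to a 3D target decoded from the PARA outputs. A minimal sketch of that decoding under the "pixel prediction + height bin → 3D via camera" decomposition, assuming a standard pinhole back-projection with known extrinsics (function and argument names are illustrative, not the eval code):

```python
import numpy as np

def decode_para_action(heatmap, height_logits, K, T_world_cam,
                       img_size=(448, 448), z_min=0.0, z_max=1.0):
    """Decode a PARA prediction into a 3D world target (illustrative sketch).

    heatmap       : (64, 64) softmax probabilities over the spatial grid
    height_logits : (N_HEIGHT_BINS,) logits over discretized world height
    K             : (3, 3) intrinsics, e.g. from get_camera_intrinsic_matrix
    T_world_cam   : (4, 4) camera-to-world extrinsics (assumed available)
    """
    h, w = heatmap.shape
    img_w, img_h = img_size
    # Soft-argmax over the heatmap -> sub-pixel image coordinates.
    ys, xs = np.mgrid[0:h, 0:w]
    p = heatmap / heatmap.sum()
    u = (xs * p).sum() * (img_w / w)   # scale grid coords to image pixels
    v = (ys * p).sum() * (img_h / h)
    # Height-bin argmax -> world z at the bin center.
    n_bins = len(height_logits)
    z = z_min + (np.argmax(height_logits) + 0.5) * (z_max - z_min) / n_bins
    # Back-project the pixel into a camera ray, intersect with the plane z.
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    origin = T_world_cam[:3, 3]
    direction = T_world_cam[:3, :3] @ ray_cam
    t = (z - origin[2]) / direction[2]
    return origin + t * direction
```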

Results

Eval results on ood_objpos_task0
| Model | Training Steps | Success Rate | Notes |
| --- | --- | --- | --- |
| SVD+PARA Joint (ckpt-2000, 1-step exec) | 2000 | 0/5 (0%) | Only executing first predicted action per replan |
| SVD+PARA Joint (ckpt-2000, 7-step exec) | 2000 | 1/5 (20%) | Executing all 7 predicted actions per replan window |
| SVD+PARA Joint v3 (ckpt-10000, 7-step exec, 5ep) | 10000 | 3/5 (60%) | Retrained on ood_objpos_v3 dataset, 3 GPUs |
| SVD+PARA Joint v3 (ckpt-10000, 7-step exec, 20ep) | 10000 | 11/20 (55%) | 20-episode eval, avg success steps: 153 |
| Frozen UNet + PARA only (ckpt-12000) | 4K video + 12K PARA | 0/20 (0%) | Video pretrained, UNet frozen during PARA training |
| Two-stage: video 4K → joint 3K (PARA) | 4K video + 3K joint | 18/20 (90%) | Video pretrained on v3, then joint training with separate LRs |
| Two-stage: video 4K → joint 2K (Global Regression) | 4K video + 2K joint | 0/20 (0%) | Same setup but global avg pool + MLP instead of spatial heatmap |
Per-episode breakdown — Two-stage (video 4K → joint 3K)
| Episode | Result | Steps |
| --- | --- | --- |
| 0 | SUCCESS | 228 |
| 1 | SUCCESS | 258 |
| 2 | SUCCESS | 214 |
| 3 | SUCCESS | 407 |
| 4 | SUCCESS | 227 |
| 5 | SUCCESS | 236 |
| 6 | SUCCESS | 253 |
| 7 | FAILED | 600 |
| 8 | SUCCESS | 237 |
| 9 | SUCCESS | 256 |
| 10 | SUCCESS | 224 |
| 11 | SUCCESS | 211 |
| 12 | SUCCESS | 227 |
| 13 | SUCCESS | 254 |
| 14 | FAILED | 600 |
| 15 | SUCCESS | 249 |
| 16 | SUCCESS | 262 |
| 17 | SUCCESS | 190 |
| 18 | SUCCESS | 183 |
| 19 | SUCCESS | 252 |
Per-episode breakdown — v3 ckpt-10000 (7-step execution)
| Episode | Result | Steps |
| --- | --- | --- |
| 0 | SUCCESS | 163 |
| 1 | FAILED | 600 |
| 2 | SUCCESS | 163 |
| 3 | SUCCESS | 153 |
| 4 | FAILED | 600 |

Training Visualization (wandb)

GT frames with heatmap (left) vs SVD generated frames with heatmap (right). Cyan = GT keypoint, Red = predicted keypoint.

Step 200 — early training, heatmaps starting to focus

Step 2000 — heatmaps more concentrated on correct region

Eval Rollouts (7-step execution, all 5 episodes)

Episode 0 — FAILED (600 steps). GT after execution (left) vs Generated (right), interleaved at every timestep.

Episode 1 — FAILED (600 steps).

Episode 2 — SUCCESS in 168 steps.

Episode 3 — FAILED (600 steps).

Episode 4 — FAILED (600 steps).

Eval Rollouts — v3 dataset, ckpt-10000 (7-step execution)

Episode 0 — SUCCESS in 163 steps. GT after execution (left) vs Generated (right).

Episode 1 — FAILED (600 steps).

Episode 2 — SUCCESS in 163 steps.

Episode 3 — SUCCESS in 153 steps.

Episode 4 — FAILED (600 steps).

Two-Stage Eval — 18/20 (90%) — video 4K → joint 3K (selected rollouts)

Best approach: pretrain video model on v3 data (4K steps), then joint video+PARA training with separate LRs (3K steps). 18/20 success, avg 240 steps for successes.

Episode 18 — SUCCESS in 183 steps (fastest).

Episode 0 — SUCCESS in 228 steps.

Episode 3 — SUCCESS in 407 steps (slowest success).

Episode 7 — FAILED (600 steps).

Episode 14 — FAILED (600 steps).

4×3 Rollout Grid — SVD+PARA (11/12 = 91.7%)

12-episode grid with multistage tracking. Left panel: simulator state with PARA heatmap overlay. Right panel: SVD generated video. Green border = PLACE (success), yellow = GRASP only, red = MISS. Each episode uses ~4-5 SVD video generation queries.

SVD+PARA 4×3 rollout grid — 11 PLACE, 1 GRASP, 0 MISS (91.7% success)

4×3 Rollout Grid — SVD+Global Regression Baseline (0/12 = 0%)

Same SVD UNet features but replaces PARA spatial heatmap with global avg pool + MLP → direct (x,y,z,gripper) regression. Same two-stage training. Complete failure: the global head cannot learn precise 3D positions without spatial inductive bias.
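The two heads can be contrasted in a short sketch. Channel counts follow the 256-channel fused feature described in the Analysis section, but module names and exact layer shapes are assumptions, not the training code.

```python
import torch
import torch.nn as nn

class GlobalRegressionHead(nn.Module):
    """Baseline: average the spatial grid away, then regress (x, y, z, gripper)
    directly with an MLP. All pixel-aligned information is discarded."""

    def __init__(self, in_ch=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_ch, 256), nn.ReLU(),
                                 nn.Linear(256, 4))

    def forward(self, feats):            # feats: (B, C, 64, 64)
        pooled = feats.mean(dim=(2, 3))  # (B, C) -- spatial info collapsed
        return self.mlp(pooled)          # (B, 4)

class SpatialHeatmapHead(nn.Module):
    """PARA-style head: keep the 64x64 grid and predict a per-pixel heatmap
    plus height-bin logits, preserving the pixel-aligned inductive bias."""

    def __init__(self, in_ch=256, n_height_bins=32):
        super().__init__()
        self.heatmap = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.height = nn.Conv2d(in_ch, n_height_bins, kernel_size=1)

    def forward(self, feats):            # feats: (B, C, 64, 64)
        return self.heatmap(feats), self.height(feats)
```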

SVD+Global regression 4×3 rollout grid — 0 PLACE, 0 GRASP, 12 MISS (0% success)

20-Episode Eval — v3 ckpt-10000 (selected rollouts)

11/20 (55%) success. Successes complete in 130-174 steps. Selected videos below.

Episode 19 — SUCCESS in 130 steps (fastest).

Episode 7 — SUCCESS in 140 steps.

Episode 1 — FAILED (600 steps).

Episode 5 — FAILED (600 steps).

Debug Rollout — Full Window Execution

Closed-loop debug rollout executing all 7 predicted actions per replan window. Left: GT simulator state (updated after each action execution). Right: SVD generated frame for that timestep. Heatmap + red keypoint overlaid on both. The robot servos to each predicted 3D target sequentially through the full 7-frame window before replanning from the new observation.

Debug rollout — GT (left) vs Generated (right) with interleaved execution. 20 replans, 7 steps each. Robot moves but lacks precision for task completion at 2000 training steps.
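The full-window execution scheme above can be sketched as a replan loop. `env` and `predict_actions` are stand-ins for the actual eval code; the point is that all 7 actions from one SVD generation are executed before replanning, rather than only the first.

```python
def rollout(env, predict_actions, max_steps=600, window=7):
    """Closed-loop rollout executing the full predicted window per replan."""
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        actions = predict_actions(obs)      # one SVD generation -> 7 actions
        for a in actions[:window]:          # execute the whole window
            obs, done = env.step(a)         # teleport-servo to predicted target
            steps += 1
            if done or steps >= max_steps:
                return done, steps
    return False, steps
```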

Analysis

Key findings:
- Two-stage training is dramatically better: pretraining the video model (4K steps) then joint video+PARA training (3K steps) achieves 90%, vs 55% for joint-from-scratch at 10K steps. Total compute is also lower (7K effective steps vs 10K).
- The PARA spatial head is critical: the global regression baseline (same features, same training, avg pool + MLP) scores 0/20 vs PARA's 18/20. The spatial heatmap decomposition (pixel prediction + height bin → 3D via camera) provides a much stronger inductive bias than direct 3D regression, making PARA a far more data-efficient adapter for video-to-action.
- A frozen UNet fails completely (0%): PARA heads cannot learn useful actions from frozen video features alone. Joint co-adaptation of UNet features and PARA heads is essential.
- The video pretrain stage leaves the UNet with features already well-adapted to the target domain, so the PARA heads start joint training from a much better initialization.
- SVD video generation quality is preserved with separate LRs (UNet 1e-6, PARA 1e-4).
- PARA heatmaps learn to focus on the bowl/EEF region quickly (volume loss: 11.7 → 1-3 by step 200).
- Executing all 7 predicted actions per replan window (vs just the first) is critical for task completion.
- Architecture: up_block_1 (1280ch→128) + up_block_2 (640ch→128) → concat 256ch → 3x conv → PARA heads at 64x64.

Potential issues:
- Camera intrinsics were initially hardcoded (fixed to use the proper get_camera_intrinsic_matrix).
- NCCL P2P must be disabled on this machine for multi-GPU runs (NCCL_P2P_DISABLE=1).
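The feature adapter described in the architecture bullet can be sketched as below. Channel counts (1280→128, 640→128, concat 256, 3x conv, 64x64 grid) follow the text; the module/class names, upsampling choice, and kernel sizes are assumptions. The comment at the bottom sketches the separate-LR setup via optimizer param groups.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PARAFeatureAdapter(nn.Module):
    """Fuse two SVD UNet decoder feature maps into a 64x64 grid for the
    PARA heads. Illustrative sketch of the architecture described above."""

    def __init__(self):
        super().__init__()
        self.proj1 = nn.Conv2d(1280, 128, kernel_size=1)  # up_block_1 features
        self.proj2 = nn.Conv2d(640, 128, kernel_size=1)   # up_block_2 features
        self.fuse = nn.Sequential(                        # "3x conv" on 256ch
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
        )

    def forward(self, f1, f2):
        # Project each feature map to 128ch, resample to the 64x64 grid,
        # concatenate to 256ch, then fuse with three convs.
        f1 = F.interpolate(self.proj1(f1), size=(64, 64),
                           mode="bilinear", align_corners=False)
        f2 = F.interpolate(self.proj2(f2), size=(64, 64),
                           mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([f1, f2], dim=1))

# Separate LRs per parameter group (UNet 1e-6 preserves video quality,
# PARA heads 1e-4 learn quickly), e.g.:
# optimizer = torch.optim.AdamW([
#     {"params": unet.parameters(), "lr": 1e-6},
#     {"params": para_modules.parameters(), "lr": 1e-4},
# ])
```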

Next Steps & Concerns

Next steps:
- Train longer (10K-50K steps) and re-eval.
- Compare with frozen-backbone PARA (detach features, no joint training) as a baseline.
- Compare with DINO-backbone PARA on the same dataset for an apples-to-apples comparison.
- Sweep the separate-LR ratio (currently 100:1 PARA:UNet).
- Add rotation prediction (currently zero rotation).

Concerns:
- The 576x320 video resolution introduces aspect-ratio distortion relative to the 448x448 PARA training images.
- The SVD video model was originally trained on parsed_libero, then fine-tuned on ood_objpos — a domain gap may remain.
- The denoising noise level affects feature quality — features are currently extracted at a random training noise level, not a controlled one.

Reproducibility

# Two-stage training (best approach, 90% success)
cd /data/cameron/vidgen/svd_motion_lora/Motion-LoRA

# Stage 1: Video-only pretrain on v3 data (2hrs, 3 GPUs)
timeout 7200 accelerate launch \
  --config_file scripts/accelerate_configs/multi_gpu_3_joint.yaml \
  train_svd.py \
  --pretrained_model_name_or_path=checkpoints/stable-video-diffusion-img2vid-xt-1-1 \
  --pretrain_unet=output_libero_ood_objpos/checkpoint-31500/unet \
  --per_gpu_batch_size=4 --width=576 --height=320 \
  --dataset_path="dataset/libero_ood_objpos_v3" \
  --num_frames=7 --learning_rate=5e-5 --use_8bit_adam \
  --gradient_checkpointing --mixed_precision="bf16" \
  --output_dir=output_svd_v3_stage1

# Stage 2: Joint video+PARA training (1hr, 3 GPUs)
timeout 3600 accelerate launch \
  --config_file scripts/accelerate_configs/multi_gpu_3_joint.yaml \
  train_svd_para_joint.py \
  --pretrained=checkpoints/stable-video-diffusion-img2vid-xt-1-1 \
  --pretrain_unet=output_svd_v3_stage1/checkpoint-4000/unet \
  --cache_root=/data/libero/ood_objpos_v3 --task_ids=0 \
  --num_frames=7 --batch_size=1 --lr=5e-5 \
  --output_dir=output_svd_v3_stage2_joint

# Eval (20 episodes, clean scene, teleport, zero rotation)
CUDA_VISIBLE_DEVICES=4 python eval_joint.py \
  --checkpoint output_svd_v3_stage2_joint/checkpoint-3000 \
  --n_episodes 20 --clean_scene --max_steps 600