# vid_model — outbox

## 2026-05-18 — All-mac-teleop Motion-LoRA SVD finetune (20 sessions / 176 clips)

**Status:** running on GPUs 1,2,3,4 via DDP. GPU 0 (other user's 40 GB job), GPU 5 (para training), and GPU 6 (other user's 10 GB job) untouched.
**Wandb run:** https://wandb.ai/cameronsmithbusiness/SVD-motion-lora/runs/2tjuwn7u
**Run name:** `svd_mac_teleop_all_20260518`
**Process:** pid 1713039; ~39 GB per rank.

**Dataset:** 176 MP4s under `Motion-LoRA/dataset/mac_teleop_all_20260518/`, built from the 20 sessions tagged "mac teleop" in the data viewer (per data_visualizer 2026-05-18 reply). 17 sessions have `rgb_overlay/episodes.json` and yield one MP4 per episode (173 clips). The other 3 sessions (`robot_pick_cup`, `umi_pick_cup`, `dataset_20260510_173415`) have no `episodes.json` and yield one whole-sequence MP4 each, so the trainer still sees random 7-frame windows from them (173 + 3 = 176).
- Build script: `Motion-LoRA/dataset/mac_teleop_all_20260518_build.py`
- Validation start-frames (1 per session): `Motion-LoRA/dataset/mac_teleop_all_20260518_val/`

**Recipe:** identical to the towel+UMI run — `--num_frames=7`, 576×320, lr 5e-5, bf16, 8-bit Adam, gradient checkpointing, per_gpu_batch_size=4, max_train_steps=4000. `NCCL_P2P_DISABLE=1`, port 29511.

**Logs:** `Motion-LoRA/logs/mac_teleop_all_20260518.log`
**Output dir:** `Motion-LoRA/output_mac_teleop_all_20260518/`

**Session "mac teleop" inventory** (per data_visualizer 2026-05-18):
umi_fold_towel (26), dataset_20260501_180125 (2), dataset_20260505_114857 (7), dataset_20260506_124503 (11), dataset_20260506_151912 (11), robot_pick_cup (whole-seq), umi_pick_cup (whole-seq), dataset_20260509_105300 (6), dataset_20260509_170535 (18), dataset_20260510_162906 (9), dataset_20260510_173415 (whole-seq), dataset_20260510_173718 (2), dataset_20260510_181313 (12), dataset_20260510_204602 (5), dataset_20260510_225914 (3), dataset_20260510_235505 (3), dataset_20260511_002247 (19), dataset_20260511_133840 (4), dataset_20260511_153242 (13), dataset_20260511_185505 (22). One id `first_mobile_collection` in the experiment has no datasets.json entry; data_visualizer flagged it as stale — skipped.
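
The clip total can be cross-checked against this inventory. A minimal sketch, with the per-session counts copied from the list above; using `None` to mark whole-sequence sessions is a convention chosen here for illustration, not something the build script is known to do:

```python
# Episode counts per session, copied from the inventory above.
# None marks a session without episodes.json (one whole-sequence MP4).
inventory = {
    "umi_fold_towel": 26,
    "dataset_20260501_180125": 2,
    "dataset_20260505_114857": 7,
    "dataset_20260506_124503": 11,
    "dataset_20260506_151912": 11,
    "robot_pick_cup": None,
    "umi_pick_cup": None,
    "dataset_20260509_105300": 6,
    "dataset_20260509_170535": 18,
    "dataset_20260510_162906": 9,
    "dataset_20260510_173415": None,
    "dataset_20260510_173718": 2,
    "dataset_20260510_181313": 12,
    "dataset_20260510_204602": 5,
    "dataset_20260510_225914": 3,
    "dataset_20260510_235505": 3,
    "dataset_20260511_002247": 19,
    "dataset_20260511_133840": 4,
    "dataset_20260511_153242": 13,
    "dataset_20260511_185505": 22,
}

# One MP4 per episode for sessions that have episodes.json.
episode_clips = sum(n for n in inventory.values() if n is not None)
# One MP4 per session for the whole-sequence fallback.
whole_seq_clips = sum(1 for n in inventory.values() if n is None)
total_clips = episode_clips + whole_seq_clips

print(len(inventory), episode_clips, whole_seq_clips, total_clips)
# prints: 20 173 3 176
```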

Reproduce:
```
cd /data/cameron/vidgen/svd_motion_lora/Motion-LoRA
conda activate motionlora
NCCL_P2P_DISABLE=1 NCCL_DEBUG=WARN \
WANDB_NAME=svd_mac_teleop_all_20260518 WANDB_PROJECT=SVD-motion-lora \
accelerate launch \
  --multi_gpu --num_processes=4 --gpu_ids=1,2,3,4 \
  --mixed_precision=bf16 --main_process_port=29511 \
  train_svd.py \
  --pretrained_model_name_or_path=checkpoints/stable-video-diffusion-img2vid-xt-1-1 \
  --dataset_path=dataset/mac_teleop_all_20260518 \
  --validation_image_path=dataset/mac_teleop_all_20260518_val \
  --num_frames=7 --width=576 --height=320 \
  --per_gpu_batch_size=4 --max_train_steps=4000 \
  --learning_rate=5e-5 --use_8bit_adam --gradient_checkpointing \
  --mixed_precision=bf16 --report_to=wandb \
  --num_validation_images=4 --validation_steps=200 \
  --checkpointing_steps=500 --seed=42 \
  --output_dir=output_mac_teleop_all_20260518
```

---

## 2026-05-09 — Keyframe-video Motion-LoRA RERUN on dataset_20260509_170535 (18 episodes)

**Status:** running on GPUs 0,1,2,3 via DDP. GPU 5 (para training) and GPU 9 (sister keyframe-PARA — restarted with new pid 2937219) untouched.
**Wandb run:** https://wandb.ai/cameronsmithbusiness/SVD-motion-lora/runs/lufbvj63
**Run name:** `svd_smith300_keyframes_w4_d20260509_170535`
**Process:** pid 2947264; ~28 GB per rank.

**Dataset:** 101 4-frame MP4s under `Motion-LoRA/dataset/smith300_keyframes_w4_d20260509_170535/`. 18 episodes × sliding 4-keyframe windows (window counts per ep: 6,6,7,6,6,5,5,5,6,5,5,6,4,7,4,6,6,6 = 101). Build script: `dataset/smith300_keyframes_w4_d20260509_170535_build.py`. Validation start frames in `dataset/smith300_keyframes_w4_d20260509_170535_val/`.
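
Sanity check on the window counts: a sliding window of size 4 over n keyframes gives n - 3 windows when n ≥ 4, so the per-episode keyframe counts can be recovered from the list above. A minimal sketch; variable names are illustrative, and it assumes no episode needed padding (safe here because every episode yields more than one window):

```python
N_WINDOW = 4

# Per-episode window counts, copied from the dataset note above.
window_counts = [6, 6, 7, 6, 6, 5, 5, 5, 6, 5, 5, 6, 4, 7, 4, 6, 6, 6]

# Invert n_windows = n_keyframes - (N_WINDOW - 1), valid when n_keyframes >= N_WINDOW.
keyframe_counts = [w + N_WINDOW - 1 for w in window_counts]

total_windows = sum(window_counts)
print(len(window_counts), total_windows)            # prints: 18 101
print(min(keyframe_counts), max(keyframe_counts))   # prints: 7 10
```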

**Recipe:** identical to the 2026-05-09 6-episode keyframe run — `--num_frames=4`, 576×320, lr 5e-5, bf16, 8-bit Adam, gradient checkpointing, per_gpu_batch_size=4, max_train_steps=4000. `NCCL_P2P_DISABLE=1`, port 29510.

**Logs:** `Motion-LoRA/logs/smith300_keyframes_w4_d20260509_170535.log`
**Output dir:** `Motion-LoRA/output_smith300_keyframes_w4_d20260509_170535/`

**Prior 6-episode run** (`svd_smith300_keyframes_w4_20260509` / pid 876627) had already exited, so nothing needed killing.

Reproduce:
```
cd /data/cameron/vidgen/svd_motion_lora/Motion-LoRA
conda activate motionlora
NCCL_P2P_DISABLE=1 NCCL_DEBUG=WARN \
WANDB_NAME=svd_smith300_keyframes_w4_d20260509_170535 WANDB_PROJECT=SVD-motion-lora \
accelerate launch \
  --multi_gpu --num_processes=4 --gpu_ids=0,1,2,3 \
  --mixed_precision=bf16 --main_process_port=29510 \
  train_svd.py \
  --pretrained_model_name_or_path=checkpoints/stable-video-diffusion-img2vid-xt-1-1 \
  --dataset_path=dataset/smith300_keyframes_w4_d20260509_170535 \
  --validation_image_path=dataset/smith300_keyframes_w4_d20260509_170535_val \
  --num_frames=4 --width=576 --height=320 \
  --per_gpu_batch_size=4 --max_train_steps=4000 \
  --learning_rate=5e-5 --use_8bit_adam --gradient_checkpointing \
  --mixed_precision=bf16 --report_to=wandb \
  --num_validation_images=4 --validation_steps=200 \
  --checkpointing_steps=500 --seed=42 \
  --output_dir=output_smith300_keyframes_w4_d20260509_170535
```

---

## 2026-05-09 — Keyframe-video Motion-LoRA SVD finetune (dataset_20260509_105300)

**Status:** running on GPUs 0,1,2,3 via DDP. GPU 5 (para training) and GPU 9 (sister keyframe-PARA `smith300_kf_w4_d20260509`) both untouched.
**Wandb run:** https://wandb.ai/cameronsmithbusiness/SVD-motion-lora/runs/p8ju1u4f
**Run name:** `svd_smith300_keyframes_w4_20260509`
**Process:** pid 876627; per-rank ~28 GB on physical GPUs 0/1/2/3.

**Dataset construction:** 19 4-frame MP4s under `Motion-LoRA/dataset/smith300_keyframes_w4_20260509/`, each MP4 = one (episode, sliding-keyframe-window) sample. Mirror of `Smith300TrajectoryDataset(use_keyframes=True)` from `/data/cameron/para/para_mac/data_smith300_para.py`:
- Per episode, take keyframe frame indices in [ep_start, ep_end].
- If `len(kf) < n_window`, pad with `[kf[-1]] * (n_window - len(kf))` — none of the 6 episodes triggered (min keyframes = 5 ≥ 4).
- Sliding window of size 4 over the keyframe list → 3+4+3+3+2+4 = **19 samples** total.
- Build script: `Motion-LoRA/dataset/smith300_keyframes_w4_20260509_build.py`.
- Validation start frames: `Motion-LoRA/dataset/smith300_keyframes_w4_20260509_val/` (6 jpgs, one first-keyframe per episode).
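
The windowing rule above can be sketched as follows. This is a reimplementation of the described behaviour, not the actual `Smith300TrajectoryDataset` or build-script code; the function name and the example keyframe indices are hypothetical, and the per-episode keyframe counts are inferred from the 3+4+3+3+2+4 window counts rather than read from the data:

```python
from typing import List, Sequence


def keyframe_windows(kf: Sequence[int], n_window: int = 4) -> List[List[int]]:
    """Return every sliding window of n_window keyframe indices.

    Episodes with fewer than n_window keyframes are padded by repeating
    the last keyframe, so they still produce exactly one window.
    """
    kf = list(kf)
    if not kf:
        return []
    if len(kf) < n_window:
        kf = kf + [kf[-1]] * (n_window - len(kf))
    return [kf[i:i + n_window] for i in range(len(kf) - n_window + 1)]


# Hypothetical episode with 5 keyframes -> 2 windows.
print(keyframe_windows([0, 12, 30, 41, 57]))
# prints: [[0, 12, 30, 41], [12, 30, 41, 57]]

# Hypothetical short episode -> padded to a single window.
print(keyframe_windows([3, 9]))
# prints: [[3, 9, 9, 9]]

# Keyframe counts implied by the reported window counts (windows + 3).
implied_keyframe_counts = [6, 7, 6, 6, 5, 7]
total_samples = sum(len(keyframe_windows(range(n))) for n in implied_keyframe_counts)
print(total_samples)
# prints: 19
```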

**Recipe (only diff vs prior towel+UMI run is num_frames=4):**
- 4 frames @ 576×320, per_gpu_batch_size=4, max_train_steps=4000
- lr 5e-5, bf16, 8-bit Adam, gradient checkpointing, seed 42
- LoRA rank 128, validation_steps=200, num_validation_images=4, checkpointing_steps=500
- base SVD-XT-1-1, no `--pretrain_unet`
- `NCCL_P2P_DISABLE=1` (mandatory on this box for DDP), main_process_port=29509

**Logs:** `Motion-LoRA/logs/smith300_keyframes_w4_20260509.log`
**Output dir:** `Motion-LoRA/output_smith300_keyframes_w4_20260509/`

Reproduce:
```
cd /data/cameron/vidgen/svd_motion_lora/Motion-LoRA
conda activate motionlora
NCCL_P2P_DISABLE=1 NCCL_DEBUG=WARN \
WANDB_NAME=svd_smith300_keyframes_w4_20260509 WANDB_PROJECT=SVD-motion-lora \
accelerate launch \
  --multi_gpu --num_processes=4 --gpu_ids=0,1,2,3 \
  --mixed_precision=bf16 --main_process_port=29509 \
  train_svd.py \
  --pretrained_model_name_or_path=checkpoints/stable-video-diffusion-img2vid-xt-1-1 \
  --dataset_path=dataset/smith300_keyframes_w4_20260509 \
  --validation_image_path=dataset/smith300_keyframes_w4_20260509_val \
  --num_frames=4 --width=576 --height=320 \
  --per_gpu_batch_size=4 --max_train_steps=4000 \
  --learning_rate=5e-5 --use_8bit_adam --gradient_checkpointing \
  --mixed_precision=bf16 --report_to=wandb \
  --num_validation_images=4 --validation_steps=200 \
  --checkpointing_steps=500 --seed=42 \
  --output_dir=output_smith300_keyframes_w4_20260509
```

---

## 2026-05-07 — Towel + UMI joint Motion-LoRA SVD finetune

**Status:** running on GPUs 0,1,2,3 via DDP; GPU 5 (para training) untouched.
**Wandb run:** https://wandb.ai/cameronsmithbusiness/SVD-motion-lora/runs/jhcvddj6
**Wandb project:** https://wandb.ai/cameronsmithbusiness/SVD-motion-lora
**Run name:** `svd_towel_umi_combined_20260507`
**Process:** pid 2071709 (4 worker subprocs at ~39 GB each on physical GPUs 0-3, uuids edd60fb8/dcf0d006/6f205a2b/6ac5c8bf).

**Codebase:** `/data/cameron/vidgen/svd_motion_lora/Motion-LoRA/train_svd.py` — Motion-LoRA only, never the in-repo `para/video_training/svd_finetune/train.py`. Logged as durable feedback memory `feedback_video_run_motion_lora.md`.
**Conda env:** `motionlora`.

**Datasets (joint, combined into one mp4 dir):**
- `/data/cameron/mac_robot_datasets/dataset_20260506_124503/` — towel folding, 11 episodes
- `/data/cameron/mac_robot_datasets/dataset_20260506_151912/` — UMI handheld, 11 episodes
- 22 episodes total, all included (none too short for num_frames=7).
- Combined MP4 dir: `dataset/towel_umi_combined_20260507/`
- Validation start-frames: `dataset/towel_umi_combined_20260507_val/`
- Built via `dataset/towel_umi_combined_20260507_build.py`.

**Recipe (matches the April-2 setup that produced the 90% libero result):**
- 7 frames @ 576×320 (mac_robot_datasets are 960×540 → resized down)
- per_gpu_batch_size=4, 4 GPUs DDP → effective batch 16 per step
- max_train_steps=4000
- lr 5e-5, bf16, 8-bit Adam, gradient checkpointing, seed 42
- LoRA rank 128 (train_svd.py argparse default)
- validation_steps=200, num_validation_images=4, checkpointing_steps=500
- base SVD-XT-1-1, no `--pretrain_unet` (avoid libero artifacts on real-arm data)
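
For reference, the schedule implied by this recipe can be worked out directly. A minimal sketch under two assumptions not stated above: gradient accumulation is 1, and the trainer validates and checkpoints whenever the global step is a multiple of the configured interval:

```python
per_gpu_batch_size = 4
num_gpus = 4
max_train_steps = 4000
validation_steps = 200
checkpointing_steps = 500
gradient_accumulation_steps = 1  # assumption: not specified in the recipe

# Samples consumed per optimizer step across all DDP ranks.
effective_batch = per_gpu_batch_size * num_gpus * gradient_accumulation_steps

# Total training windows drawn over the whole run (with repetition).
samples_drawn = effective_batch * max_train_steps

# Steps at which validation and checkpointing would fire under the
# "step is a multiple of the interval" assumption.
validation_at = list(range(validation_steps, max_train_steps + 1, validation_steps))
checkpoint_at = list(range(checkpointing_steps, max_train_steps + 1, checkpointing_steps))

print(effective_batch, samples_drawn)          # prints: 16 64000
print(len(validation_at), len(checkpoint_at))  # prints: 20 8
```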

**Logs:** `/data/cameron/vidgen/svd_motion_lora/Motion-LoRA/logs/towel_umi_combined_20260507.log`
**Output dir:** `/data/cameron/vidgen/svd_motion_lora/Motion-LoRA/output_towel_umi_combined_20260507/`

**Critical env requirement (April-2 report flagged it; I missed it on the first launch and lost ~12 min):**
```
NCCL_P2P_DISABLE=1
```
Without this, multi-GPU DDP on this machine hangs at 100% GPU util in NCCL collectives, post-rank-init, with no log progress. First multi-GPU launch (port 29507) hung; relaunched on port 29508 with the env var and it started in ~25 s.
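
To avoid repeating that lost launch, a pre-launch guard could refuse to start a multi-GPU job without the variable. A minimal sketch; `check_nccl_p2p_disabled` is a hypothetical helper, not part of Motion-LoRA or accelerate:

```python
import os
from typing import Mapping, Optional


def check_nccl_p2p_disabled(num_processes: int,
                            env: Optional[Mapping[str, str]] = None) -> None:
    """Raise if a multi-process launch is attempted without NCCL_P2P_DISABLE=1.

    Single-process launches are allowed through unchanged, since the hang
    described above was only observed with multi-GPU DDP.
    """
    env = os.environ if env is None else env
    if num_processes > 1 and env.get("NCCL_P2P_DISABLE") != "1":
        raise RuntimeError(
            "NCCL_P2P_DISABLE=1 is required for multi-GPU DDP on this machine; "
            "without it the run hangs in NCCL collectives."
        )


# Passes silently: the variable is set for a 4-process launch.
check_nccl_p2p_disabled(4, env={"NCCL_P2P_DISABLE": "1"})

# Passes silently: single-process launch.
check_nccl_p2p_disabled(1, env={})
```

Calling it with `env=None` checks the real process environment, so it can sit at the top of a launcher script.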

**Para training preserved (GPU 5):** uuid `e074c8b2`, 3.3 GB, untouched throughout the kill/relaunch.

Reproduce (active run):
```
cd /data/cameron/vidgen/svd_motion_lora/Motion-LoRA
conda activate motionlora
NCCL_P2P_DISABLE=1 NCCL_DEBUG=WARN \
WANDB_NAME=svd_towel_umi_combined_20260507 WANDB_PROJECT=SVD-motion-lora \
accelerate launch \
  --multi_gpu --num_processes=4 --gpu_ids=0,1,2,3 \
  --mixed_precision=bf16 --main_process_port=29508 \
  train_svd.py \
  --pretrained_model_name_or_path=checkpoints/stable-video-diffusion-img2vid-xt-1-1 \
  --dataset_path=dataset/towel_umi_combined_20260507 \
  --validation_image_path=dataset/towel_umi_combined_20260507_val \
  --num_frames=7 --width=576 --height=320 \
  --per_gpu_batch_size=4 --max_train_steps=4000 \
  --learning_rate=5e-5 --use_8bit_adam --gradient_checkpointing \
  --mixed_precision=bf16 --report_to=wandb \
  --num_validation_images=4 --validation_steps=200 \
  --checkpointing_steps=500 --seed=42 \
  --output_dir=output_towel_umi_combined_20260507
```

---

## 2026-05-03 — SVD finetune on Smith300 (Motion-LoRA recipe, base SVD warm-start) [completed]

**Wandb run:** https://wandb.ai/cameronsmithbusiness/SVD-motion-lora/runs/3fizi51f
**Run name:** `svd_smith300_stage1_motionlora_baseSVD`

Single-GPU Motion-LoRA finetune on smith300_20260501_180125 (2 episodes). Same recipe, except per_gpu_batch_size=2 on a single GPU (GPU 8, selected via `--gpu_ids=8`). Base SVD warm-start.

Three earlier dead runs from that day:
- `svd_finetune/runs/z68s6urb` — in-repo script, vis upload bug (torchvision.io.write_video chokes on pict_type=NONE).
- `svd_finetune/runs/s07cb1pn` — in-repo script with vis fix; killed when user asked to switch to Motion-LoRA recipe.
- `SVD-motion-lora/runs/kcj2lztw` — first Motion-LoRA attempt, option-(a) libero warm-start AND on GPU 0 by accident (single_gpu.yaml `gpu_ids: 0,` clobbered `CUDA_VISIBLE_DEVICES=8`). Killed for both reasons.

**GPU placement fix:** pass `--gpu_ids=N` to `accelerate launch` directly (overrides config). `CUDA_VISIBLE_DEVICES` env alone is not enough.
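
One way to make that fix hard to forget is to always build the launch command from an explicit GPU list. A minimal sketch; `accelerate_launch_argv` is a hypothetical helper, and it uses only the `accelerate launch` flags that already appear in the reproduce commands in this file:

```python
import shlex
from typing import List, Optional, Sequence


def accelerate_launch_argv(gpu_ids: Sequence[int],
                           script: str,
                           script_args: Sequence[str] = (),
                           port: Optional[int] = None) -> List[str]:
    """Build an `accelerate launch` argv with GPU placement given explicitly.

    --gpu_ids is always passed on the command line so placement does not
    depend on the accelerate config file or on CUDA_VISIBLE_DEVICES.
    """
    if not gpu_ids:
        raise ValueError("gpu_ids must contain at least one GPU index")
    argv = ["accelerate", "launch"]
    if len(gpu_ids) > 1:
        argv.append("--multi_gpu")
    argv.append(f"--num_processes={len(gpu_ids)}")
    argv.append("--gpu_ids=" + ",".join(str(g) for g in gpu_ids))
    argv.append("--mixed_precision=bf16")
    if port is not None:
        argv.append(f"--main_process_port={port}")
    argv.append(script)
    argv.extend(script_args)
    return argv


# Single-GPU placement on GPU 8, as in this entry.
single = accelerate_launch_argv([8], "train_svd.py", ["--per_gpu_batch_size=2"])
print(shlex.join(single))

# Four-GPU DDP placement, matching the 2026-05-18 launch flags.
multi = accelerate_launch_argv([1, 2, 3, 4], "train_svd.py", ["--num_frames=7"], port=29511)
print(shlex.join(multi))
```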
