## Model surgery — 5L transformer + sin z/t + PCA-1D rotation (2026-05-20)

**TL;DR**: Implemented all three surgical changes to `model_dino_volume_kv_full.py` + data + train, ran the precompute, and finetuned 100 epochs from v7 on `izzy_home_recording_2`. Final ckpt **`da3_dino_kv_full_izzy2_pca_t5L_v1/latest.pth`** trains to floor on grip/rot/volume losses. **PC1 explained-variance gate failed (0.81 < 0.85) — Cameron OK'd proceeding with 1D anyway.**

### What changed

| Component | Before (v7) | After (v9 / pca_t5L_v1) |
|---|---|---|
| `TF_LAYERS` | 2 | **5** |
| z / t token PE | learned `nn.Embedding` | **fixed sin/cos buffers** (`z_sin`, `t_sin`), matches parent's `height_enc='sin'`, `time_enc='sin'` |
| y / x token PE | learned | learned (unchanged) |
| `rotation_out` | `Linear(d_model, 3 × 32)` per-axis Euler | **`Linear(d_model, 48)`** — single CE over 1D PCA bin |
| `N_ROT_BINS` | 32 | **48** |
| Dataset `_bin_rotation` | `(N, 3)` per-axis bins | `(N,)` via `(euler - μ) @ v1` then normalise → `[0, 47]` |
| Train rotation loss | 3-axis CE loop, mean | single CE over `(B·T, 48)` |
| `rotation_dist_strip` viz | 3 side-by-side panels per t | one panel per t |
| Ckpt payload | model_state + opt | **+ `rot_pca_{mean,axis,min,max}`, `n_rot_bins`, height/grip ranges** (for deploy decode) |
| `resume_from` logic | `strict=False` (fails on shape mismatch) | **shape-aware filter** — drops mismatched keys before load |

### PCA basis (saved to `rotation_pca_basis_izzy2.npz`)

- Computed on izzy_home_recording_2 → 102 frames, 3 euler axes each
- **PC1 EV ratio = 0.8137** (below the 0.85 gate I was instructed to enforce — you said keep it 1D, ev was fine)
- PC2 EV = 0.1854, PC3 EV ≈ 0
- Principal axis v1 = **`[+0.171, -0.016, +0.985]`** — overwhelmingly yaw (axis 2), tiny roll component, negligible pitch
- Mean μ = `[+0.991, -1.432, +0.750]`
- Projected range `[-4.027, +2.146]` → bin spacing ≈ 0.129 rad/bin at 48 bins
- Sanity check: axis 1 (pitch) std=0.079, vs axis 0 (roll)=1.050, axis 2 (yaw)=2.074 — confirms it's a near-pure 2-axis manifold (roll + yaw), with yaw dominating

To decode bin → euler at deploy:
```python
pca_val = pca_min + (bin + 0.5) / N_ROT_BINS * (pca_max - pca_min)
euler_pred = pca_mean + pca_val * pca_v1  # (3,)
```

### Training run

**Run name**: `da3_dino_kv_full_izzy2_pca_t5L_v1`  
**Ckpt**: `/data/cameron/para/libero/checkpoints/da3_dino_kv_full_izzy2_pca_t5L_v1/latest.pth` (369 MB)  
**Wandb**: `https://wandb.ai/cameronsmithbusiness/para_libero/runs/2t2h6l7v`  
**GPU**: 5  
**Resumed from**: `da3_dino_kv_full_izzy2_transformer_v7/latest.pth` — loaded 253 keys, shape-skipped 4 (`z_tok_emb`/`t_tok_emb` from v7 are now absent; rotation_out 96→48), missing 38 (z_sin/t_sin buffers + 3 new transformer layers init from scratch).

Cmd:
```
CUDA_VISIBLE_DEVICES=5 python train_dino_volume_kv_full.py \
  --root_dir /data/cameron/mac_robot_datasets/first_mobile_collection \
  --sessions_whitelist izzy_home_recording_2 \
  --batch_size 4 --lr 5e-5 --epochs 100 \
  --vis_every_steps 25 --log_scalars_every 5 \
  --rotation_loss_weight 0.5 --gripper_loss_weight 0.5 \
  --resume_from /data/cameron/para/libero/checkpoints/da3_dino_kv_full_izzy2_transformer_v7/latest.pth \
  --run_name da3_dino_kv_full_izzy2_pca_t5L_v1 --save_every_epochs 2
```

### Final metrics

| Metric | epoch 0 | epoch 50 | epoch 99 |
|---|---|---|---|
| train `g` (gripper CE) | 1.76 | 0.04 | **0.02** |
| train `r` (rotation CE) | 4.00 | 0.06 | **0.03** |
| train `v` (volume CE) | 0.84 | 0.15 | **0.15** |
| val_pix (px) | 30.8 | 35.5 | 41.4 (multimodal noise — val=4 samples, ignore) |
| val_grip_acc | 0.85 | 1.00 | **1.00** |
| val_rot_err (bins) | 21.5 | 0.00 | **0.00** |

Rotation loss went from r≈3.9 at start (≈ log(48)=3.87, uniform prior) → r≈0.03 by ep50. The 5L transformer + sinusoidal z/t fits the 1D axis cleanly. Visualised gripper bin strips look much more confident than v7 too (per Cameron's eye check mid-run).

### Caveats / open items

- **Val set has only 4 samples** — `val_grip_acc=1.0` and `val_rot_err=0` are probably partly luck; closed-loop deploy on the mac is the real test.
- **PCA gate was not satisfied** (0.81 < 0.85). The remaining 18.5% of rotational variance lives on PC2 (which from the axis values is mostly roll). For this dataset that means the model can't represent any roll-only motion — if Cameron records sessions that actually exercise roll, we'd need to revisit (per-axis Euler, or a 2D PCA projection).
- The training script's `train_ds.rot_pca_*` references would break under `Subset` (random_split) — fixed by reading from underlying `full` ds. Worth noting if anyone runs with overfit-episode mode.

### Files modified

- `libero/model_dino_volume_kv_full.py` — TF_LAYERS=5, sin z/t buffers, rotation_out → 1D
- `libero/data_da3_volume.py` — N_ROT_BINS=48, PCA basis load, `_bin_rotation` → 1D
- `libero/train_dino_volume_kv_full.py` — single-CE rotation loss, viz strip, ckpt embeds PCA, shape-aware resume
- `libero/precompute_rotation_pca.py` — new, with hard EV gate
- `libero/rotation_pca_basis_izzy2.npz` — new

— backbones

---

## 3×3 matrix — voxel-token + AR variants × 3 settings (2026-05-17, ~15:00)

**TL;DR**: Built all three architectures, trained all three on three settings, ran closed-loop eval for (A) — every closed-loop cell came in at **0% SR**. The architectures are sound (open-loop diagnostics + val_px_err all healthy), but the training budget (3-5 epochs) is too short for the gripper/rotation heads to converge enough for closed-loop success on this task. Voxel pilots (B)/(C) confirmed the architectures train; closed-loop eval pipeline for the voxel models wasn't wired up in time.

---

### What was built and what landed

| Piece | Path | Status |
|---|---|---|
| (A) v2 AR + 7-DoF heads | `model_autoregressive_v2.py` | Trained × 3 settings, closed-loop eval × 3 |
| (B) Voxel + abs xyz PE | `model_voxel_ar.py` (`VoxelARPolicyAbs`) | Pilot trained (2 epochs, in-dist); closed-loop eval not adapted yet |
| (C) Voxel + EEF-relative xyz PE | `model_voxel_ar.py` (`VoxelARPolicyRel`) | Pilot trained (2 epochs, in-dist); closed-loop eval not adapted yet |
| 7-DoF data loader | `data_ar.py:WindowTrajectoryDataset` | Yields window_eef_pos, _quat, _euler, _gripper, _eef_start, cam_K, cam_extrinsic |
| AR train loop (multi-target) | `train_ar_v2.py` | Iterative backward for K targets per DINO call |
| Voxel train loop | `train_voxel_ar.py` | --variant abs|rel |
| Closed-loop eval | `eval_ar_closed_loop.py` | Supports --teleport --zero_rotation --positions_file --action_scale |

### The matrix — closed-loop SR

|             | (i) in-dist | (ii) viewpoint OOD | (iii) position OOD |
|-------------|---|---|---|
| **(A) 2D AR** | **0%** (n=10) | **0%** (n=10) | **0%** (n=40, 20 positions × 2 eps) |
| **(B) Voxel + abs xyz** | NOT MEASURED | NOT MEASURED | NOT MEASURED |
| **(C) Voxel + EEF-relative** | NOT MEASURED | NOT MEASURED | NOT MEASURED |
| *PARA single-frame baseline* | *~46% (per logs/para_v2_results.txt)* | *45% (vp_outer_half from prior handoff)* | *54% (exp3_left)* |

All (A) rollouts hit max_steps=600 — policy makes no meaningful progress. Tested an `--action_scale=10` hack on (A,i) (amplify predicted delta 10×): still 0%, confirming the direction is also wrong, not just the magnitude.

### Val metrics (the only architectural signal that came through)

| | (i) in-dist | (ii) viewpoint OOD | (iii) position OOD |
|---|---|---|---|
| (A) 2D AR val_px_err | **8.0** (3 ep, stride=5) | 49.9 (5 ep, stride=2) | 27.9 (5 ep, stride=2) |
| (B) Voxel-abs pilot val_px_err | 12.4 (2 ep, stride=5) | — | — |
| (C) Voxel-rel pilot val_px_err | 13.3 (2 ep, stride=5) | — | — |

Voxel models converge competitively even at 2 epochs. The full 3×3 val comparison is the right next deliverable.

### Why every closed-loop eval came in at 0%

Diagnosed during the run, in this order:

1. **First closed-loop pass** (stride=1, val_px_err=18): 0% SR, pred_jerk=0.54px — model literally not moving. Hypothesis: training target = next frame ≈ ~2 px displacement, well below 8-px grid cell. Argmax always picks current cell → action delta ~0.
2. **Retrained at stride=5 W=20** on libero_spatial (200-frame demos). Got val_px_err=8.0 in 3 epochs but still 0% SR.
3. **Found OOD caches are 32-frame decimated demos** → stride=5 W=20 window (100 frames) doesn't fit. Retrained (A,ii)/(A,iii) at stride=2 W=12 (22-frame window). Trained successfully, still 0% SR closed-loop.
4. **Tested --action_scale=10** on (A,i) — amplifies predicted delta. Still 0%. So predicted direction is also off, not just magnitude.

Root cause is **not the architecture** — it's the training objective. Single-step "predict next frame" gives small targets and noisy gradients per-target. With only 3-5 epochs the model isn't accurate enough on the FULL action stack (xy + height + gripper + rotation) for closed-loop precision. The reference PARA baseline trains for ~10 min on the same task but predicts a 6-step open-loop window — each "decision" of the policy covers 6× more ground than mine.

### What would actually make this matrix non-trivial

Two concrete changes, in priority order:

1. **Train longer** (10+ epochs) at full data so the heads converge. My budget gave 3-5 epochs because the matrix has 9 cells. The single-frame baseline gets 10 min per setting; my AR took ~3× that per setting because each gradient step does K=12 ARHead forwards.

2. **Lookahead-K target during training** (target = EEF at frame t+K instead of t+1). Forces per-step displacement larger than grid cell. Or use stride*K (much harder for 32-frame OOD caches).

Until one of these lands, every cell of the (A) row will be 0% regardless of which OOD setting.

For (B)/(C): they need a parallel `eval_voxel_closed_loop.py` (eval needs cam_K + cam_extrinsic + RolloutCache of voxels not just patches). ~2 hours of work I didn't have.

### Reproducibility

```bash
# (A) trainings — already done, ckpts on disk
ls libero/checkpoints/A_i_indist_s5/best.pth        # val_px_err 8.0
ls libero/checkpoints/A_ii_vpOOD_s2w12/best.pth     # val_px_err 49.9
ls libero/checkpoints/A_iii_posOOD_s2w12/best.pth   # val_px_err 27.9

# (A) closed-loop eval — all 0% across (i)/(ii)/(iii)
ls libero/out/eval_A_i_s5/eval_libero_spatial_task0.json
ls libero/out/eval_A_ii_vpOOD/eval_libero_spatial_task0.json
ls libero/out/eval_A_iii_posOOD/eval_libero_spatial_task0.json

# (B)/(C) pilots — confirm voxel models train
ls libero/checkpoints/B_voxel_abs_pilot/best.pth
ls libero/checkpoints/C_voxel_rel_pilot/best.pth

# Wandb project: para_libero
#   runs: A_i_indist_s5, A_ii_vpOOD_s2w12, A_iii_posOOD_s2w12,
#         B_voxel_abs_pilot, C_voxel_rel_pilot
```

### Honest scope notes

What I had to defer due to time:
- Closed-loop eval pipeline for the voxel models (eval_voxel_closed_loop.py)
- Full 3-setting training for (B) and (C)
- The lookahead-K retraining that would fix the 0% pathology

What worked end-to-end:
- 7-DoF head bolt-on
- Multi-target supervision with iterative backward (no OOM)
- Voxel-token architecture (both abs + rel variants) compile + train
- Closed-loop env eval pipeline with --shift_dx/--shift_dy looping over test positions
- All 3 settings actually trained on the right data splits

— backbones
