# Production 1-view query-MLP architecture — porting guide (backbones → yams, 2026-05-26)

## (1) Canonical files to copy

For a **single-view + real-robot data** port (NOT libero sim), the smith300 flow is the right template since YAM data is structurally similar (per-demo directories of frames + EEF/quat/gripper trajectories):

```
/data/cameron/para/libero/model_dino_volume_query.py     # the model — production arch
/data/cameron/para/libero/train_smith300_volume.py       # real-robot trainer (template)
/data/cameron/para/libero/data_smith300_volume.py        # cached-trajectory dataset
/data/cameron/para/libero/utils.py                       # recover_3d_from_direct_keypoint_and_height
/data/cameron/para/libero/precompute_rotation_pca.py     # 1D PCA basis for rotation (run per-dataset)
/data/cameron/para/libero/precompute_rotation_kmeans.py  # alternative: K-means rotation prototypes
```

If you want the libero-style trainer (for closed-loop eval scaffolding) as a secondary reference, see `train_libero_query.py` and `eval_libero_query.py` in the same directory.

## (2) One-paragraph architecture & loss

**Architecture (`DinoVolumeQuery` in `model_dino_volume_query.py`):** DINOv3 ViT-S/16+ extracts patch tokens at `embed_dim=384`. A 2-stage refine (1×1 conv to `d_feat=48` then 3-layer 3×3 conv) produces `F ∈ R^(48 × 56 × 56)` after bilinear upsampling. A per-step query is built from `concat(F[start_pix], CLS)` projected to `d_model = d_feat + d_sin_z + d_sin_t`, replicated across `T = N_WINDOW`, and routed through **5 AdaLN-Zero residual MLP blocks** conditioned on `sin(t)`. The 5-block output `q_spatial[b, t]` is split into `(q_F, q_z, q_t)` of dims `(48, 24, 24)`. Volume logits factor as `logit(b, t, z, y, x) = q_F · F[y, x] + q_z · sin(z) + q_t · sin(t)` — never materialise the (T, Z, H, W) volume in memory. Three heads on the penultimate per-timestep representation: gripper bins (`Linear → 32`), rotation (1D PCA: `Linear → 32` over the PC1-projected scalar), and the spatial query head.
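
The factored logit above can be sketched in a few lines. This is a minimal NumPy stand-in for the PyTorch tensors, not the code in `model_dino_volume_query.py`: the batch dimension is dropped, `sinusoidal_table` is a hypothetical substitute for `sinusoidal_features(n, d)` (its exact frequency layout is an assumption), and `F` and the query are random placeholders. The point is that only a `(T, H, W)` map, a `(T, Z)` map and a `(T,)` bias are ever stored.

```python
import numpy as np

def sinusoidal_table(n, d):
    """Hypothetical stand-in for sinusoidal_features(n, d): an (n, d) sin/cos table."""
    pos = np.arange(n)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, d, 2) / d)
    ang = pos * freqs[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

T, Z, H, W = 8, 32, 56, 56
D_FEAT, D_SINZ, D_SINT = 48, 24, 24
rng = np.random.default_rng(0)

F = rng.standard_normal((D_FEAT, H, W))                   # placeholder feature map
q = rng.standard_normal((T, D_FEAT + D_SINZ + D_SINT))    # placeholder q_spatial[b] for one sample
q_F, q_z, q_t = np.split(q, [D_FEAT, D_FEAT + D_SINZ], axis=1)

pe_z = sinusoidal_table(Z, D_SINZ)   # (Z, 24)
pe_t = sinusoidal_table(T, D_SINT)   # (T, 24)

# Three small factors instead of one (T, Z, H, W) volume.
spatial = np.einsum("td,dhw->thw", q_F, F)   # (T, H, W): q_F . F[y, x]
height = q_z @ pe_z.T                        # (T, Z):    q_z . sin(z)
time_bias = np.sum(q_t * pe_t, axis=1)       # (T,):      q_t . sin(t)

def logit_at(t, z, y, x):
    """Logit of a single voxel, assembled from the three factors."""
    return spatial[t, y, x] + height[t, z] + time_bias[t]
```

Any voxel logit can be read off on demand with `logit_at`, which is all the loss and the argmax decoder need.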

**Loss:** joint cross-entropy over the flattened `(Z × H × W)` per-timestep volume, with the GT voxel index computed from `(target_height_bin, gt_pix_y, gt_pix_x)`, plus per-timestep CE on gripper bins and rotation bins (the 1D PCA scalar discretised into 32 bins). Default weighting: `total = volume_loss + 0.5 * gripper_loss + 0.5 * rotation_loss`.
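
Because the logits are a sum of a spatial term, a height term and a per-timestep constant, the joint log-sum-exp over `(Z × H × W)` splits into `logsumexp(spatial) + logsumexp(height) + const`, and the constant cancels against the GT logit. The sketch below (NumPy, batch dimension dropped, mean reduction over `T` assumed) shows that identity and checks it against a dense reference. Whether the trainer computes the CE this way is not stated here; function names are illustrative.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def factored_volume_ce(spatial, height, gt_z, gt_y, gt_x):
    """Joint CE over (Z x H x W) without building the volume.

    spatial: (T, H, W), height: (T, Z), gt_*: integer arrays of shape (T,).
    The per-timestep time bias cancels in the softmax, so it is not needed.
    """
    T = spatial.shape[0]
    t = np.arange(T)
    lse_s = logsumexp(spatial.reshape(T, -1), axis=1)
    lse_h = logsumexp(height, axis=1)
    nll = (lse_s - spatial[t, gt_y, gt_x]) + (lse_h - height[t, gt_z])
    return float(np.mean(nll))

def dense_volume_ce(spatial, height, gt_z, gt_y, gt_x):
    """Reference only: materialises the (T, Z, H, W) volume to verify the identity."""
    T, H, W = spatial.shape
    vol = height[:, :, None, None] + spatial[:, None, :, :]
    flat = vol.reshape(T, -1)
    idx = (gt_z * H + gt_y) * W + gt_x          # flattened (z, y, x) index
    nll = logsumexp(flat, axis=1) - flat[np.arange(T), idx]
    return float(np.mean(nll))

def total_loss(volume_loss, gripper_loss, rotation_loss):
    """Default weighting from the text."""
    return volume_loss + 0.5 * gripper_loss + 0.5 * rotation_loss
```

A side effect of the factorisation is that the joint volume CE equals a 2D pixel CE plus a 1D height-bin CE, which makes the two contributions easy to log separately.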

## (3) Non-obvious dependencies

| What | Path / detail |
|---|---|
| DINOv3 repo (`torch.hub.load(source='local', ...)`) | `DINO_REPO_DIR=/data/cameron/keygrip/dinov3` |
| DINOv3 weights (ViT-S/16+) | `DINO_WEIGHTS_PATH=/data/cameron/keygrip/dinov3/weights/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth` |
| Rotation PCA basis | per-dataset. For libero it's `/data/cameron/para/libero/rotation_pca_basis_libero_spatial_t0.npz`. **Run `precompute_rotation_pca.py` on YAM training quats first.** |
| Sinusoidal PE helper | `sinusoidal_features(n, d)` in `model_dino_volume_query.py` — local, not external |
| Image normalization | ImageNet mean/std applied to `[0, 1]` input by the dataset loader (`data_smith300_volume.py`). Don't double-normalize. |
| `--vis_every_steps` | Passing `0` triggers a zero-division bug. Use `100000` to effectively disable visualisation. |
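
For orientation before running `precompute_rotation_pca.py` on YAM quats, here is a rough sketch of what a 1D rotation PCA fit involves: fit PC1 on the quaternions, report its explained-variance ratio, and discretise the projected scalar into 32 bins. This is an assumption-laden illustration, not the script's actual logic. In particular, the sign alignment to the first sample (since `q` and `-q` encode the same rotation) and the uniform bin edges are choices made here, and the function name and return keys are hypothetical.

```python
import numpy as np

def fit_rotation_pc1(quats, n_bins=32):
    """Fit a 1D PCA basis on (N, 4) quaternions and bin the PC1 projection."""
    q = np.asarray(quats, dtype=np.float64)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)

    # Assumption: resolve the q / -q ambiguity by aligning to the first sample.
    sign = np.where(q @ q[0] < 0.0, -1.0, 1.0)
    q = q * sign[:, None]

    mean = q.mean(axis=0)
    _, s, vt = np.linalg.svd(q - mean, full_matrices=False)
    var = s ** 2
    ev_ratio = float(var[0] / var.sum())   # PC1 explained-variance ratio

    pc1 = vt[0]
    proj = (q - mean) @ pc1                # scalar per sample

    # Assumption: uniform bin edges between the observed min and max.
    edges = np.linspace(proj.min(), proj.max(), n_bins + 1)
    bins = np.clip(np.digitize(proj, edges[1:-1]), 0, n_bins - 1)
    return {"mean": mean, "pc1": pc1, "ev_ratio": ev_ratio,
            "edges": edges, "bins": bins}
```

If `ev_ratio` comes out well below roughly 0.8, that is the situation where the K-means prototypes from `precompute_rotation_kmeans.py` are the suggested alternative.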

## (4) Default config / hyperparams

| Hyperparam | Value | Notes |
|---|---|---|
| Image size | 504 (model default) / 448 (libero data) | Match training/eval render size carefully |
| Patch size | 16 | DINOv3 fixed |
| Patch grid → upsample | 28×28 → 56×56 | 28×28 is the grid at 448 input (448 / 16); bilinear + 3-conv refine |
| `N_WINDOW` (T) | 8 for libero / 50 for smith300 dense | Pick by trajectory frame rate |
| `N_HEIGHT_BINS` (Z) | 32 | Min/max world-Z from dataset stats |
| `N_GRIPPER_BINS` | 32 | |
| `N_ROT_BINS` | 32 | 1D PCA discretised |
| `D_FEAT` | 48 | Per-pixel feature dim |
| `D_SINZ`, `D_SINT` | 24, 24 | Sinusoidal PE dims |
| `D_COND` | 96 | AdaLN conditioning dim |
| `N_BLOCKS` | 5 | AdaLN-Zero MLP blocks |
| `PRED_SIZE` | 56 | Feature/volume spatial grid |
| Batch size | 8 | Comfortable on A100/H100 at 448 |
| LR | 5e-5 | AdamW, weight_decay 1e-4 |
| Epochs | 20-30 | Until train_pix_argmax plateaus |
| `frame_stride` | 3 | At 20 Hz → ~6.7 Hz effective |
| `rotation_mode` | `'1d_pca'` | Use `'kmeans'` if PC1 EV < ~80% (multimodal rotation) |
| Checkpoint to use | `latest.pth`, not `best.pth` | Val is multimodal-noisy, so best-by-val selection is unreliable; see `feedback_latest_not_best_ckpt` |
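
A few of the numbers in this table are linked by simple arithmetic, and a small check catches mismatches early in a port. The dataclass below is a hypothetical convenience, not something in the repo. It only encodes facts from the table: `448 / 16 = 28`, `48 + 24 + 24 = 96`, and `20 Hz / 3 ≈ 6.7 Hz`. Note that `504 / 16 = 31.5`, so the 28×28 row corresponds to the 448 input. How the model handles 504 is not covered in this note.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PortConfig:
    """Hypothetical config holder for sanity-checking table values."""
    image_size: int = 448
    patch_size: int = 16
    d_feat: int = 48
    d_sinz: int = 24
    d_sint: int = 24
    source_hz: float = 20.0
    frame_stride: int = 3

    @property
    def patch_grid(self) -> int:
        # A non-integer patch grid usually means a resize/crop mismatch.
        if self.image_size % self.patch_size != 0:
            raise ValueError(
                f"image_size {self.image_size} is not a multiple of "
                f"patch_size {self.patch_size}"
            )
        return self.image_size // self.patch_size

    @property
    def d_model(self) -> int:
        # d_model = d_feat + d_sin_z + d_sin_t
        return self.d_feat + self.d_sinz + self.d_sint

    @property
    def effective_hz(self) -> float:
        # Frame rate seen by the model after striding.
        return self.source_hz / self.frame_stride

cfg = PortConfig()
```

For YAM, swapping in the real capture rate for `source_hz` gives the effective rate to weigh against `N_WINDOW`.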

## (5) Watch-outs for YAM bimanual

- **Single-arm first.** Production query-MLP is single-EEF. Bimanual needs two scoring streams or interleaving — separate experiment after you've ported single-arm.
- **Per-demo camera pose.** If the YAM camera pose varies per demo (likely with hand-eye calib), use the per-batch BEV table approach from `train_libero_2view.py` (`build_bev_world_xyz_table_batched`). For a fixed camera, the static table is fine.
- **Headline metric** is **train_pix_argmax** (2D argmax-decoded train pix err), NOT val (val is multimodal-noisy and caps at the ambiguity floor). See `feedback_train_argmax_metric` memory.
- **Compare runs on train_pix/grip_acc/rot_acc**, not val.
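
As a reference point for the headline metric, a 2D argmax-decoded pixel error can be computed roughly as below. The exact definition of `train_pix_argmax` in the trainer is not given in this note, so treat this as one plausible reading: take the argmax of the per-timestep `(H, W)` spatial logits and measure Euclidean distance to the GT pixel. The function name and the `scale` parameter (to convert 56×56 grid units to image pixels) are assumptions.

```python
import numpy as np

def argmax_pixel_error(spatial_logits, gt_y, gt_x, scale=1.0):
    """Mean Euclidean error between argmax-decoded and GT pixels.

    spatial_logits: (T, H, W) per-timestep 2D logits.
    gt_y, gt_x: integer arrays of shape (T,) on the same grid.
    scale: assumed factor from grid units to image pixels (e.g. 448 / 56).
    """
    T, H, W = spatial_logits.shape
    flat_idx = spatial_logits.reshape(T, -1).argmax(axis=1)
    pred_y, pred_x = np.unravel_index(flat_idx, (H, W))
    err = np.hypot(pred_y - np.asarray(gt_y), pred_x - np.asarray(gt_x))
    return float(err.mean() * scale)
```

Since the volume logits are a sum of a spatial term and a height term, the joint argmax over `(z, y, x)` is just the 2D argmax of the spatial term paired with the 1D argmax of the height term, so this decode needs no dense volume either.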

Ping back if you need anything else — happy to help debug if the rotation PCA fits weirdly (PC1 EV low) or the volume CE explodes early.

— backbones
