# Inbox

## 2026-05-19 — from backbones: PARA architecture has settled — dino_kv volume formulation

**Bottom line:** After ~36h of ablations on smith300, the production PARA architecture is
**DINOv3-S/16+ backbone with a factored KV-volume head** (no DA3, no diffusion, no
EEF-attention). On the train set it reaches **9.32 px** 2D pixel error and ~2 mm height error.
The matching izzy_home_first_record finetuned checkpoint has been handed to the mac agent
for inference deployment.

### Architecture details

- Backbone: DINOv3 ViT-S/16+ (30M params, embed_dim=384), stock Hugging Face weights
- Head: 1×1 conv refine → 48-d per-pixel features F ∈ R^(48 × 56 × 56)
- Volume produced by **bilinear scoring** against fixed sin/cos positional keys
  (D = 48 is the shared feature/key dimension):
  - `key(t, z) = sin/cos(t) + sin/cos(z)`, each encoding 48-d
  - `logit(b, t, z, u, v) = F_b(u, v) · key(t, z) / √D × exp(temperature)`
- Output: (B, T=8, Z=32, H=56, W=56) joint logits → CE per timestep on the GT voxel
- Plus MLP heads (gripper, rotation) indexed at GT pixel during training (teacher forcing)
- File: `model_dino_volume_kv_full.py`
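
A minimal NumPy sketch of the scoring and loss described in the list above, for readers who want the shapes spelled out. This is not the code in `model_dino_volume_kv_full.py`: the function names, the transformer-style frequency base of 10000, the random stand-in features, and the `(z, row, col)` target layout are all assumptions made for illustration.

```python
import numpy as np

def sincos_encoding(n_positions: int, dim: int) -> np.ndarray:
    """Transformer-style sin/cos encoding, shape (n_positions, dim).
    The 10000 frequency base is an assumption, not taken from the source."""
    pos = np.arange(n_positions)[:, None]                      # (N, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))    # (dim/2,)
    enc = np.zeros((n_positions, dim))
    enc[:, 0::2] = np.sin(pos * freqs)
    enc[:, 1::2] = np.cos(pos * freqs)
    return enc

def kv_volume_logits(feats, temperature=0.0, n_t=8, n_z=32):
    """feats: (B, D, H, W) per-pixel features -> logits (B, T, Z, H, W)."""
    d = feats.shape[1]
    # key(t, z) = enc(t) + enc(z), shape (T, Z, D); no learned parameters.
    keys = sincos_encoding(n_t, d)[:, None, :] + sincos_encoding(n_z, d)[None, :, :]
    scale = np.exp(temperature) / np.sqrt(d)
    # Dot product of every pixel feature with every (t, z) key.
    return np.einsum("bdhw,tzd->btzhw", feats, keys) * scale

def volume_ce(logits, gt_zrc):
    """Cross-entropy per timestep over the flattened Z*H*W volume.
    gt_zrc: (B, T, 3) integer targets as (z, row, col), an assumed layout."""
    b, t, z, h, w = logits.shape
    flat = logits.reshape(b, t, -1)
    flat = flat - flat.max(axis=-1, keepdims=True)             # numerical stability
    log_probs = flat - np.log(np.exp(flat).sum(axis=-1, keepdims=True))
    target = np.ravel_multi_index(
        (gt_zrc[..., 0], gt_zrc[..., 1], gt_zrc[..., 2]), (z, h, w))  # (B, T)
    picked = np.take_along_axis(log_probs, target[..., None], axis=-1)[..., 0]
    return -picked.mean()

# Usage with random stand-in features (B=2, D=48, 56x56 grid).
rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 48, 56, 56))
logits = kv_volume_logits(feats)
gt = np.stack([rng.integers(0, 32, (2, 8)),
               rng.integers(0, 56, (2, 8)),
               rng.integers(0, 56, (2, 8))], axis=-1)
loss = volume_ce(logits, gt)
print(logits.shape, float(loss))
```

The point of the factoring is that the head only has to emit 48 channels per pixel; the T×Z axis comes from the key table, which has zero parameters in the sin/sin setting.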

### What we learned (relevant for the paper narrative)

1. **Volume formulation > Heatmap-only** by ~4×. Adding the Z-axis joint supervision
   (DA3_pixel 58 px train → DA3_volume_v3 15 px) showed that giving the model richer
   per-sample labels (pixel + height jointly) improves even the 2D objective on its own.

2. **Head architecture > Backbone scale** for this task. Surprising result:
   - DINOv3-L (305M) + dense conv head: 9.88 px train
   - DINOv3-S/16+ (30M) + KV-factored head: **8.70 px** with learned/learned, **9.32 px** with sin/sin

   The 10× bigger backbone gets BEATEN by the smaller backbone with a better head. DA3-LARGE
   was over-engineered for this task. Worth a paragraph in the method section.

3. **Capacity vs generalization tradeoff in encodings:**
   - Learned embeddings fit training best (8.70 px) — full expressive power
   - Sinusoidal embeddings generalize best on val (val_pix=21.8 for sin/learned)
   - Cameron picked **sin/sin as the default** for the cleaner formulation (zero embedding
     params, ordinal prior on both axes). Train-fit gap of 0.6 px isn't load-bearing.

4. **The val_pix metric is misleading under multimodality.** Smith300 val has multiple plausible
   futures per scene, so val_pix bottoms out at ~20-30 px of irreducible ambiguity. We
   re-anchored on **train-set 2D argmax pix err** as the headline accuracy number. Memory
   updated. Worth flagging in the paper that val_pix doesn't tell the whole story; a
   closed-loop rollout success rate is the real metric.

5. **Soft-argmax decoding** at inference gives a free ~35% pix-err reduction (35.8 → 23.3 px
   on v3) versus argmax. Training with a soft-argmax L1 loss, however, collapses to the
   centroid of multimodal futures (visible in keypoints viz). So decoder choice matters:
   train with CE, decode with soft-argmax for tighter localization.

6. **Diffusion failed** for this task even after the [-1,1] normalization fix. Architectural
   issue (model never learns to use image conditioning, predicts marginal noise). Not the
   right tool for sparse keypoint heatmaps — CE volume is significantly easier to optimize.
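
To make the argmax vs soft-argmax point (item 5) concrete, here is a toy sketch on synthetic 56×56 heatmaps. It does not reproduce the 35.8 → 23.3 px numbers. The helper names, the Gaussian toy logits, and the `temperature` argument are assumptions for illustration only. It shows two things: on a single sharp mode soft-argmax recovers sub-pixel position that argmax rounds away, and on a two-mode map the soft-argmax expectation lands at the centroid between the modes, which is the same pull that hurts when it is used as a training loss.

```python
import numpy as np

def argmax_2d(logits):
    """Hard decode: (row, col) of the highest logit."""
    return np.array(np.unravel_index(np.argmax(logits), logits.shape), dtype=float)

def soft_argmax_2d(logits, temperature=1.0):
    """Soft decode: expected (row, col) under the softmax of the logits."""
    h, w = logits.shape
    z = logits / temperature
    p = np.exp(z - z.max())          # subtract max for numerical stability
    p /= p.sum()
    rows, cols = np.mgrid[0:h, 0:w]
    return np.array([(p * rows).sum(), (p * cols).sum()])

def gaussian_logits(h, w, center, sigma=1.5):
    """Toy log-density of an isotropic Gaussian blob (hypothetical test input)."""
    rows, cols = np.mgrid[0:h, 0:w]
    return -((rows - center[0]) ** 2 + (cols - center[1]) ** 2) / (2 * sigma ** 2)

# Case 1: one mode at a sub-pixel location.
true_center = np.array([20.4, 30.7])
uni = gaussian_logits(56, 56, true_center)
hard_uni, soft_uni = argmax_2d(uni), soft_argmax_2d(uni)
print("unimodal  argmax err:", np.linalg.norm(hard_uni - true_center))
print("unimodal  soft   err:", np.linalg.norm(soft_uni - true_center))

# Case 2: two equally likely futures; soft-argmax averages them.
modes = np.array([[10.0, 10.0], [40.0, 40.0]])
bi = np.logaddexp(gaussian_logits(56, 56, modes[0]), gaussian_logits(56, 56, modes[1]))
hard_bi, soft_bi = argmax_2d(bi), soft_argmax_2d(bi)
print("bimodal   argmax:", hard_bi, " soft-argmax:", soft_bi)
```

In the bimodal case the soft-argmax output sits on a location with almost no probability mass, which is why CE on the full volume is the safer training signal.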

### What we DIDN'T try (defer / future work)

- Past-N-frame history encoding (Cameron mentioned it; never got to it)
- MAP two-iteration denoising (the cheap diffusion variant)
- DA3-LARGE-1.1 / GIANT-1.1 (retrained HF variants, possibly better for in-lab scenes)
- Full closed-loop rollout success rate on real robot: the next milestone after the
  mac agent finishes the deploy script

### Status of the agent fleet

- **figure_maker**: building a method diagram for the dino_kv volume mechanism (inbox
  entry sent today; data pre-extracted at `/data/cameron/para/paper/figs/data/volume_kv_example.npz`)
- **mac**: copying the production checkpoint locally + updating the deploy script with
  the new volume inference setup (inbox entry sent today; checkpoint at
  `/data/cameron/para/libero/checkpoints/da3_dino_kv_full_izzy_ft/latest.pth`)
- **backbones** (me): leaderboard finalised, awaiting next direction. No active runs.

### What CLAUDE.md / EXPERIMENTS.md should be updated to reflect

- Architecture section in CLAUDE.md still describes the old DINO + 1×1 conv volume head
  with N_WINDOW=12, N_HEIGHT_BINS=32. The new arch:
  - N_WINDOW = 8 (not 12)
  - N_HEIGHT_BINS = 32 (unchanged)
  - Pred grid = 56×56 (was 64×64)
  - Backbone = DINOv3 ViT-S/16+ (was unspecified)
  - Head = factored KV (not dense 1×1 conv to T×Z channels)
  - Loss = volume CE only (depth distillation dropped; it was DA3-specific)

- Saved memory updates: `feedback_da3_large_default`, `feedback_train_argmax_metric`,
  `feedback_latest_not_best_ckpt`, `feedback_default_arch_dino_kv_sin_sin`.

- EXPERIMENTS.md should add the smith300 leaderboard (full table in
  `/data/cameron/agents_stuff/agents/backbones/status.md` — see "Final standings"
  near the top of the file).
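
For the CLAUDE.md update, the new architecture constants from the list above could be kept in one place. The sketch below is only a suggestion: the class name and every field name are hypothetical (only `N_WINDOW` and `N_HEIGHT_BINS` appear in the existing docs), and the values are copied from this entry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParaVolumeConfig:                      # hypothetical name
    n_window: int = 8                        # N_WINDOW (was 12)
    n_height_bins: int = 32                  # N_HEIGHT_BINS (unchanged)
    pred_size: int = 56                      # prediction grid side (was 64)
    feat_dim: int = 48                       # per-pixel feature / key dimension
    backbone: str = "DINOv3 ViT-S/16+"
    head: str = "factored_kv"                # not a dense 1x1 conv to T*Z channels
    loss: str = "volume_ce"                  # depth distillation dropped

    def volume_shape(self, batch_size: int) -> tuple:
        """Expected logits shape (B, T, Z, H, W) for a given batch size."""
        return (batch_size, self.n_window, self.n_height_bins,
                self.pred_size, self.pred_size)

cfg = ParaVolumeConfig()
print(cfg.volume_shape(4))
```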

### Suggested narrative for the method section

Given the train-fit findings, the paper's method section should probably emphasise:
1. The volume formulation (joint pixel+height) as the key insight enabling richer training signal.
2. The KV-factored head as the parameter-efficient instantiation that does most of the work.
3. The sin/sin positional encoding as the principled choice — ordinal prior on time and
   height bins, zero extra params, defensible to reviewers without arch ablations dominating
   the paper.
4. Briefly mention that bigger backbones / dense heads were ablated and found worse —
   a small results table fits the head > backbone story without paper bloat.

If you want a fuller writeup with numbers, the source-of-truth document is
`/data/cameron/agents_stuff/agents/backbones/status.md` (top of file).