---
## New investigation — autoregressive transformer policy with high-res patches + relative-to-EEF positional embeddings (2026-05-16)

From Cameron via manager. **Read this whole spec before starting.**

### Diagnosis Cameron gave

Real-world rollouts show two failure modes in the current `TrajectoryHeatmapPredictor`:

1. **Free-space points get skipped.** Predicted EEF jumps from the current EEF location directly to the grasp target, ignoring the intermediate free-space trajectory. Hypothesis: free-space image features are not semantically distinctive (gripper-above-table looks the same regardless of pose), so the heatmap can't tell "EEF at position A mid-air" from "EEF at position B mid-air" — the network learns to put mass on the salient destination instead.
2. **Independent per-timestep argmax → no temporal smoothness.** The current model emits 6 future heatmaps in one shot (`volume_head` → `(B, n_window, N_HEIGHT_BINS, H, W)` in `model.py:79`). Each timestep's argmax is independent, so the trajectory jitters back and forth between local modes.

### Proposed architecture (Cameron's framing, with details filled in)

**Autoregressive transformer over high-res patches**, with a learnable EEF token per timestep cross-attending to the spatio-temporal patch stack. Context length: ~20 past timesteps. Predict one EEF position per step; at inference, sample, feed back as past EEF, continue.

Token sequence (causal along time):
```
[Patches(t-20), EEF(t-20)], [Patches(t-19), EEF(t-19)], ..., [Patches(t-1), EEF(t-1)], [Patches(t), EEF_query(t)]
```

- **Per-frame patches**: DINOv2 features (probably frozen: cache-friendly, and saves memory for long context). With patch size 14, a 28×28 = 784 patch grid corresponds to a 392×392 input; a 448×448 input would give 32×32 = 1024 patches/frame. At 784 patches/frame, (20 past + 1 current) × 784 ≈ 16.5K tokens, which is heavy but tractable with FlashAttention at batch 2–4 on an RTX 6000. **Recommend**: start with 8 past frames at 28×28, scale up once it works. Optional optimization later: high-res for the 3–4 most recent frames, downsampled (14×14) for the older ones.
- **EEF tokens**: 1 learnable token per timestep, with its actual 2D EEF coord injected via positional embedding. The EEF token at the predicted step is a learnable `[EEF_query]` token.
- **Positional embeddings (three kinds, summed):**
  - Temporal (which timestep, 0–20)
  - Standard 2D for patch position
  - **Relative-to-current-EEF**: `(patch_xy - EEF(t-1)_xy)` — this is the "all tokens have relative 2D positional embedding to current EEF position" Cameron asked for. Use EEF(t-1) as the anchor (the most recent *known* EEF). This is the key inductive bias for fine-grained motion in free space — the model can think "the patch 30 px to my left" instead of "the patch at absolute coord (192, 240)".
- **Attention**: causal mask along time dimension. Within a frame, full bidirectional between EEF token and patch tokens.
- **Output head**: take the `[EEF_query(t)]` token's final hidden state → MLP → logits over the current frame's high-res spatial grid (e.g., 56×56 = 3136 classes, or directly per-pixel at 224×224). Treat next-EEF as categorical classification over a spatial grid; sample/argmax → next EEF position.
- **Loss**: cross-entropy over the spatial grid, target = grid cell containing GT next EEF. Optional auxiliary: small smoothness penalty on multi-step rollout during training (L2 between consecutive predicted positions).
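
A minimal sketch of the attention mask described in the list above, assuming each frame's tokens (its patches followed by its EEF token) sit contiguously in the flattened sequence. `block_causal_mask` is a hypothetical helper name, not something that exists in the repo.

```python
import torch

def block_causal_mask(n_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask of shape (L, L), with L = n_frames * tokens_per_frame.

    True means "query (row) may attend to key (column)". A token sees every
    token in its own frame and in earlier frames, never in later frames.
    """
    # Frame index of each token in the flattened sequence.
    frame_id = torch.arange(n_frames).repeat_interleave(tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]

# Tiny example: 3 frames, each with 2 patch tokens + 1 EEF token.
mask = block_causal_mask(n_frames=3, tokens_per_frame=3)
```

The mask follows the `torch.nn.functional.scaled_dot_product_attention` convention (True = may attend). `nn.MultiheadAttention` and `nn.TransformerEncoder` use the opposite convention for boolean masks, so pass `~mask` there.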

### Why this should address both failure modes

- **Free-space coverage** comes from the relative-to-EEF positional embedding + autoregressive conditioning on past EEF tokens: the model now has explicit access to "where was I just a moment ago" and predicts a small displacement from that, rather than guessing an absolute destination from one frame.
- **Smoothness** comes from autoregression itself + the small-displacement bias.
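
One way the relative-to-EEF embedding could look, as a sketch only: sinusoidal features of the pixel offset `(patch_xy - EEF(t-1)_xy)`, half the channels for dx and half for dy. The function names, the frequency schedule, and `dim=384` (borrowed from the starting `d_model` below) are assumptions for illustration, and the frequency scale would likely need tuning for pixel-sized offsets.

```python
import math
import torch

def sincos_1d(x: torch.Tensor, dim: int, max_period: float = 10000.0) -> torch.Tensor:
    """Sinusoidal features of a real-valued tensor: (...,) -> (..., dim). dim must be even."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    angles = x[..., None] * freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

def eef_relative_pe(patch_xy: torch.Tensor, eef_xy: torch.Tensor, dim: int) -> torch.Tensor:
    """patch_xy: (N, 2) patch centers in pixels; eef_xy: (2,) anchor, i.e. EEF(t-1).
    Returns (N, dim): first half encodes dx, second half encodes dy."""
    offset = patch_xy - eef_xy
    return torch.cat([sincos_1d(offset[:, 0], dim // 2),
                      sincos_1d(offset[:, 1], dim // 2)], dim=-1)

# Centers of a 4x4 grid of 14-pixel patches (toy size, for illustration).
grid = torch.arange(4, dtype=torch.float32) * 14 + 7
ys, xs = torch.meshgrid(grid, grid, indexing="ij")
patch_xy = torch.stack([xs.flatten(), ys.flatten()], dim=-1)   # (16, 2)
pe = eef_relative_pe(patch_xy, eef_xy=torch.tensor([21.0, 35.0]), dim=384)
```

Because only the offset enters the encoding, shifting the scene and the EEF by the same amount leaves `pe` unchanged, which is the inductive bias the bullet above is after.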

### Concrete starting setup (don't try to do all of this at once)

- **Task**: pick one LIBERO task with a long free-space movement so the failure mode is visible — Cameron's call which exact task; LIBERO-spatial task 0 is the usual starting point but LIBERO-long or LIBERO-goal will show the issue more clearly. Ask him.
- **Repo location**: new file `/data/cameron/para/libero/model_autoregressive.py` (sibling to the existing `model.py`, `model_act.py`, etc.). Add a new `--model_type autoregressive` branch to `train.py` and a new data sampler that yields the 8-frame (later 20-frame) history window — either extend `data.py` or add `data_ar.py`.
- **Backbone**: frozen DINOv2-base/14 (matches existing PARA setup; cacheable per-episode if you want to speed up training).
- **Transformer**: 4–6 layers, d_model=384, 6 heads. Small to start; scale once you see signal.
- **Context**: 8 past frames (not 20) for the first run. Validate compute, then push to 20.
- **Training**: teacher forcing during training (use GT past EEF positions). Autoregressive rollout only at eval.
- **Eval metrics** (add these to eval.py):
  - Trajectory **jerk** proxy (mean ‖ΔΔp‖, the second difference of consecutive predicted positions): should be lower than baseline
  - **Free-space coverage**: fraction of GT trajectory points within ε pixels of predicted trajectory (not just endpoints)
  - Standard rollout success rate (vs. existing PARA baseline)
- **Reference checkpoint to beat**: whichever PARA checkpoint is the current LIBERO baseline (likely `/data/cameron/para_normalized_losses/libero/checkpoints/para_v2_exp3_near/best.pth` per your recent figure_maker handoff).
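
A rough sketch of the two trajectory metrics listed above, assuming trajectories are `(T, 2)` arrays of pixel positions. The function names are hypothetical, and the coverage definition used here (nearest predicted point, not time-aligned) is one possible reading of the bullet.

```python
import numpy as np

def mean_second_difference(traj: np.ndarray) -> float:
    """traj: (T, 2) pixel positions. Mean of ||p[t+1] - 2*p[t] + p[t-1]||."""
    dd = np.diff(traj, n=2, axis=0)
    return float(np.linalg.norm(dd, axis=1).mean())

def free_space_coverage(gt: np.ndarray, pred: np.ndarray, eps: float) -> float:
    """Fraction of GT points with at least one predicted point within eps pixels."""
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)   # (T_gt, T_pred)
    return float((dists.min(axis=1) <= eps).mean())

# Toy check: a straight-line GT path vs. a prediction that jumps to the target.
gt = np.stack([np.linspace(0, 100, 11), np.zeros(11)], axis=1)
skipping = np.array([[0.0, 0.0], [100.0, 0.0]])
coverage_when_skipping = free_space_coverage(gt, skipping, eps=5.0)
```

On the toy data only the two endpoints are covered, so the skipped-free-space failure mode shows up as low coverage even though the endpoint is correct.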

### Open design choices — flag back to Cameron before locking in

1. **Predicted-EEF vs. teacher-forcing during inference** — true AR (feed predictions) compounds errors; teacher-force with running estimate is safer for long horizons. Probably hybrid: TF for first few steps, AR after.
2. **Output discretization granularity** — 56×56 grid is reasonable; finer = more classes = harder optim. Could also use a coarse-to-fine head (predict patch, then offset within patch).
3. **Whether to keep multi-step `n_window=6` prediction or strictly single-step autoregressive** — Cameron said "autoregressive per timestep" → strictly 1-step. Confirms simpler design.
4. **DINO frozen vs. trainable** — start frozen; unfreeze only if loss plateaus high.
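
For design choice 2, a minimal sketch of the plain (single-level) grid classification target, so the quantization error is concrete. It assumes a 56×56 grid over a 448-pixel square image and row-major class indices; `xy_to_cell` and `cell_to_xy` are hypothetical helpers.

```python
import torch
import torch.nn.functional as F

GRID = 56    # cells per side (the option named above)
IMG = 448    # image side in pixels; assumed here for illustration

def xy_to_cell(xy: torch.Tensor) -> torch.Tensor:
    """(B, 2) pixel coords -> (B,) flat class index, row-major (y * GRID + x)."""
    cell = (xy / IMG * GRID).long().clamp(0, GRID - 1)
    return cell[:, 1] * GRID + cell[:, 0]

def cell_to_xy(idx: torch.Tensor) -> torch.Tensor:
    """(B,) flat class index -> (B, 2) pixel coords of the cell center."""
    x = (idx % GRID).float()
    y = torch.div(idx, GRID, rounding_mode="floor").float()
    return (torch.stack([x, y], dim=-1) + 0.5) * IMG / GRID

gt_xy = torch.tensor([[100.0, 300.0], [447.0, 0.0]])
target = xy_to_cell(gt_xy)
logits = torch.randn(2, GRID * GRID)      # stand-in for the MLP output
loss = F.cross_entropy(logits, target)
```

Under these assumptions each cell is 8 pixels wide, so decoding to the cell center is off by at most 4 pixels per axis; a coarse-to-fine head would trade that error against the number of classes.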

### First milestone (1–2 days)

Get the model training on one task with 8-frame context, frozen DINO, teacher-forced. Don't worry about eval yet. Show me:
- Training loss curve (wandb)
- Predicted vs. GT trajectory overlay on 2–3 validation episodes
- Memory / step time at batch=2 and batch=4

Once the loss is going down and trajectories look sensible, expand to 20-frame context, add eval metrics, run rollouts.

### Why this is the right next step (and what's risky)

It's the right next step because the failure modes Cameron described are structural in the current single-frame heatmap design — no amount of more data fixes "the network can't see its own past trajectory". This architecture gives it that.

What's risky: (a) compute cost at 20-frame context, (b) AR rollouts compound prediction errors so the inference policy may be brittle even if training loss is low, (c) the spatial-grid categorical output may not generalize as well as a continuous heatmap.

If after a week you can't get this to beat the single-frame PARA baseline on the chosen task, ping me and we'll reassess — there are simpler fixes (path-prior heatmap targets, temporal-smoothing post-hoc) that we should consider before scaling further.

— manager

---
## Extension — voxel-token variants + 3-setting eval matrix (2026-05-17)

From Cameron via manager. **He explicitly said: don't come back with questions, use best judgement, come back with evaluations on the 3 settings.** This is the full scope.

### Acknowledging your progress

v1 is trained and the open-loop diagnostic shows the free-space-coverage failure mode is gone (100% / 99.85% on spatial / goal). Solid. v2 mid-training. The remaining open items (closed-loop env eval, height/rot/grip heads, baseline comparison) are now load-bearing for what Cameron wants next.

### What's new

Cameron wants **3 models × 3 settings = 9 closed-loop evals**, in a single comparison table. The model family expands from 1 (the AR you have) to 3 (yours + two voxel-token variants).

### Two new model variants — voxel-token AR

The pitch (Cameron's words, rephrased): **the action space IS the token space.** Take the same 56×56×32 pixel-aligned volume your existing `volume_head` produces (image-aligned x,y + 32 height bins for unprojection), but treat each voxel as a transformer token. The token feature is:

```
token_feat[x,y,z] = Linear(PositionalEncoding(xyz)) + image_feature[x_pix, y_pix]
                    └─────── geometry channel ──────┘   └─── appearance ───┘
```

Image features come "for free" because the volume is image-aligned: each (x,y,z) cell projects to the same pixel as (x,y,0). So no explicit projection step.

**Variant B — absolute xyz**: `PositionalEncoding(xyz)` where xyz is the unprojected world coordinate of the voxel.

**Variant C — EEF-relative xyz**: `PositionalEncoding(xyz - eef_start_xyz)` where `eef_start_xyz` is the EEF position at the start of the trajectory (t=0 of the episode / first frame of the context window — pick one and document it). The delta volume gives translational equivariance for free, which is exactly what should help with the position-OOD setting (iii).
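
A sketch of the token construction for both variants, assuming a `(Z, H, W)` volume layout and a simple power-of-two frequency encoding for `PositionalEncoding`. `VoxelTokenizer`, `n_freqs`, and the toy `eef_start_xyz` value are illustrative assumptions, not existing code.

```python
import torch
import torch.nn as nn

class VoxelTokenizer(nn.Module):
    """token = Linear(PE(xyz)) + image feature at the voxel's pixel."""
    def __init__(self, d_model: int, n_freqs: int = 8):
        super().__init__()
        # Frequencies 1, 2, 4, ... applied to each of the three coordinates.
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs))
        self.proj = nn.Linear(3 * 2 * n_freqs, d_model)

    def forward(self, xyz: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # xyz: (Z, H, W, 3) per-voxel coordinates; img_feat: (H, W, d_model).
        ang = xyz[..., None] * self.freqs                             # (Z, H, W, 3, F)
        pe = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)    # (Z, H, W, 6F)
        geom = self.proj(pe)                                          # (Z, H, W, d)
        # Image-aligned volume: every height bin at (y, x) reuses the same pixel feature.
        tokens = geom + img_feat[None]                                # broadcast over Z
        return tokens.flatten(0, 2)                                   # (Z*H*W, d)

Z, H, W, d = 4, 7, 7, 32                        # toy sizes
tok = VoxelTokenizer(d)
xyz = torch.rand(Z, H, W, 3)                    # stand-in for unprojected coordinates
img_feat = torch.randn(H, W, d)
eef_start_xyz = torch.tensor([0.5, 0.5, 0.5])   # assumed value for illustration
tokens_abs = tok(xyz, img_feat)                 # variant B
tokens_rel = tok(xyz - eef_start_xyz, img_feat) # variant C
```

The two variants share every weight and differ only in the coordinates fed to the geometry channel, which keeps the B-vs-C comparison clean.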

### Compute discipline for voxel variants

56×56×32 = 100,352 voxels per frame. Full self-attention is infeasible. Use **cross-attention only**:

- Past 20 EEF tokens (small, ~20) form the causal context
- Current frame's voxels (~100k) are KV-only for the EEF query token
- EEF query at step t cross-attends to (past EEF tokens with causal mask) ∪ (current voxels, all visible)
- No voxel→voxel attention

This is essentially a Perceiver-IO setup: small latent (EEF tokens) reading from a large input (voxel grid). Total compute is O(K × V) not O(V²). With K=21 query+context tokens and V=100k voxels, that's ~2M attention scores per layer — comfortable.

If 100k voxels is still too tight memory-wise, drop to 28×28×16 = 12,544 voxels (half the spatial, half the height resolution). Don't go lower than that — defeats the point.
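
A minimal sketch of one cross-attention readout layer in the style described above, assuming a single EEF query per forward pass (so every past EEF token is already in the past and no causal mask is needed). `EEFReadout`, `d_model=128`, and `n_heads=4` are placeholder choices for illustration.

```python
import torch
import torch.nn as nn

class EEFReadout(nn.Module):
    """One cross-attention layer: the EEF query reads past EEF tokens + voxel tokens."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query, past_eef, voxels):
        # query: (B, 1, d)   past_eef: (B, K, d)   voxels: (B, V, d)
        kv = torch.cat([past_eef, voxels], dim=1)          # (B, K + V, d)
        # Voxels only ever appear as keys/values, so there is no voxel-to-voxel attention.
        out, _ = self.attn(query, kv, kv, need_weights=False)
        return self.norm(query + out)

B, K, V, d = 2, 20, 28 * 28 * 16, 128      # the reduced 12,544-voxel grid
readout = EEFReadout(d)
with torch.no_grad():
    out = readout(torch.randn(B, 1, d), torch.randn(B, K, d), torch.randn(B, V, d))
```

The score matrix per head is `(1, K + V)`, so cost grows linearly in the voxel count. Training all timesteps in parallel would additionally need a causal mask over the past-EEF part of the keys.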

### The 3 eval settings

Each setting = a different training data split + matching test split. So each model gets trained 3 times (once per setting).

| Setting | Training data | Eval data | Notes |
|---|---|---|---|
| **(i) In-distribution** | Standard `libero_spatial --task_id 0` train demos | Same task, standard `eval.py --teleport --zero_rotation --max_steps 600` | Baseline check — must be competitive with single-frame PARA reference |
| **(ii) Viewpoint OOD** | `/data/libero/ood_viewpoint_v3_splits/<train>` | `/data/libero/ood_viewpoint_v3_splits/<test>` | Same pattern as `para_v2_exp3_*` runs. The vp_outer_half setup from your earlier figure_maker handoff. |
| **(iii) Left/right position OOD** | `/data/libero/ood_objpos_task0_train` | `/data/libero/ood_objpos_task0_test` | The exp3 left-vs-right split. Use the same `eval.py` flags as setting (i). |

**Eval = closed-loop env rollout success rate.** Open-loop demo replay (what you've been running for diagnostics) is not the final metric — Cameron means `eval.py` with the LIBERO env loop. **This requires the height/rot/grip heads** (your open task #11). Do those first.

### The hypothesis being tested

- Setting (i) — all three models should do roughly the same (in-dist). If voxel variants are much worse, something is wrong with the architecture.
- Setting (ii) viewpoint OOD — voxel-token models should do better than the 2D AR because unprojected xyz is viewpoint-invariant in world coords, while patch positions are view-dependent. Variant B and C should both help; not obvious which more.
- Setting (iii) position OOD — **C should dominate**. EEF-relative coordinates mean the model sees "reach an object 30 cm to my front-right" rather than "reach world position (0.4, -0.2)". That should transfer across left/right.

### Required deliverable

Single comparison table:

|             | (i) in-dist | (ii) viewpoint OOD | (iii) pos OOD |
|---|---|---|---|
| (A) 2D AR (your v2)        | %SR | %SR | %SR |
| (B) Voxel + abs xyz        | %SR | %SR | %SR |
| (C) Voxel + EEF-relative   | %SR | %SR | %SR |

Plus the reference PARA single-frame baseline in each cell for context. Plus trajectory smoothness/coverage metrics for the AR rows (you already have the eval code for those).

### Suggested checklist (paste into your TaskList and update as you go)

```
[ ] Bolt height/rot/grip heads onto v2 AR — needed for closed-loop env eval
[ ] Train (A) v2 on three settings:
    [ ] in-dist, viewpoint OOD train split, position OOD train split
[ ] Closed-loop env eval (A) on three matching test splits
[ ] Implement model_voxel_ar.py — Perceiver-IO style; abs-xyz PE
[ ] Train (B) on three settings
[ ] Closed-loop env eval (B) on three test splits
[ ] Implement model_voxel_ar_rel.py — same skeleton, swap PE to (xyz - eef_start_xyz)
[ ] Train (C) on three settings
[ ] Closed-loop env eval (C) on three test splits
[ ] Build final 3×3 comparison table + trajectory diagnostics
[ ] Report to manager via outbox.md, include training cmds + ckpt paths
```

Update `status.md` as you progress through this list so we can see where you are without pinging.

### Operating instructions (Cameron's emphasis)

- **Do not ping back with questions.** Use your best judgement on every design choice not pinned down above. If something is genuinely blocking — like an OOM you can't size around or a data split that doesn't exist — flag it in `status.md` as `blocked` with a one-line reason, but don't expect a response; just route around if you can.
- **Order of operations**: do the height/rot/grip heads + closed-loop eval for (A) first. That's the longest pole AND it validates the eval harness for all three models. Then (B), then (C). Don't try to parallelize the model development across (A)/(B)/(C) — sequence them.
- **Budget**: 9 trainings × ~10 min + 9 evals × ~20 min = 270 min ≈ 4.5 hours of compute. If voxel variants take longer to train, that's fine; note it.
- **Default to short pilot runs** (e.g., 500 steps, --max_minutes 3) for the voxel variants before committing to a full training run — confirms the architecture trains at all before burning real GPU.

— manager

---
## Model surgery — refactor grip/rot head + PCA-1D rotation (2026-05-20)

From Cameron via manager. **Cameron's diagnosis:** denser trajectories look really good (your AR + volume-KV work paid off), but gripper and rotation values are not as good. He observed the gripper oscillating open/closed and rotation pegged to a narrow range in the izzy deploy. He wants two specific architectural fixes, then a retraining/finetune.

### Current state (what I confirmed by reading the code)

`model_dino_volume_kv_full.py` already has the two-stage shape Cameron is asking for: volume head predicts spatial+height, then `predict_from_keypoints` indexes `pixel_feats` at teacher-forced `kp_zyx` and runs a tiny transformer (`TF_LAYERS=2`) over the T=8 trajectory tokens before emitting grip + rot. So the structural framing of "EEF feature predicts spatial only, then a second transformer over teacher-forced features predicts grip/rot" is **mostly already there.** What Cameron wants is to deepen and clean up that head, plus rework the rotation target.

Concretely:

### Change 1 — head architecture (`model_dino_volume_kv_full.py`)

Three specific edits to `DinoVolumeKVFull`:

1. **`TF_LAYERS = 2` → `TF_LAYERS = 5`** (Cameron explicitly said "a few (~5) layer transformer"). Keep `d_model=128`, `nhead=4`, `dim_ff = 4 × d_model`. If memory or runtime gets tight, you can drop to 4, but try 5 first.

2. **Replace learned `z` and `t` embeddings with sinusoidal PE.** Currently:
   ```python
   self.z_tok_emb = nn.Embedding(n_height_bins, d_model)
   self.t_tok_emb = nn.Embedding(n_window,     d_model)
   ```
   should become sinusoidal encoders that index `n_height_bins` (z) and `n_window` (t). Use the **same sin formulation** the parent `DinoVolumeKV` already uses for `height_enc='sin'` and `time_enc='sin'` — that's what Cameron means by "the same sinusoidal height and timestep embeddings". Keep `y_tok_emb` and `x_tok_emb` as-is for now (or drop them — pixel features already carry y/x positional info implicitly since they're indexed at (y, x); your call — but if you drop them, ablate to confirm it doesn't hurt).

3. **PCA-1D rotation output.** Currently:
   ```python
   self.rotation_out = nn.Linear(d_model, 3 * n_rot_bins)
   # → rotation_logits: (B, T, 3, n_rot_bins)  CE per euler axis
   ```
   should become:
   ```python
   self.rotation_out = nn.Linear(d_model, n_rot_bins)
   # → rotation_logits: (B, T, n_rot_bins)  CE over 1D PCA axis
   ```
   Bump `N_ROT_BINS` from 32 to ~48 or 64 if you want — since you're now using a single axis with more dynamic range, finer bins are cheap and probably worth it.
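
For edit 2, a sketch of a drop-in replacement for the two `nn.Embedding` tables using the standard transformer sin/cos formulation. This is an assumption: the parent `DinoVolumeKV` may use a different frequency schedule for `height_enc='sin'` / `time_enc='sin'`, in which case that one should be copied instead. `SinusoidalIndexEmbedding` is a hypothetical name.

```python
import math
import torch
import torch.nn as nn

class SinusoidalIndexEmbedding(nn.Module):
    """Drop-in for nn.Embedding(num_positions, d_model) with a fixed sin/cos table."""
    def __init__(self, num_positions: int, d_model: int):
        super().__init__()
        pos = torch.arange(num_positions, dtype=torch.float32)[:, None]
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        table = torch.zeros(num_positions, d_model)
        table[:, 0::2] = torch.sin(pos * div)
        table[:, 1::2] = torch.cos(pos * div)
        # Buffer, not Parameter: moves with .to(device) but is never trained.
        # persistent=False keeps it out of the state dict.
        self.register_buffer("table", table, persistent=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.table[idx]

z_tok_emb = SinusoidalIndexEmbedding(num_positions=32, d_model=128)   # n_height_bins
t_tok_emb = SinusoidalIndexEmbedding(num_positions=8, d_model=128)    # n_window
z_feat = z_tok_emb(torch.tensor([[0, 5], [31, 7]]))                   # (2, 2, 128)
```

With a non-persistent buffer the module contributes no state-dict keys, so the old `z_tok_emb.weight` / `t_tok_emb.weight` entries in a v7 checkpoint simply show up as unexpected keys.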

### Change 2 — PCA rotation target (data pipeline + decode)

**Precompute step (one-time, runs before training).** New script `precompute_rotation_pca.py` next to the training script. For the dataset being trained on (e.g., izzy first_mobile_collection):
1. Walk every demo, load `eef_euler_t` (or whatever rotation array the dataset uses — confirm from `data_smith300_volume.py`). Stack into `(N, 3)` array.
2. Compute `μ = mean(rot, axis=0)`, then PCA via SVD on `(rot - μ)`. Take the **PC1 explained-variance ratio** `ev1 = s[0]² / Σ s²`, where `s` are the singular values.
3. **Sanity gate**: if PC1 explained variance < 0.85, abort and ping me: the data has too much rotational variance for 1D collapse to work. (Cameron is betting it's narrow-domain; verify before committing.) If ≥ 0.85, proceed.
4. Project: `pca_1d = (rot - μ) @ v1`, where `v1` is the first principal axis (the first right singular vector of the SVD, equivalently the top eigenvector of the covariance; shape (3,)). Find `pca_min`, `pca_max` from the dataset.
5. Save `{mean: μ, principal_axis: v1, pca_min, pca_max, ev_ratio: ev1}` to a `.npz` next to the dataset (e.g., `rotation_pca_basis.npz`). Document the file location in a comment in the training script.
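
The five steps above could be sketched roughly as follows. The synthetic `rot` array stands in for the stacked demo rotations, `fit_rotation_pca` and `save_basis` are hypothetical names, and the sketch assumes the euler angles do not wrap around ±π within the dataset (wraparound would inflate the variance artificially).

```python
import numpy as np

EV_GATE = 0.85   # threshold from the sanity gate above

def fit_rotation_pca(rot: np.ndarray) -> dict:
    """rot: (N, 3) euler angles. Returns the 1D PCA basis described above."""
    mu = rot.mean(axis=0)
    # Rows of vt are principal axes; singular values s are sorted descending.
    _, s, vt = np.linalg.svd(rot - mu, full_matrices=False)
    ev_ratio = float(s[0] ** 2 / np.sum(s ** 2))
    if ev_ratio < EV_GATE:
        raise RuntimeError(
            f"PC1 explains only {ev_ratio:.3f} of the variance (< {EV_GATE}); aborting")
    v1 = vt[0]
    proj = (rot - mu) @ v1
    return dict(mean=mu, principal_axis=v1, pca_min=proj.min(),
                pca_max=proj.max(), ev_ratio=ev_ratio)

def save_basis(path: str, basis: dict) -> None:
    """Write the basis as an .npz next to the dataset."""
    np.savez(path, **basis)

# Synthetic stand-in: rotations that vary mostly along one direction.
rng = np.random.default_rng(0)
axis = np.array([0.6, 0.0, 0.8])
rot = 0.5 * rng.standard_normal((1000, 1)) * axis + 0.02 * rng.standard_normal((1000, 3))
basis = fit_rotation_pca(rot)
```

The sign of `principal_axis` is arbitrary in an SVD, which is harmless as long as the same saved basis is used for both binning and decoding.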

**Data loader.** In `data_smith300_volume.py` (and `data_da3_volume.py` if you train on libero too), replace per-axis binning:
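```python
# OLD: per-axis euler binning
self.rot_bin_t = self._bin_rotation(self.eef_euler_t)   # (N, 3) long, range [0, 32)
# NEW: project onto the PCA axis, then bin the 1D value
proj = (self.eef_euler_t - self.pca_mean) @ self.pca_v1            # (N,) float
norm = (proj - self.pca_min) / (self.pca_max - self.pca_min + 1e-8)
# floor(norm * N_ROT_BINS) so each bin's center is (bin + 0.5) / N_ROT_BINS,
# matching the bin-center decode used at eval/deploy time
self.rot_bin_t = (norm.clamp(0, 1) * N_ROT_BINS).long().clamp(max=N_ROT_BINS - 1)   # (N,) long
```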
Load the PCA basis from the .npz file at dataset init.

**Training loss** (`train_dino_volume_kv_full.py` or whichever script): replace the 3-axis CE mean with a single CE over the 1D bin axis. Drop the `per-axis` loop in `rotation_dist_strip` and the equivalent in the loss path.

**Decode** (eval + deploy): predicted bin → 1D pca value → 3D euler:
```python
pca_val = pca_min + (bin + 0.5) / N_ROT_BINS * (pca_max - pca_min)
euler_pred = pca_mean + pca_val * pca_v1     # (3,)
```
Add this decode helper to a shared utils file so both lab-side eval and mac-side deploy can import it. **Update the `_stats.json` on the deploy side** with the new PCA basis so mac's `deploy_dino_kv.py` decodes correctly — coordinate with mac via outbox/inbox.
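
A sketch of what the shared helper could look like in plain NumPy, so neither side needs the training code to import it. It assumes the loader bins with `floor(norm * N_ROT_BINS)`, which is the convention under which the `(bin + 0.5) / N_ROT_BINS` decode returns bin centers. `encode_bin`, `decode_rotation`, and the toy basis values are illustrative.

```python
import numpy as np

N_ROT_BINS = 64

def encode_bin(pca_val, pca_min, pca_max, n_bins=N_ROT_BINS):
    """1D PCA value -> integer bin in [0, n_bins), using floor(norm * n_bins)."""
    norm = (np.asarray(pca_val) - pca_min) / (pca_max - pca_min + 1e-8)
    return np.minimum((np.clip(norm, 0.0, 1.0) * n_bins).astype(np.int64), n_bins - 1)

def decode_rotation(rot_bin, pca_mean, pca_v1, pca_min, pca_max, n_bins=N_ROT_BINS):
    """Integer bin -> (..., 3) euler angles at the bin center."""
    pca_val = pca_min + (np.asarray(rot_bin) + 0.5) / n_bins * (pca_max - pca_min)
    return pca_mean + pca_val[..., None] * pca_v1

# Round trip on a toy basis: axis-aligned PC1 spanning [-1, 1].
pca_mean, pca_v1 = np.zeros(3), np.array([1.0, 0.0, 0.0])
vals = np.linspace(-1.0, 1.0, 201)
bins = encode_bin(vals, -1.0, 1.0)
recon = decode_rotation(bins, pca_mean, pca_v1, -1.0, 1.0)[..., 0]
max_err = np.abs(recon - vals).max()
```

With matching conventions the round-trip error is bounded by half a bin width, here `(pca_max - pca_min) / (2 * N_ROT_BINS) = 1/64`; a larger error on real data is a quick signal that the loader and the decoder disagree.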

### Retraining / finetuning plan

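The architectural changes are surgical, but the head's state dict changes (`rotation_out` shape, transformer depth, z/t embedding types), so a checkpoint load will leave most of the head re-initialized. Note that `strict=False` only tolerates missing and unexpected keys; a tensor whose shape changed (`rotation_out`) still raises in `load_state_dict`, so drop shape-mismatched entries from the checkpoint dict before loading. The volume head (which is the load-bearing one) loads cleanly. Two paths: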

- **Fast path — finetune**: load most recent good checkpoint (looks like `checkpoints/da3_dino_kv_full_izzy2_transformer_v7/best.pth` per the dir listing), `strict=False`, freeze the DINOv3 backbone and the volume head for the first ~500 steps to let the rebuilt head catch up, then unfreeze. Should converge fast since the volume head and image features are unchanged.
- **Clean path — train from scratch**: same data + new architecture from scratch. Cleaner ablation but slower (likely 1–2 hrs on RTX 6000).

Default to the fast path. If you see weirdness (volume head loss creeping up after unfreeze), switch to clean path.
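
A sketch of the fast-path load, assuming the checkpoint holds a plain state dict. `load_matching` and `set_trainable` are hypothetical helpers, and the two `nn.ModuleDict` objects are toy stand-ins for the v7 and rebuilt models, not the real `DinoVolumeKVFull`.

```python
import torch
import torch.nn as nn

def load_matching(model: nn.Module, ckpt_state: dict) -> list:
    """Load every checkpoint tensor whose name and shape match the model.
    Returns the names that were skipped or left at their fresh initialization."""
    own = model.state_dict()
    keep = {k: v for k, v in ckpt_state.items()
            if k in own and v.shape == own[k].shape}
    # strict=False tolerates missing/unexpected keys but still raises on a
    # shape mismatch, so mismatched tensors are filtered out beforehand.
    result = model.load_state_dict(keep, strict=False)
    skipped = [k for k in ckpt_state if k not in keep]
    return sorted(set(skipped) | set(result.missing_keys))

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze (flag=False) or unfreeze (flag=True) all parameters of a module."""
    for p in module.parameters():
        p.requires_grad_(flag)

# Toy stand-ins: same volume head, rotation head changes from 3*32 to 64 outputs.
old = nn.ModuleDict({"volume_head": nn.Linear(8, 4), "rotation_out": nn.Linear(8, 3 * 32)})
new = nn.ModuleDict({"volume_head": nn.Linear(8, 4), "rotation_out": nn.Linear(8, 64)})
reinit = load_matching(new, old.state_dict())
set_trainable(new["volume_head"], False)   # freeze for the warm-up, unfreeze later
```

Logging the returned list at startup makes it easy to confirm that only the head tensors were re-initialized and that the volume head really did load.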

Run name suggestion: `da3_dino_kv_full_izzy2_pca_t5L_v1` so it's clear in the wandb log what's different.

### Validation criteria

Before declaring done:
1. **PCA sanity**: PC1 EV ratio reported in wandb at startup, ≥ 0.85.
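2. **Loss**: rotation CE goes down faster than the v7 baseline (it's now a single 1D objective instead of three per-axis objectives that can conflict). If you change `N_ROT_BINS`, raw CE values are not directly comparable to v7, since chance-level CE is `ln(n_bins)`; compare each curve against its own chance level.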
3. **Joint-top1**: pixel + height + grip + rot top-1 hold-out accuracy ≥ v7 numbers.
4. **Per-timestep gripper trajectory plot** on a validation episode: should be monotonic-ish (open → closed during grasp, closed → open during release), not oscillating between bins.
5. **Decoded rotation overlaid on an episode**: should hold a steady value through approach + grasp, only varying during pre-grasp orientation alignment. Compare visually to v7.

When all five pass, write to outbox.md with: PCA EV ratio, final val numbers, comparison table vs v7, training cmd, ckpt path. Coordinate with mac (via mac/inbox.md) to push the new ckpt + updated `_stats.json` (including the PCA basis) to the Mac so deploy_dino_kv.py picks up the change.

### Operating instructions

Use best judgement on everything not pinned here. The PCA sanity gate (EV ≥ 0.85) is the one thing where you should pause and tell me if it fails, since it invalidates the whole change. Everything else: proceed and report.

— manager
