
---
## 2026-05-22 (sideview) — from backbones: home_towel_SIDEVIEW kmeans ckpt — please copy

Same arch as the previous home_towel kmeans8 ckpt, finetuned on a new sideview dataset
(`home_towel_sideview` — 6 episodes, 320 samples, side-camera view of the same task).
Resume base was the `home_towel_kmeans8` ckpt (warm start from k-means trained on the
overhead view).

Currently at ep 74/200 (training in background, latest.pth refreshes every 10 epochs).
**Re-pull latest.pth after the run completes** (~17 more min) for the final version.

### Files to copy

- **Ckpt** (~117 MB):
  `/data/cameron/para/libero/checkpoints/dino_query_home_towel_sideview_kmeans8_v0/latest.pth`
- **Optional, also embedded in ckpt**:
  `/data/cameron/para/libero/rotation_kmeans_basis_home_towel_sideview_n8.npz`
- **Model file** (only re-copy if you don't already have the k-means-aware version from
  the previous home_towel inbox message): `/data/cameron/para/libero/model_dino_volume_query.py`

### Centroid eulers (new — sideview rotations)

| bin | euler (xyz) | n_samples |
|---|---|---|
| 0 | `[-2.31, -1.37, -1.30]` | 22 |
| 1 | `[-2.18, -1.39, -1.03]` | 46 |
| 2 | `[-1.31, -1.54, -1.94]` | 38 |
| 3 | `[+2.52, -1.37, +0.30]` | 25 |
| 4 | `[-2.31, -1.53, -0.70]` | 44 |
| 5 | `[+2.86, -1.35, +0.21]` | 47 |
| 6 | `[-2.08, -1.44, -1.34]` | 51 |
| 7 | `[+1.65, -1.47, +1.26]` | 47 |

These rotations differ from the overhead-view k-means basis: the side camera
captures different gripper orientations, so the cluster centroids shift. Bin 6 is the
most populated (51/320 ≈ 16%), probably the dominant pose at the start of grasps in
the sideview perspective.

### Deploy — same as previous kmeans ckpt

- Same model constructor: `DinoVolumeQuery(..., rotation_mode='kmeans', kmeans_n_clusters=8)`
- Same decode: `euler = sd["kmeans_centroids_euler"][rot_bin]`
- The ckpt embeds `kmeans_centroids_quat/euler/bin_counts/n_clusters`

### Recommended

This is the right ckpt for **sideview-deploy** of home_towel-style tasks. Keep the prior
overhead `dino_query_home_towel_kmeans8_v0` ckpt around for overhead-cam deploy. Pick
whichever matches the deploy camera angle.
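
One way to make the "pick whichever matches the camera" rule explicit in a deploy script is a small lookup. This is only a minimal sketch: `CKPT_BY_VIEW`, `select_ckpt`, and the `ckpt_root` argument are hypothetical names, and only the two run directory names come from these messages.

```python
from pathlib import PurePosixPath

# Run directory names from the inbox messages; the view keys are illustrative.
CKPT_BY_VIEW = {
    "overhead": "dino_query_home_towel_kmeans8_v0",
    "sideview": "dino_query_home_towel_sideview_kmeans8_v0",
}

def select_ckpt(view: str, ckpt_root: str) -> str:
    """Return the latest.pth path for a camera view (hypothetical helper)."""
    if view not in CKPT_BY_VIEW:
        raise ValueError(f"unknown view {view!r}; expected one of {sorted(CKPT_BY_VIEW)}")
    return str(PurePosixPath(ckpt_root) / CKPT_BY_VIEW[view] / "latest.pth")
```

Failing loudly on an unknown view avoids silently loading the wrong centroid table.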

— backbones

---
## 2026-05-22 (latest) — from backbones: k-means rotation home_towel ckpt + inference code

**Cameron's insight:** the joint PCA on multi-camera datasets was producing OOD rotations
(joint-mean sits *between* the two datasets' rotation clusters, with interpolated bins
along PC1 that no training sample actually occupies). Fix: cluster training rotations
into N discrete prototypes — every prediction is *exactly* one of N observed rotations,
zero possibility of OOD by construction.

Trained on `home_towel` only (5 episodes, 231 frames, 226 samples), N=8 k-means
centroids in canonical-quaternion space (force `w ≥ 0` to kill antipodal ambiguity).
Final train fit: **train_pix=7.05 px, train_rot_acc=0.92, train_grip_acc=0.84**.
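
The `w ≥ 0` canonicalisation works because `q` and `-q` encode the same rotation, so flipping the sign puts both copies in one hemisphere before clustering. A minimal pure-Python sketch, assuming xyzw ordering; `canonicalize_quat` and `nearest_centroid` are hypothetical helpers, not functions from the training script, and the actual k-means fit may differ in detail.

```python
import math

def canonicalize_quat(q):
    """Normalise q = (x, y, z, w) and flip its sign so that w >= 0."""
    norm = math.sqrt(sum(c * c for c in q))
    q = [c / norm for c in q]
    return [-c for c in q] if q[3] < 0 else q

def nearest_centroid(q, centroids):
    """Index of the centroid closest to q (Euclidean, after canonicalisation)."""
    q = canonicalize_quat(q)
    dists = [sum((a - b) ** 2 for a, b in zip(q, c)) for c in centroids]
    return dists.index(min(dists))
```

Quaternions with `w` exactly 0 remain ambiguous under this rule; the sketch leaves them unchanged.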

### 1. Files to copy

- **Ckpt** (~117 MB):
  `/data/cameron/para/libero/checkpoints/dino_query_home_towel_kmeans8_v0/latest.pth`
- **Updated model file** (k-means mode added):
  `/data/cameron/para/libero/model_dino_volume_query.py`
- **Optional, also embedded in ckpt**:
  `/data/cameron/para/libero/rotation_kmeans_basis_home_towel_n8.npz`

### 2. Embedded ckpt fields (new k-means-specific)

```python
sd["rotation_mode"]            # "kmeans"
sd["kmeans_n_clusters"]        # 8
sd["kmeans_centroids_quat"]    # (8, 4) — quat centroids, sign-canonical
sd["kmeans_centroids_euler"]   # (8, 3) — euler XYZ at each centroid
sd["kmeans_bin_counts"]        # (8,)   — training sample count per bin
                               # for home_towel: [23, 38, 21, 29, 24, 46, 17, 33]
# unchanged from before:
sd["min_height"]   # 0.032
sd["max_height"]   # 0.181
sd["min_grip"]     # ...
sd["max_grip"]     # ...
sd["args"]["use_eef"]  # 1
sd["args"]["n_window"] # 50
```
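
Before constructing the model it may be worth checking that the embedded k-means fields agree with each other. A minimal sketch, assuming `sd` is the dict returned by `torch.load` and that the centroid fields support `len()`; `check_kmeans_fields` is a hypothetical helper.

```python
def check_kmeans_fields(sd):
    """Raise ValueError if the k-means fields of a loaded ckpt dict are inconsistent."""
    if sd.get("rotation_mode") != "kmeans":
        raise ValueError(f"expected rotation_mode 'kmeans', got {sd.get('rotation_mode')!r}")
    k = int(sd["kmeans_n_clusters"])
    for key in ("kmeans_centroids_quat", "kmeans_centroids_euler", "kmeans_bin_counts"):
        if len(sd[key]) != k:
            raise ValueError(f"{key} has {len(sd[key])} rows, expected {k}")
    return k
```

The returned `k` can be passed straight to `kmeans_n_clusters` in the constructor.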

### 3. Model construction (new flag)

```python
from model_dino_volume_query import DinoVolumeQuery
model = DinoVolumeQuery(
    n_window=50, n_height_bins=32, n_gripper_bins=32, n_rot_bins=32,
    image_size=504, pred_size=56,
    use_eef=True,
    rotation_mode='kmeans',                  # ← NEW
    kmeans_n_clusters=int(sd["kmeans_n_clusters"]),   # ← NEW (must match ckpt)
)
model.load_state_dict(sd["model_state_dict"], strict=False)
```

Output shape change: `out["rotation_logits"]` is now **`(B, T, n_clusters)`** instead of
`(B, T, n_rot_bins)` — the K classes are the centroid indices.

### 4. Inference decode — even simpler than PCA

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def decode_rotation_kmeans(rot_bin, sd):
    """rot_bin: int in [0, K). Returns (3,) Euler in XYZ convention."""
    return np.asarray(sd["kmeans_centroids_euler"])[rot_bin]
    # Or if you prefer quats:
    # return np.asarray(sd["kmeans_centroids_quat"])[rot_bin]   # (4,) xyzw

# In deploy loop:
rot_bin = out["rotation_logits"].argmax(dim=-1)    # (B, T)
for t in range(T):
    euler = decode_rotation_kmeans(int(rot_bin[0, t]), sd)
    # Use euler in the rest of your OSC delta pipeline
```

That's it. No PCA basis math, no projection ranges — just look up the centroid by index.

### 5. K-means bin counts (Cameron specifically asked to surface these)

The training data's rotation distribution per centroid:

| bin | euler (xyz) | n_samples |
|---|---|---|
| 0 | `[-1.76, -1.36, -1.51]` | 23 |
| 1 | `[+0.94, -1.49, +2.10]` | 38 |
| 2 | `[-2.10, -1.13, -0.86]` | 21 |
| 3 | `[-2.32, -1.45, -0.68]` | 29 |
| 4 | `[+0.42, -1.37, +2.72]` | 24 |
| 5 | `[-0.51, -1.44, -2.52]` | 46 |
| 6 | `[-1.52, -1.26, -1.55]` | 17 |
| 7 | `[-2.16, -1.49, -1.16]` | 33 |

Bin 5 is the most populated (46/231 samples ≈ 20%) — probably the "neutral grasp/hold"
pose. Bin 6 is the least populated (17 samples ≈ 7%). The bin-count distribution gives
you a sense of which rotations the model has seen most/least often during training; if
you ever see deploy snap to a low-count bin, that's a hint the model is reaching for a
rare-but-observed prototype.
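
The percentages above can be reproduced, and low-count bins flagged, with a few lines. A minimal sketch: the counts are copied from the table above, while `bin_fractions`, `rare_bins`, and the 10% threshold are illustrative choices, not anything the training code defines.

```python
def bin_fractions(counts):
    """Return each bin's share of the training samples."""
    total = sum(counts)
    return [c / total for c in counts]

def rare_bins(counts, threshold=0.10):
    """Indices of bins whose share is below `threshold` (an illustrative cut-off)."""
    return [i for i, f in enumerate(bin_fractions(counts)) if f < threshold]

home_towel_counts = [23, 38, 21, 29, 24, 46, 17, 33]  # from the table above
fracs = bin_fractions(home_towel_counts)
```

At deploy time, a predicted `rot_bin` that falls in `rare_bins(...)` could be logged for later inspection.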

### 6. Why this is better than the previous joint umi+robot ckpt

Previous joint PCA ckpt had train_pix=7.67 but rotation could land on interpolated
PC1 values that don't appear in training. K-means home_towel ckpt has train_pix=7.05
AND every rotation output is guaranteed to be one of the 8 training prototypes. The
inference rotation behaviour should be much more stable.

### 7. Previous ckpts

The joint umi+robot ckpt (`dino_query_umi_plus_robot_izzy_towel_t50_pca1d_v0`) is now
stale for towel deploys; **prefer this kmeans8 ckpt**. Keep the old one around if you
want to compare.

— backbones

---
## 2026-05-22 (later) — from backbones: SUPERSEDES previous umi_izzy_towel — joint umi + robot Izzy Towel ckpt

**Update:** previous umi-only run was killed; Cameron added a sibling `robot_izzy_towel`
dataset (side-cam, 9 episodes) and asked me to train on both jointly. New run combines:
- `umi_collect_izzy_towel`     (19 eps, overhead UMI handheld, ArUco-tracked EEF)
- `robot_izzy_towel`           (9 eps, side-cam smith300 robot arm, MuJoCo FK)

Combined: 28 episodes, 2026 samples, joint PCA basis recomputed (PC1 EV = 81%, principal
axis = `[-0.24, -0.02, -0.97]` — mostly negative yaw + some roll). Joint height range
`[-0.004, 0.288]`m covers both cameras.

### Files to copy (replaces the previous umi-only paths)

- **Ckpt**:  `/data/cameron/para/libero/checkpoints/dino_query_umi_plus_robot_izzy_towel_t50_pca1d_v0/latest.pth` (~117 MB)
- **PCA basis** (also embedded in ckpt):
  `/data/cameron/para/libero/rotation_pca_basis_umi_plus_robot_izzy_towel.npz`

Currently at ep 15/200, latest.pth saved at ep 9. **Re-pull after ~90 more min** when
the run completes for the final ckpt.

### Embedded ckpt fields

```python
sd["min_height"]   # -0.004
sd["max_height"]   # +0.288
sd["min_grip"]     # depends on dataset; both UMI + robot grippers normalised by joint range
sd["max_grip"]
sd["rot_pca_mean"] # [+0.502, -1.252, +1.368]
sd["rot_pca_axis"] # [-0.237, -0.021, -0.971]  (PC1 EV = 81%)
sd["rot_pca_min"]  # -1.70
sd["rot_pca_max"]  # +4.58
sd["n_rot_bins"]   # 32
sd["args"]["use_eef"]  # 1
sd["args"]["n_window"] # 50
```

### Deploy

Same arch (`DinoVolumeQuery(use_eef=True, rotation_mode='1d_pca')`) — no code changes.
At inference, the model uses the input image's DINO features to disambiguate which
"scene type" it's in (overhead vs side-cam), and the per-step query is conditioned on
the EEF pixel for the current scene. So you should just provide:
- The RGB image from whichever camera you're deploying with
- The current EEF pixel projected into that image

The 28-episode joint training gives broader rotation coverage than either dataset alone —
should be safer for deploy when the robot is doing towel folding regardless of viewpoint.

### Previous umi-only ckpt status

The earlier `dino_query_umi_izzy_towel_t50_pca1d_v0/latest.pth` is now stale; **prefer
this new joint ckpt** for towel deploys.

— backbones

---
## 2026-05-22 — from backbones: NEW umi_izzy_towel ckpt + PCA basis — please copy

**Context.** Cameron annotated a new UMI handheld-gripper dataset (`umi_collect_izzy_towel`,
19 episodes, towel folding task on overhead workspace). I patched the dataset loader to
prefer `joints['eef_pos']/['eef_quat']` for UMI/ArUco-tracked data (smith300 still uses
MuJoCo FK), verified the projection is plausible (~10-15 px offset from gripper centroid
visible in some panels — small calibration drift but trajectory direction is correct),
and finetuned the query-MLP from the desk_collect_1 ckpt.

Currently at ep 99/200 (still training in background, will overwrite latest.pth every 10
epochs). Mid-training fit at ep 99: train losses g=0.38, r=0.42, v=0.71. Val: val_pix=8.7px,
val_grip_acc=0.82, val_rot_acc=0.77.

### Files to copy

- **Ckpt**: `/data/cameron/para/libero/checkpoints/dino_query_umi_izzy_towel_t50_pca1d_v0/latest.pth` (~117 MB)
- **PCA basis** (also embedded in ckpt — pick whichever path your deploy reads):
  `/data/cameron/para/libero/rotation_pca_basis_umi_izzy_towel.npz`

Re-pull `latest.pth` after training fully completes (~100 more epochs, from ep 99 to 200) for the best ckpt.

### Embedded ckpt fields

```python
sd["min_height"]   # ≈ -0.004
sd["max_height"]   # ≈ +0.288 (~29 cm workspace height range)
sd["min_grip"]     # ≈ -0.088 (gripper closed)
sd["max_grip"]     # ≈ +1.806 (gripper open)
sd["rot_pca_mean"] # [+0.62, -1.23, +1.60]
sd["rot_pca_axis"] # [+0.076, +0.017, +0.997] — essentially pure yaw (PC1 EV = 91.3%)
sd["rot_pca_min"]  # -4.80
sd["rot_pca_max"]  # +1.51
sd["n_rot_bins"]   # 32
sd["args"]["use_eef"]  # 1
sd["args"]["n_window"] # 50
```

### Deploy

Same arch as the previous query-MLP 1D-PCA ckpts — **no model code changes needed**.
`DinoVolumeQuery(n_window=50, ..., use_eef=True, rotation_mode='1d_pca')` + load. Decode
helper from prior inbox applies unchanged.

### Note for camera setup on Mac

This dataset uses ArUco-tracked EEF (handheld UMI gripper, motor_ids=[0]) under an overhead
camera — different camera geometry from the izzy/desk side-cam datasets. The K matrix is in
the dataset's `meta.json`. If you're deploying with the same camera setup as the training,
you're set. If switching to a different camera, you'd need to re-extract `T_camera_arucoBase`
+ `T_W_baseBody_inv_aruco_offset` for the new mounting.

— backbones

---
## 2026-05-21 — from backbones: NEW desk_collect_1 ckpt — please copy

**Context.** Cameron annotated a new dataset (`desk_collect_1` — desk-mounted setup, 594
frames in 11 episodes) and asked me to finetune the query-MLP 1D-PCA model. Same arch,
same hparams, same training script — just a new dataset + a new PCA basis.

Resume base was the izzy3 ckpt (`dino_query_izzy3_t50_pca1d_v0/latest.pth`). Final train
fit is *better* than the izzy3 source:

| | train_pix | train v | train g | train r |
|---|---|---|---|---|
| izzy3 (resume base) | 7.08 px | 0.90 | 0.46 | 0.66 |
| **desk_collect_1**  | **6.92 px** | 0.48 | 0.29 | **0.10** |

### Files to copy

- **Ckpt**: `/data/cameron/para/libero/checkpoints/dino_query_desk_collect_1_t50_pca1d_v0/latest.pth` (~117 MB)
- **PCA basis** (separately, if your deploy reads it from disk): `/data/cameron/para/libero/rotation_pca_basis_desk_collect_1.npz`
  (You can ignore this since the basis is also embedded in the ckpt under
  `sd["rot_pca_mean"]`, `sd["rot_pca_axis"]`, `sd["rot_pca_min"]`, `sd["rot_pca_max"]`.)

### Deploy

Same as the previous query-MLP 1D-PCA ckpt — **no model code changes needed**. Construct
`DinoVolumeQuery(n_window=50, ..., use_eef=True, rotation_mode='1d_pca')` and load the new
state dict. The 1D PCA decode helper from the prior inbox message applies unchanged.

### Embedded ckpt fields (sanity check)

```python
sd["min_height"]   # ≈ 0.039 (desk_collect_1 range, padded 5%)
sd["max_height"]   # ≈ 0.187
sd["min_grip"]     # check from ckpt
sd["max_grip"]     # check from ckpt
sd["rot_pca_mean"] # (3,): [+1.51, -1.42, +1.31] for desk_collect_1
sd["rot_pca_axis"] # (3,): [+0.75, +0.01, +0.66] (unit-vector components ≈ 0.75 roll, 0.66 yaw; different from izzy3's [+0.85, +0.02, -0.52])
sd["rot_pca_min"]  # ≈ -4.55
sd["rot_pca_max"]  # ≈ +0.63
sd["n_rot_bins"]   # 32
sd["args"]["use_eef"]  # 1
sd["args"]["n_window"] # 50
```

**Recommended default**: replace the izzy3 ckpt with this one for desk deploys; keep
izzy3 around if you want to compare cross-dataset behaviour.

— backbones

---
## 2026-05-20 (later) — from backbones: ALSO copy the per-pixel model — `--per_pix` deploy flag

**Context.** Cameron wants both the query-MLP (from the inbox message right below) AND the
per-pixel MLP available on Mac, selectable by a CLI flag. Per-pixel won on train fit:

| Run | train_pix (argmax) | train v | train g | train r |
|---|---|---|---|---|
| query EEF+CLS | 9.29 px | 0.86 | 0.40 | 0.81 |
| **per-pixel**     | **6.59 px** | 0.22 | 0.28 | 0.98 |

Per-pixel has lower train pix err and crushes the volume CE (0.22 vs 0.86, ~4× lower) — its
dense per-pixel volume head has more capacity than the rank-1 query dot product. Slightly
worse rotation due to the diagonal indexing scheme. **Both are worth deploying side-by-side
for closed-loop comparison.**

### 1. Additional files to copy to Mac

- **Model**: `/data/cameron/para/libero/model_dino_per_pixel.py`
- **Ckpt**:  `/data/cameron/para/libero/checkpoints/dino_per_pixel_izzy3_t50_v0/latest.pth` (117 MB)

Same DINO trunk as the query model, same izzy3 training data, same per-axis euler ranges in
the ckpt fields, same height/grip ranges.

### 2. Deploy script changes — add `--per_pix` flag

```python
# In your deploy CLI:
parser.add_argument("--per_pix", action="store_true",
                    help="Use per-pixel MLP model (model_dino_per_pixel.DinoPerPixelMLP) "
                         "instead of the query-MLP (DinoVolumeQuery) default")

# Loading branch:
if args.per_pix:
    from model_dino_per_pixel import DinoPerPixelMLP
    model = DinoPerPixelMLP(
        n_window=50, n_height_bins=32, n_gripper_bins=32, n_rot_bins=32,
        image_size=504, pred_size=56,
    )
    ckpt_path = "<mac local>/checkpoints/dino_per_pixel_izzy3_t50_v0/latest.pth"
else:
    from model_dino_volume_query import DinoVolumeQuery
    model = DinoVolumeQuery(n_window=50, ..., use_eef=True)
    ckpt_path = "<mac local>/checkpoints/dino_query_izzy3_t50_per_axis_v0/latest.pth"
```

### 3. Per-pixel inference interface (different from query model!)

The per-pixel model takes `query_pixels` (per-step pixel grid coords) instead of `start_pix`:

```python
out = model(rgb, query_pixels=query_pixels)
# query_pixels: (B, T=50, 2) of (y_grid, x_grid) — in pred_size=56 coords, NOT image coords
```

The query_pixels must be supplied per timestep — these are "where I think I'll be at step t".
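
If a pixel is only available in 504-image coordinates, it has to be converted to the 56-grid before being passed as `query_pixels`. A minimal sketch, assuming a plain floor-and-clamp mapping (whether the training code floors or rounds is not stated here); `image_to_grid` is a hypothetical helper.

```python
def image_to_grid(u, v, image_size=504, pred_size=56):
    """Map an (u, v) pixel in image coords to an integer (y_grid, x_grid) cell."""
    x_grid = min(max(int(u * pred_size / image_size), 0), pred_size - 1)
    y_grid = min(max(int(v * pred_size / image_size), 0), pred_size - 1)
    # Note the (y, x) ordering expected by query_pixels.
    return y_grid, x_grid
```

The return order is `(y_grid, x_grid)`, matching the layout described above, while the inputs follow the usual `(u, v)` image convention.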

**Two-pass deploy pattern** (similar to v9 teacher-forced inference):

```python
# Pass 1: get the spatial argmax to determine query pixels per timestep
out = model(rgb, query_pixels=None)         # returns volume_logits only
vol = out["volume_logits"]                   # (B, T, Z, H, W)
B, T, Z, H, W = vol.shape
flat = vol.reshape(B, T, -1).argmax(dim=-1)
yx = flat % (H * W)
y_grid = (yx // W).long().clamp(0, H - 1)
x_grid = (yx %  W).long().clamp(0, W - 1)
query_pixels = torch.stack([y_grid, x_grid], dim=-1)   # (B, T, 2)

# Pass 2: re-forward with the argmax query pixels to get grip/rot
out2 = model(rgb, query_pixels=query_pixels)
grip_logits = out2["gripper_logits"]         # (B, T, 32)
rot_logits  = out2["rotation_logits"]        # (B, T, 3, 32)
```

Slightly more expensive (two forward passes) but mirrors the v9 deploy pattern: pass 1 picks
the spatial location, pass 2 reads off the gripper/rotation at those locations.

(Optimisation if it matters: cache the trunk + per-pixel MLP outputs in pass 1 by skipping
the `query_pixels` path; the heavy compute is the DINO backbone which would be ~25% of total
cost. If two-pass is a problem, ping me and I'll add a `return_penult` flag.)

### 4. Rotation decode — same as query model (per-axis euler)

The `decode_rotation` function from the previous inbox message works identically — both
models use 3-axis 32-bin euler now.

### 5. Recommended default

For the first deploy session, **start with per-pixel** by passing `--per_pix`; without the
flag the script should load the query model. Cameron wants to compare both in closed-loop,
so make sure both work cleanly.

— backbones

---
## 2026-05-20 (late) — from backbones: query-MLP model on izzy3 — please wire into deploy

**Context.** New arch line, supersedes the v9 pca_t5L_v1 ckpt for deploy. Cameron wants this
running on Mac for inference.

### What changed vs v9 ckpt

| Piece | v9 (pca_t5L_v1, izzy2) | v10 (query, izzy3) — THIS ONE |
|---|---|---|
| Head | 5-layer transformer over 8 future keypoint tokens, with sampled F + sin(z) + sin(t) | **5-layer AdaLN-Zero MLP fed (eef_feat ⊕ cls), one query per timestep** |
| Spatial scoring | Factored bilinear `F · sin/cos_key` | **Dot product of per-step query against `F ⊕ Linear(sin_z)` (factored, no volume tensor materialised)** |
| Rotation | 1D PCA, 48 bins | **per-axis euler, 3 × 32 bins** — PCA dropped (izzy3 has true 2-axis rotation, PC1 EV only 0.54) |
| Horizon | T=8 | **T=50** with last-frame padding |
| Training data | izzy_home_recording_2 (97 samples) | **izzy_home_recording_3** (687 samples, much denser timesteps) |

Final train fit: train_pix=9.3 px, train_v=1.07, train_g=0.47, train_r=1.01 (mean of 3-axis CE).

### 1. New checkpoint to copy to Mac

**Server path**: `/data/cameron/para/libero/checkpoints/dino_query_izzy3_t50_per_axis_v0/latest.pth` (117 MB)
**Replaces** v9 ckpt as the default deploy ckpt.

The ckpt embeds (read these from `sd[...]`):
```python
sd["min_height"], sd["max_height"]   # 0.0288, 0.2179 (izzy3 range, padded 5%)
sd["min_grip"],   sd["max_grip"]     # 0.0033, 1.4156
sd["min_rot"], sd["max_rot"]         # both (3,) — per-axis euler ranges
sd["n_rot_bins"]                     # 32
sd["args"]["use_eef"]                # 1 — confirms EEF+CLS mode (vs cls_only ablation)
sd["args"]["n_window"]               # 50
```
No PCA basis fields anymore.

### 2. New model file to copy

**Server path**: `/data/cameron/para/libero/model_dino_volume_query.py`

Drop-in replacement for the v9 `model_dino_volume_kv_full.py`. Public interface:
```python
m = DinoVolumeQuery(
    n_window=50, n_height_bins=32, n_gripper_bins=32, n_rot_bins=32,
    image_size=504, pred_size=56,
    use_eef=True,                    # IMPORTANT — must match training mode
)
m.load_state_dict(sd["model_state_dict"], strict=False)   # strict=False ignores any benign extras

out = m(rgb, start_pix=start_pix)    # rgb: (B,3,504,504); start_pix: (B,2) in 504-image coords
# out["volume_logits"]:    (B, T=50, Z=32, H=56, W=56)
# out["gripper_logits"]:   (B, 50, 32)
# out["rotation_logits"]:  (B, 50, 3, 32)   ← 3-axis euler now (NOT 1D PCA)
# out["pixel_feats"]:      (B, 32, 56, 56)
```

### 3. Decode rotation bins → Euler (replaces 1D PCA decode)

```python
import numpy as np
def decode_rotation(rot_bins_3, sd):
    """rot_bins_3: (3,) int — argmax bins per axis.
       Returns (3,) Euler angles in same convention as training (XYZ Euler from quaternion)."""
    lo = np.asarray(sd["min_rot"], dtype=np.float32)
    hi = np.asarray(sd["max_rot"], dtype=np.float32)
    n  = sd["n_rot_bins"]
    norm = (rot_bins_3.astype(np.float32) + 0.5) / n        # bin center, in [0, 1)
    return lo + norm * (hi - lo)

# In your deploy loop:
rot_argmax = out["rotation_logits"].argmax(dim=-1)   # (B, T, 3)
# Per timestep euler:
eulers = decode_rotation(rot_argmax[0, t].cpu().numpy(), sd)
```

### 4. The crucial new piece — `start_pix` input

Unlike v9 (which only needed `rgb` + ran teacher-forced keypoints at inference), this model
needs the **current EEF pixel** at each forward call. Compute it the same way the dataset
does: project the current EEF world position through the camera, get (u, v) in IMG_SIZE=504
coordinates.

If you already do this for the keypoint overlay viz, that same (u, v) tuple is what to pass.
The model uses it to sample the feature at that pixel as part of the global query — i.e., it's
the "where am I now" signal that conditions the whole 50-step trajectory prediction.

At inference:
```python
start_pix = torch.tensor([[eef_u_504, eef_v_504]], device=device, dtype=torch.float32)  # (1, 2)
out = m(rgb, start_pix=start_pix)
```
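
For reference, the projection step might look like the following. This is a sketch under stated assumptions: an undistorted pinhole camera, intrinsics given at the camera's native resolution, a known world-to-camera transform, and a plain resize to 504×504. The dataset's actual projection (cropping, distortion handling) may differ, and `project_eef_to_504` and all of its parameters are hypothetical names.

```python
def project_eef_to_504(p_world, R_cw, t_cw, fx, fy, cx, cy,
                       native_w, native_h, img_size=504):
    """Project a 3D world point to (u, v) in img_size x img_size coordinates.

    R_cw (3x3 nested lists) and t_cw (3,) map world coords into the camera frame.
    """
    # World -> camera frame: p_cam = R_cw @ p_world + t_cw
    p_cam = [sum(R_cw[i][j] * p_world[j] for j in range(3)) + t_cw[i]
             for i in range(3)]
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective divide, then intrinsics, at native resolution
    u = fx * p_cam[0] / p_cam[2] + cx
    v = fy * p_cam[1] / p_cam[2] + cy
    # Rescale to the model's input resolution
    return u * img_size / native_w, v * img_size / native_h
```

The resulting `(u, v)` pair is what would go into `start_pix`.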

### 5. Spatial argmax decode (unchanged from v9)

```python
B, T, Z, H, W = out["volume_logits"].shape
flat = out["volume_logits"].reshape(B, T, -1).argmax(dim=-1)   # (B, T)
z_pred = flat // (H * W)
yx     = flat %  (H * W)
y_pred = (yx // W).float() * (IMG_SIZE / H)   # back to 504-space
x_pred = (yx %  W).float() * (IMG_SIZE / W)
# Then 3D recovery via camera intrinsics + height bin center, same as v9.
```
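
For readers without the v9 code at hand, the 3D recovery step can be thought of as a ray-plane intersection: back-project the pixel into a ray and intersect it with the horizontal plane at the decoded height. The sketch below assumes a pinhole camera, world Z pointing up, intrinsics expressed in the same pixel space as `(u, v)`, and a known camera pose. `recover_3d` and its parameters are hypothetical, and this is not necessarily how the existing recovery code is written.

```python
def recover_3d(u, v, height, fx, fy, cx, cy, R_wc, cam_pos_w):
    """Intersect the camera ray through pixel (u, v) with the plane world-z = height.

    R_wc (3x3 nested lists) rotates camera-frame vectors into the world frame;
    cam_pos_w is the camera centre in world coordinates.
    """
    # Ray direction in the camera frame (pinhole model, z forward)
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate the direction into the world frame
    d_w = [sum(R_wc[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    if abs(d_w[2]) < 1e-9:
        raise ValueError("ray is parallel to the height plane")
    # Solve cam_pos_w.z + s * d_w.z = height for the ray parameter s
    s = (height - cam_pos_w[2]) / d_w[2]
    if s <= 0:
        raise ValueError("intersection is behind the camera")
    return [cam_pos_w[i] + s * d_w[i] for i in range(3)]
```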

### 6. Caveat

Trained on izzy_home_recording_3 (a different home scene than izzy2). Should generalize but
no guarantees: if the scene looks very different on deploy, expect some quality loss. The
T=50 horizon also means that if you only want the first K=8 steps at deploy, you can just slice
the output. The model was trained with all 50 steps supervised (padded with the last real
frame), though, so later timesteps for short trajectories may converge to "freeze at end pose."

### 7. CLS-only ablation also available (skip unless requested)

There's a sibling ckpt at `dino_query_izzy3_t50_per_axis_cls_only_v0/latest.pth` that uses
`use_eef=False` (CLS token only, no EEF feature). Cameron wants to compare; the EEF+CLS one
is currently winning on train fit. Don't deploy this one — just keep around for ablation runs.

— backbones

---
## 2026-05-20 — from backbones: new pca_t5L_v1 ckpt + 1D PCA rotation decode

**Context.** Done with the "Model surgery" task — the gripper/rotation head got rebuilt today:
- Transformer head bumped 2 → 5 layers
- Learned z/t embeddings replaced with sinusoidal (matching parent `height_enc='sin'`)
- **Rotation collapsed from per-axis (3 × 32-bin) Euler to 1D PCA (single 48-bin)**

Trained 100 epochs on `izzy_home_recording_2` from v7. Final train losses: g≈0.02, r≈0.03. Cameron's eye-check mid-run: gripper bin strips are visibly more confident than v7. The 1D rotation decode is new — you'll need to update the deploy path.

### 1. New checkpoint to copy

**Server path**: `/data/cameron/para/libero/checkpoints/da3_dino_kv_full_izzy2_pca_t5L_v1/latest.pth` (369 MB)  
**Default-load this on Mac** (replaces the v7 path you were using before).

The ckpt now embeds (read these from `sd[...]`, no separate `_stats.json` needed):
```python
sd["rot_pca_mean"]   # (3,) float32
sd["rot_pca_axis"]   # (3,) float32 — PCA principal axis v1
sd["rot_pca_min"]    # float — -4.0267 for izzy2
sd["rot_pca_max"]    # float — +2.1456 for izzy2
sd["n_rot_bins"]     # int — 48
sd["min_height"], sd["max_height"]  # 0.0332, 0.1972
sd["min_grip"], sd["max_grip"]      # 0.0143, 1.5903
```

### 2. Updated model architecture (drop-in `model_dino_volume_kv_full.py`)

Diff vs the one you have:
- `TF_LAYERS = 5` (was 2)
- `N_ROT_BINS = 48` (was 32)
- `z_tok_emb`/`t_tok_emb` are gone — replaced with `z_sin`/`t_sin` buffers via `sinusoidal_features(n_height_bins, d_model)` and `sinusoidal_features(n_window, d_model)` (imported from `model_dino_volume_kv`)
- `rotation_out: nn.Linear(d_model, n_rot_bins)` — single head, NOT `Linear(d_model, 3 * n_rot_bins)`
- `out["rotation_logits"]` is now `(B, T, 48)`, NOT `(B, T, 3, 32)`

Just `scp` `/data/cameron/para/libero/model_dino_volume_kv_full.py` to the Mac and load it with the new ckpt. Note that `strict=False` is no longer enough, because the `rotation_out` shape changed; use the shape-aware filter from `train_dino_volume_kv_full.py:268-285` if you also keep the old ckpt around.
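
The training script's filter is not reproduced in this message, so the following is only a sketch of what a shape-aware filter generally looks like: keep checkpoint entries whose name and shape both match the model, and skip the rest. `filter_matching_shapes` is a hypothetical name; in practice the values would be torch tensors.

```python
def filter_matching_shapes(ckpt_state, model_state):
    """Split ckpt entries into those that fit the model and those to skip.

    Both arguments map parameter names to objects exposing `.shape`
    (torch tensors in practice). Returns (kept_dict, skipped_names).
    """
    kept, skipped = {}, []
    for name, value in ckpt_state.items():
        target = model_state.get(name)
        if target is not None and tuple(target.shape) == tuple(value.shape):
            kept[name] = value
        else:
            skipped.append(name)
    return kept, skipped
```

Typical use would be `kept, skipped = filter_matching_shapes(sd["model_state_dict"], model.state_dict())` followed by `model.load_state_dict(kept, strict=False)`, printing `skipped` to confirm that only the expected heads were dropped.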

### 3. Decode helper (rotation bin → Euler) — NEEDS to be added to deploy

```python
import numpy as np
def decode_rotation(rot_bin: int, sd: dict) -> np.ndarray:
    """rot_bin: int in [0, sd['n_rot_bins']). Returns (3,) Euler in same convention as training."""
    pca_min, pca_max = sd["rot_pca_min"], sd["rot_pca_max"]
    n = sd["n_rot_bins"]
    pca_val = pca_min + (rot_bin + 0.5) / n * (pca_max - pca_min)
    return sd["rot_pca_mean"] + pca_val * sd["rot_pca_axis"]   # (3,)
```

For batched inference (argmax over the 48 bins):
```python
rot_bin = out["rotation_logits"].argmax(dim=-1)   # (B, T) — was (B, T, 3) before
euler   = np.stack([decode_rotation(int(b), sd) for b in rot_bin.flatten().cpu().numpy()])
```
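
To sanity-check the decode, the inverse mapping (Euler to bin) can be sketched and round-tripped. This assumes the PCA axis has unit norm and that training binned the projected value uniformly over `[rot_pca_min, rot_pca_max]`; `encode_rotation` and `decode_bin_center` are hypothetical helpers written on plain lists, not the training code.

```python
def encode_rotation(euler, mean, axis, pca_min, pca_max, n_bins):
    """Project a (3,) Euler vector onto the PCA axis and return its bin index."""
    # Scalar coordinate along the (assumed unit-norm) principal axis
    pca_val = sum((e - m) * a for e, m, a in zip(euler, mean, axis))
    # Normalise into [0, 1], then bucket; clamp so pca_max lands in the last bin
    frac = (pca_val - pca_min) / (pca_max - pca_min)
    return min(max(int(frac * n_bins), 0), n_bins - 1)

def decode_bin_center(rot_bin, mean, axis, pca_min, pca_max, n_bins):
    """Bin-centre decode, same formula as decode_rotation but on plain lists."""
    pca_val = pca_min + (rot_bin + 0.5) / n_bins * (pca_max - pca_min)
    return [m + pca_val * a for m, a in zip(mean, axis)]
```

Decoding a bin and re-encoding the result should return the same index for every bin; a mismatch would suggest that the range or axis being used differs from the one in the ckpt.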

### 4. Caveat

PC1 explained-variance ratio was **0.81**, below the instructed gate of 0.85, but Cameron accepted it. The ~19% residual variance lives almost entirely on PC2 ≈ roll. For motions in `izzy_home_recording_2` that's fine because v1 is essentially pure yaw and the wrist barely rolls. **If Cameron records a session with significant roll later, the 1D decode will saturate on roll** and we may need to revisit (per-axis Euler or 2D PCA). Worth flagging if deploy looks off on a new dataset.

— backbones

---
## 2026-05-19 — from backbones: copy new dino_kv volume model to Mac + update deploy script

**Context.** PARA architecture has shifted to a new "dino_kv volume" formulation. Cameron
wants this on the Mac so he can run inference + extend the existing deploy script. There's
also a freshly-finetuned checkpoint specialised to his home environment (`izzy_home_first_record`
episode collected today) that should be the default-loaded model on the Mac.

### 1. Architecture summary

DINOv3 ViT-S/16+ backbone + a factored "KV-volume" head that produces a 4D output volume
`(T=8, Z=32, H=56, W=56)` of joint logits per future timestep.

- Per-pixel features `F ∈ R^(48, 56, 56)` from DINO patches → small refine conv → 48-d.
- Sinusoidal positional encoding of timestep (`t_emb`) and height bin (`h_emb`), each 48-d.
- Key per (t, z) = `t_emb[t] + h_emb[z]`, then **L2-normalise** both `F` and keys → cosine.
- Volume logits = `einsum("bchw, tzc -> btzhw", F_unit, keys_unit) * exp(logit_scale)`.
- Argmax over (Z, H, W) per timestep → (z*, y*, x*) voxel → 3D recovery via camera intrinsics
  + height-bin-center.
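
The scoring step in the list above can be sketched in NumPy to make the shapes concrete. This is an illustration only: the real head lives in `model_dino_volume_kv.py` and runs in PyTorch, random arrays stand in for the DINO features and sinusoidal embeddings, and `cosine_volume_logits` is a hypothetical name.

```python
import numpy as np

def cosine_volume_logits(F, t_emb, h_emb, logit_scale=0.0):
    """F: (B, C, H, W); t_emb: (T, C); h_emb: (Z, C). Returns (B, T, Z, H, W)."""
    # One key per (timestep, height bin) pair
    keys = t_emb[:, None, :] + h_emb[None, :, :]                     # (T, Z, C)
    # L2-normalise keys and per-pixel features so the dot product is a cosine
    keys_unit = keys / np.linalg.norm(keys, axis=-1, keepdims=True)
    F_unit = F / np.linalg.norm(F, axis=1, keepdims=True)
    return np.einsum("bchw,tzc->btzhw", F_unit, keys_unit) * np.exp(logit_scale)
```

With `logit_scale=0` every logit is a cosine in `[-1, 1]`; the learned scale sharpens the softmax over the `(Z, H, W)` cells. The sketch does not guard against zero-norm features.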

Plus on top of that, this finetune also adds:
- **Gripper MLP** (D=48 → 128 → 32 bins, CE) — predicts discretised q_motors[6].
- **Rotation MLP** (D=48 → 128 → 3 × 32 bins, CE per Euler axis) — predicts EEF orientation.

Both MLPs index `pixel_feats` at the GT pixel during training (teacher forcing), at the
volume-argmax pixel during inference.

### 2. Files to copy to the Mac

All under `/data/cameron/para/libero/` (lab server) → put under
`/Users/cameronsmith/Projects/robotics_testing/<some-subdir-you-pick>/` on the Mac. Suggest
mirroring as `~/Projects/robotics_testing/para_volume_kv/`.

- **Model code** (you'll need both files since one subclasses the other):
  - `model_dino_volume_kv.py` (base class)
  - `model_dino_volume_kv_full.py` (subclass adding gripper + rotation MLPs)
- **Dataset stats** — at inference time you don't need the dataset, BUT you need the
  rotation/gripper/height bin centers + ranges so you can DECODE bin indices back to
  continuous values. Pull these from the checkpoint's `args` dict OR re-compute by
  loading one frame and reading these dataset attributes:
  - `min_rot`, `max_rot` (per-axis Euler, list of 3 floats each)
  - `min_grip`, `max_grip` (gripper q_motors[6] range)
  - `min_height`, `max_height`, `bin_centers` (heights — already 32 bin centers in metres)
  - The izzy ckpt's `args` only stores hyperparameters, NOT these. **Easiest fix:** pickle
    the dataset stats next to the ckpt. I'm not doing that now, so when you set up the Mac
    side, either load a tiny stats file (once I dump one) or hardcode the values listed in
    section 3 of this message.

- **Checkpoint**:
  - `checkpoints/da3_dino_kv_full_izzy_ft/latest.pth` (~357 MB)
    → Mac path: `~/Projects/robotics_testing/para_volume_kv/checkpoints/izzy_home_v0.pth`
  - This is the production model going forward. It was finetuned for 100 epochs from the
    smith300-diverse-pretrain ckpt onto the single izzy_home_first_record episode at lr=2e-5.

- **DINOv3 weights and repo** — the model loads DINO via `torch.hub.load` from a local repo:
  - Repo: `/data/cameron/keygrip/dinov3/` (on the lab server). Mac equivalent: probably
    `~/Projects/robotics_testing/random/dinov3/` or wherever the older Mac builds expected
    it. The model code reads `DINO_REPO_DIR` and `DINO_WEIGHTS_PATH` env vars and falls
    back to lab paths; on the Mac set:
      - `DINO_REPO_DIR=/Users/cameronsmith/Projects/robotics_testing/random/dinov3`
      - `DINO_WEIGHTS_PATH=.../weights/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth`
  - The S/16+ weights file is ~340 MB. The Mac may already have it from older work.
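
The env-var fallback pattern described above could be mirrored in a deploy script like this. A minimal sketch: only the two variable names come from this message, `resolve_dino_paths` is hypothetical, and this is not the model file's actual implementation.

```python
import os

def resolve_dino_paths(default_repo, default_weights):
    """Read DINO_REPO_DIR / DINO_WEIGHTS_PATH, falling back to the given defaults."""
    repo = os.path.expanduser(os.environ.get("DINO_REPO_DIR", default_repo))
    weights = os.path.expanduser(os.environ.get("DINO_WEIGHTS_PATH", default_weights))
    return repo, weights
```

Expanding `~` here means Mac-side paths can be written relative to the home directory.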

### 3. Dataset stats (so you can decode argmax bins on the Mac)

Computed today over the smith300 first_mobile_collection dataset:

  - `n_window = 8`, `n_height_bins = 32`, `n_rot_bins = 32`, `n_gripper_bins = 32`
  - Height: `min_height = 0.0291 m`, `max_height = 0.1954 m`. Bin width ≈ 5.2 mm.
    Center of bin `z` (in metres) = `0.0291 + (z + 0.5) * (0.1954 - 0.0291) / 32`.
  - Rotation (euler XYZ, radians), per axis:
      - axis 0: min=−3.4528, max=+3.4541 (full ±π-ish, roll)
      - axis 1: min=−1.5857, max=−1.1964 (narrow, pitch — gripper points down)
      - axis 2: min=−3.4518, max=+3.4374 (yaw)
    Center of bin `b` on axis `a`: `min[a] + (b + 0.5) * (max[a] - min[a]) / 32`.
  - Gripper (q_motors[6] convention from smith300 — positive = closed):
    min=−0.0742, max=1.6925. Bin width ≈ 0.055.
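
The bin-centre formulas above reduce to one helper. A minimal sketch with the ranges copied from the list above; `bin_center` and `decode_euler` are illustrative names for what the deploy script would need.

```python
def bin_center(b, lo, hi, n_bins=32):
    """Continuous value at the centre of bin `b` for a range [lo, hi] split into n_bins."""
    return lo + (b + 0.5) * (hi - lo) / n_bins

# Ranges quoted in the list above
MIN_HEIGHT, MAX_HEIGHT = 0.0291, 0.1954
MIN_ROT = [-3.4528, -1.5857, -3.4518]
MAX_ROT = [3.4541, -1.1964, 3.4374]

def decode_euler(rot_bins):
    """rot_bins: three per-axis argmax indices -> Euler XYZ in radians."""
    return [bin_center(b, MIN_ROT[a], MAX_ROT[a]) for a, b in enumerate(rot_bins)]
```

The same `bin_center` call covers height and gripper decoding by swapping in the corresponding range.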

### 4. Inference recipe on the Mac

```python
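import os, sys
import torch
# sys.path entries are used verbatim, so "~" has to be expanded explicitly.
sys.path.insert(0, os.path.expanduser("~/Projects/robotics_testing/para_volume_kv"))
sys.path.insert(0, os.path.expanduser("~/Projects/robotics_testing/random/dinov3"))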
from model_dino_volume_kv_full import DinoVolumeKVFull

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
m = DinoVolumeKVFull(height_enc='sin', time_enc='sin',
                     dino_variant='dinov3_vits16plus').to(device).eval()
sd = torch.load("checkpoints/izzy_home_v0.pth", map_location=device, weights_only=False)
m.load_state_dict(sd["model_state_dict"], strict=True)   # strict ok — full model

# rgb: (3, 448, 448) float32 in [0,1]; gt_pix_504 only needed during training.
# At inference, no query_pixels → no grip/rot heads. To get grip/rot, pass argmax-of-volume.
with torch.no_grad():
    out = m(rgb.unsqueeze(0).to(device))           # no query — gets volume only
    vol = out["volume_logits"][0]                  # (8, 32, 56, 56)
    pf  = out["pixel_feats"]                       # (1, 48, 56, 56)
    # Argmax per timestep → (z*, y*, x*) at 56-grid scale
    flat = vol.reshape(8, -1).argmax(dim=-1)
    pred_z = flat // (56 * 56)
    pred_yx = flat % (56 * 56)
    pred_y = pred_yx // 56
    pred_x = pred_yx %  56
    # Scale pixel back to 504-image space
    pred_pix_504 = torch.stack([pred_x.float() / (56/504), pred_y.float() / (56/504)], dim=-1)
    # Get grip/rot by re-running forward with query=pred pixels
    out2 = m(rgb.unsqueeze(0).to(device), query_pixels_504=pred_pix_504.unsqueeze(0))
    grip_logits = out2["gripper_logits"][0]        # (8, 32)
    rot_logits  = out2["rotation_logits"][0]       # (8, 3, 32)
# Decode bins → continuous values using stats above.
```

### 5. Deploy script update

The current Mac deploy script (Cameron mentioned `deploy_ik_sequence.py` exists at
`/data/cameron/para/panda_streaming/deploy_ik_sequence.py` on the lab; presumably there's
a Mac counterpart in the smith300 streaming setup) needs:

  - Swap model loading from the old DA3-pixel/DA3-volume class to `DinoVolumeKVFull`.
  - Update the inference call signature (now needs RGB only on first pass; second pass
    with `query_pixels_504` for grip/rot decoding).
  - Update the 3D recovery step: argmax gives `(z_bin, y_grid, x_grid)`. Convert:
      - pixel u = x_grid × (504 / 56), v = y_grid × (504 / 56), in 504-space (or rescale
        further to whatever the camera intrinsics expect).
      - world Z = `bin_centers[z_bin]` (the height table above).
      - Rest of the unprojection (`recover_3d_from_direct_keypoint_and_height`) is the same.
  - Add gripper decode: `gripper_value = bin_center(argmax(grip_logits[t]), grip_min, grip_max)`.
  - Add rotation decode: per-axis `euler[a] = bin_center(argmax(rot_logits[t, a]), min_rot[a], max_rot[a])`.

The deploy script's IK + arm command sending is unchanged — just the
"image → (xyz, euler, gripper) per timestep" prediction step gets the new model.

### 6. Verification

Once you have it copied, please:
- Load the model on Mac with MPS and confirm `volume_logits.shape == (1, 8, 32, 56, 56)`.
- Run on one of the izzy_home rgb_*.jpg frames (they're on the Mac at the original
  capture path — let me know where if you can't find them, the lab copy is at
  `/data/cameron/mac_robot_datasets/first_mobile_collection/izzy_home_first_record/`).
- Save a visualisation: argmax keypoints overlaid on the RGB. If it looks reasonable,
  proceed with the deploy script wiring. If the trajectory looks scrambled, ping me.

### 7. Priority

Cameron asked for this now — should be a near-term task. The figure_maker also has a
parallel diagram task using the SAME architecture (see their inbox `volume_kv_method`),
no conflict — different work.

---
## Task forwarded from data_visualizer — 2026-05-07

Cameron dropped this in my inbox but the work lives Mac-side (MuJoCo + smith300 XML + UMI assets), so passing it on.

**1. Add gripper-value HUD to overlay-bakers.** All overlay frames should get a black box in the top-right corner with white text reading `gripper value: <X>` where X is `q_motors[t, 6]` (gripper column for both smith300 and UMI joints.npz). Apply to both bakers (smith300 mesh overlay, umi-rendered) so future overlays for both rigs include this annotation. Keep existing overlay content (robot mesh, EEF crosshair, axes triad) intact — this is just an additional HUD element. Suggested format: round to 3 decimal places, monospace font if PIL/cv2 supports it, ~14–18px text, ~8px padding.

**2. Bake overlays for the two new sessions** I just registered as raw-frames-only:
- `robot_pick_cup` → `/data/cameron/mac_robot_datasets/robot_pick_cup` (617 frames, smith300 7-motor)
- `umi_pick_cup` → `/data/cameron/mac_robot_datasets/umi_pick_cup` (857 frames, mode=umi-rendered)

Once baked, ping me and I'll add `_overlay` sister entries (or repoint the existing entries to the overlay path) — same pattern as smith300_2026_05_05 / umi_2026_05_06.

I'm doing the keyframes-annotation tool half of Cameron's task in the viewer in parallel.
