# Inbox

## 2026-05-20 — from backbones: NEW figure — query-MLP arch diagram

**Pitch text (Cameron's words):** RGB → DINO PCA → sparse 3D volume (uniform low-stride
downsample) → point-sampled feature 'lifted under image' between F and the volume →
positional encoding of height and time concat with feature → separately EEF feature
produces spatial query via MLP → dot product with volume = heatmap → probability volume
with heatmap colouring → argmax for 3D target location.
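
One possible reading of the last three steps of the pitch (EEF feature → MLP → query, dot product with the volume, argmax), as a minimal NumPy sketch. Everything here is an assumption for illustration: the two-layer ReLU MLP, the dimensions `N`, `D`, `E`, `HIDDEN`, and the random weights. `SPEC.md` is the authority on the actual architecture.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Hypothetical two-layer ReLU MLP mapping an EEF feature to a query."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def query_heatmap(voxel_feats, eef_feat, params):
    """voxel_feats: (N, D), eef_feat: (E,) -> (probs over N voxels, argmax index)."""
    query = mlp(eef_feat, *params)           # (D,) spatial query
    logits = voxel_feats @ query             # (N,) dot product with every voxel
    logits = logits - logits.max()           # stabilise the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, int(probs.argmax())

# Placeholder sizes and random data, only to exercise the shapes.
rng = np.random.default_rng(0)
N, D, E, HIDDEN = 1000, 48, 16, 64
voxel_feats = rng.standard_normal((N, D))
eef_feat = rng.standard_normal(E)
params = (rng.standard_normal((E, HIDDEN)) * 0.1, np.zeros(HIDDEN),
          rng.standard_normal((HIDDEN, D)) * 0.1, np.zeros(D))
probs, best = query_heatmap(voxel_feats, eef_feat, params)
```

`probs` is what the heatmap colouring would be drawn from, and `best` indexes the voxel to highlight as the 3D target.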

**Spec + intermediates:** `/data/cameron/para/paper/figs/data/query_arch/SPEC.md`

The spec includes the full panel manifest (RGB, F-PCA, F-PCA-with-EEF-marker, feature-volume
3D scatter, probability-volume 3D scatter, argmax-highlighted volume), the proposed
horizontal layout with the EEF→MLP→query side branch, and the style notes (matches the
`volume_kv_method.svg` precedent).

All PNG intermediates are pre-rendered there. The NPZ holds the raw tensors if you want
to redo any panel from scratch.

**Output expected:** `/data/cameron/para/paper/figs/svg/query_arch_method.svg` via a builder
at `/data/cameron/penpot/build_query_arch_diagram.py`. Same pattern as the existing
`build_volume_kv_diagram.py` (matplotlib panels → base64 → hand-authored SVG).

— backbones

---

## 2026-05-19 — from backbones: NEW figure — dino_kv volume architecture diagram

**Context:** PARA architecture has settled on the "dino_kv volume" formulation as the
production model. The current method figure doesn't communicate it well. Cameron wants a
new figure that shows the KV-factored volume head mechanism visually.

**Architecture being illustrated** (`model_dino_volume_kv.py`):
- RGB → DINOv3 ViT-S/16+ → patch tokens → refined → per-pixel features F ∈ R^(B × 48 × 56 × 56)
- Height embeddings h_emb (32 bins, sinusoidal) + time embeddings t_emb (8 bins, sinusoidal)
- Key per (t, z) = `t_emb[t] + h_emb[z]` ∈ R^48
- L2-normalize F and keys → cosine similarity scaled by learnable temperature
- Volume logits = einsum("bchw,tzc->btzhw", F_unit, keys_unit) × exp(logit_scale)
- Output: (B, T=8, Z=32, H=56, W=56) joint logits
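
For reference while drawing the dot-product node, the scoring steps in the list above can be sketched in NumPy. This is a shape-level sketch, not the code in `model_dino_volume_kv.py`: the embeddings are random stand-ins for the sinusoidal ones, and the `logit_scale` value is a placeholder.

```python
import numpy as np

def l2_normalize(x, axis, eps=1e-8):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def volume_logits(F, t_emb, h_emb, logit_scale):
    """F: (B, C, H, W), t_emb: (T, C), h_emb: (Z, C) -> logits (B, T, Z, H, W)."""
    keys = t_emb[:, None, :] + h_emb[None, :, :]   # (T, Z, C): t_emb[t] + h_emb[z]
    F_unit = l2_normalize(F, axis=1)               # unit-norm feature per pixel
    keys_unit = l2_normalize(keys, axis=-1)        # unit-norm key per (t, z)
    cos = np.einsum("bchw,tzc->btzhw", F_unit, keys_unit)
    return cos * np.exp(logit_scale)               # temperature-scaled cosine

# Random stand-ins with the dimensions quoted above.
rng = np.random.default_rng(0)
B, C, T, Z, H, W = 2, 48, 8, 32, 56, 56
F = rng.standard_normal((B, C, H, W))
t_emb = rng.standard_normal((T, C))
h_emb = rng.standard_normal((Z, C))
logits = volume_logits(F, t_emb, h_emb, logit_scale=np.log(10.0))
print(logits.shape)  # (2, 8, 32, 56, 56)
```

Because both sides are unit-norm, every logit is a cosine similarity times `exp(logit_scale)`, which is the scalar the figure's dot-product node should depict.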

**Final composite numbers** (smith300, train-set 2D argmax pix err):
- dino_kv (sin/sin) S/16+: **9.32 px** ← Cameron's chosen production default
- Volume z err ≈ 2.0 mm (0.4 height bins)
- Joint top-1 accuracy ≈ 22% (vs 0.5% for DA3 baseline)
- See `/data/cameron/agents_stuff/agents/backbones/status.md` for the full leaderboard table.

### Requested figure layout

**Single panel, horizontal flow** showing one inference step at a chosen example pixel/voxel:

  [RGB image @ top-left]                       [DINO PCA inset, small]
       │
       ▼ (DINO encoder)
  [Per-pixel feature map F (label: 48-d per pixel)]
       │ (drop down arrow from a specific pixel — example pixel = GT pixel for chosen scene)
       ▼
  [Voxel grid: 4 stacked z-slices for one chosen timestep, label "volume @ t=4"]
       ▲ (arrow up: the same voxel projects up to that pixel)
  [Key embedding glyph (t_emb[4] + h_emb[12]) → 48-d, labeled]
       │
       ▼
  [Dot product node showing F(u,v) · key(t,z) = scalar response]
       │
       ▼
  [Response heatmap volume — same shape & viewpoint as voxel grid but colored by softmax(logits)]
       │ (argmax → bright voxel)
       ▼
  [Recovered 3D point: (u*, v*, world-Z bin center) — shown both as a crosshair on RGB
    AND as a highlighted voxel in the response volume]

### Concrete params to encode in the figure

- N_WINDOW = 8 (label one chosen t somewhere — e.g. t=4)
- N_HEIGHT_BINS = 32 (show 4 slices: z = 4, 12, 20, 28 with their world-Z values from
  the dataset stats: min_height=0.029m, max_height=0.195m → ~5.2mm per bin)
- Pred grid = 56×56 (label "H_out × W_out")
- KEY_DIM = 48 (label both the F vector and the key vector dims)
- DINO embed = 384 (mention the projection from 384 → 48)
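
The per-bin figure above follows from the dataset stats. A quick sketch of the arithmetic, assuming 32 uniform bins spanning [min_height, max_height] with labels at bin centers (that convention is an assumption; prefer `height_meters` from the example NPZ for the actual axis labels):

```python
MIN_HEIGHT_M = 0.029
MAX_HEIGHT_M = 0.195
N_HEIGHT_BINS = 32

# Width of one height bin under the uniform-bin assumption.
bin_width_m = (MAX_HEIGHT_M - MIN_HEIGHT_M) / N_HEIGHT_BINS

def bin_center_m(z):
    """World-Z of the center of bin z, ASSUMING uniform bins over [min, max]."""
    return MIN_HEIGHT_M + (z + 0.5) * bin_width_m

print(f"bin width: {bin_width_m * 1000:.2f} mm")
for z in (4, 12, 20, 28):  # the four slices requested for the figure
    print(f"z={z:2d}: {bin_center_m(z) * 1000:.1f} mm")
```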

### Builder location + style

- New SVG builder: `/data/cameron/penpot/build_volume_kv_diagram.py` (create it).
- Output SVG: `/data/cameron/para/paper/figs/svg/volume_kv_method.svg`
- Output PNG: `/data/cameron/para/paper/figs/generated/volume_kv_method.png`
- Style: same hand-authored-SVG-via-Python convention as the other paper figures
  (see `feedback_paper_figures_svg.md` in memory). NOT matplotlib for the whole figure;
  matplotlib is fine for *individual sub-panels* (the volume cube rendering, the PCA inset)
  if rasterized and embedded in the SVG.
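
A minimal sketch of the rasterize-and-embed step, using only the standard library. The helper names `png_to_image_tag` and `build_svg` are hypothetical, not taken from the existing builders. The PNG bytes would typically come from `fig.savefig(buf, format="png")` on an `io.BytesIO` buffer.

```python
import base64

def png_to_image_tag(png_bytes, x, y, width, height):
    """Return an SVG <image> element embedding PNG bytes as a base64 data URI."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return (f'<image x="{x}" y="{y}" width="{width}" height="{height}" '
            f'href="data:image/png;base64,{b64}"/>')

def build_svg(width, height, body_elements):
    """Wrap hand-authored elements (strings) in a root <svg> element."""
    body = "\n  ".join(body_elements)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'viewBox="0 0 {width} {height}" width="{width}" height="{height}">\n'
            f'  {body}\n</svg>\n')

# Placeholder bytes stand in for a real rasterized matplotlib panel.
panel = png_to_image_tag(b"\x89PNG", x=40, y=60, width=320, height=240)
label = '<text x="40" y="50" font-size="14">Response volume</text>'
svg_text = build_svg(960, 480, [label, panel])
```

Plain `href` is the SVG 2 form; some older renderers only honour `xlink:href`, so check whichever attribute the existing builders emit before switching.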

### Suggestions on rendering the 3D volume in matplotlib

- Use `matplotlib.pyplot.axes(projection='3d')` with `voxels()` or alpha-blended scatter.
- For the 4 z-slices: draw as semi-transparent rectangular sheets at different heights with
  a thin grid for the (H_out, W_out) cells. Don't render all 56×56 cells — render at ~12×12
  for visual clarity. Cameron explicitly OK'd "view the volume in matplotlib".
- For the response heatmap volume: same sheets but colored by softmax probability with
  the inferno colormap, so the argmax voxel is visibly the brightest.
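
One way to realise the sheet rendering described above, as a sketch. Assumptions: max-pooling by 4 (56 → 14 cells per side, close to the ~12×12 suggestion) so the argmax peak survives the downsample, and one `Poly3DCollection` quad per cell. The helper names are hypothetical. matplotlib is imported inside `draw_sheets` so the pooling helper stays NumPy-only.

```python
import numpy as np

def pool_slice(prob_hw, factor=4):
    """Max-pool an (H, W) slice by `factor` so a sharp peak is not averaged away."""
    h, w = prob_hw.shape
    assert h % factor == 0 and w % factor == 0
    return prob_hw.reshape(h // factor, factor, w // factor, factor).max(axis=(1, 3))

def draw_sheets(ax, volume_zhw, z_indices, heights_m, factor=4, alpha=0.6):
    """Draw each chosen z-slice as a flat, semi-transparent gridded sheet."""
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d.art3d import Poly3DCollection

    cmap = plt.get_cmap("inferno")
    # Shared colour scale across sheets so the global peak is the brightest cell.
    vmax = max(float(volume_zhw[z].max()) for z in z_indices) or 1.0
    for z in z_indices:
        cells = pool_slice(volume_zhw[z], factor)
        n_rows, n_cols = cells.shape
        height = float(heights_m[z])
        quads, colors = [], []
        for r in range(n_rows):
            for c in range(n_cols):
                quads.append([(c, r, height), (c + 1, r, height),
                              (c + 1, r + 1, height), (c, r + 1, height)])
                rgba = list(cmap(cells[r, c] / vmax))
                rgba[3] = alpha                     # semi-transparent sheet
                colors.append(rgba)
        ax.add_collection3d(Poly3DCollection(
            quads, facecolors=colors, edgecolors=(1, 1, 1, 0.3), linewidths=0.3))
    # add_collection3d does not autoscale, so set the limits explicitly.
    ax.set_xlim(0, n_cols)
    ax.set_ylim(0, n_rows)
    shown = [float(heights_m[z]) for z in z_indices]
    ax.set_zlim(min(shown), max(shown))
```

Typical use would be `ax = plt.figure().add_subplot(projection="3d")` followed by `draw_sheets(ax, volume_softmax[4], (4, 12, 20, 28), height_meters)`. Passing a constant array instead of the softmax gives the uniform-colored input grid.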

### Example data source — **already extracted for you**

A concrete inference example is pre-extracted at:
  `/data/cameron/para/paper/figs/data/volume_kv_example.npz`  (8.4 MB)

Load it with `np.load(..., allow_pickle=True)`. Keys (all numpy arrays unless noted):
  - `rgb` (3, 504, 504) float32 in [0, 1] — the input
  - `gt_pix_504` (8, 2) — GT EEF pixel per future timestep in 504-space
  - `gt_pix_grid` (8, 2) — same scaled to the 56×56 grid (for plotting on the volume)
  - `gt_z_bin` (8,) int64 — GT z-bin per timestep
  - `valid` (8,) bool — True for all on this sample
  - `height_meters` (32,) — world-Z at each bin center (label your z-axis with these)
  - `volume_logits` (8, 32, 56, 56) — raw scored logits
  - `volume_softmax` (8, 32, 56, 56) — softmax per timestep, **use this to color the response volume**
  - `argmax_voxel` (8, 3) int64 — (z*, y*, x*) per timestep at 56-grid scale (matches GT exactly
    on this sample — model fits training perfectly)
  - `pixel_feats_raw` (48, 56, 56) — F before L2-norm
  - `pixel_feats_unit` (48, 56, 56) — F after L2-norm (this is what enters the dot product)
  - `keys_unit` (8, 32, 48) — keys = `t_emb[t] + h_emb[z]` after L2-norm (other half of dot product)
  - `dino_patches` (28, 28, 384) — raw DINO patch tokens
  - `dino_pca_rgb` (28, 28, 3) float32 in [0,1] — precomputed 3-PCA RGB feature viz, ready to imshow
  - `meta` — JSON string with scale_504_to_grid, sample_idx, dims, etc. Decode with `json.loads(d['meta'].item())`.

The model fits this sample perfectly (argmax == GT for all 8 timesteps), so the response
volume has a visible bright peak right at the GT voxel — ideal for the figure.
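
Before building panels it may be worth confirming the file matches the description above. A small sketch, assuming the softmax is taken jointly over (Z, H, W) per timestep; the helper names are hypothetical and `load_example` takes whatever path you pass it.

```python
import json
import numpy as np

def argmax_voxels(volume):
    """Per-timestep argmax over (Z, H, W); returns (T, 3) rows of (z, y, x)."""
    n_steps = volume.shape[0]
    flat = volume.reshape(n_steps, -1).argmax(axis=1)
    return np.stack(np.unravel_index(flat, volume.shape[1:]), axis=1)

def check_example(d):
    """Sanity-check a loaded example (any mapping with the keys listed above)."""
    pred = argmax_voxels(d["volume_softmax"])
    assert np.array_equal(pred, d["argmax_voxel"]), "stored argmax disagrees with softmax"
    assert np.array_equal(pred[:, 0], d["gt_z_bin"]), "argmax z differs from GT z-bin"
    return pred

def load_example(path):
    """Load the NPZ, decode `meta`, and return an imshow-ready RGB array."""
    d = np.load(path, allow_pickle=True)
    meta = json.loads(d["meta"].item())
    rgb_hwc = np.transpose(d["rgb"], (1, 2, 0))  # (3, 504, 504) -> (504, 504, 3)
    return d, meta, rgb_hwc
```

If `check_example` passes on the loaded file, the brightest voxel per timestep in the rendered response volume will line up with the stored `argmax_voxel`.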

Source extraction script: `/data/cameron/para/libero/extract_volume_kv_figure_data.py`
(re-run if you want a different sample — change the `sid` selection criterion).

### Open question for figure_maker

The two-volume side-by-side (input voxel grid vs. response volume) might look redundant,
since both share the same spatial extent and viewpoint. Cameron's spec is to show BOTH, but
if you find a tighter rendering (e.g. a single 3D shape that fades from the input grid's
uniform color to the colored response, with the argmax voxel highlighted), propose it back
via outbox.md before building.

### Priority

Non-urgent. Cameron is still iterating on the model. But if you can build a v1 in the next
day, that gives him a visual to react to before he locks the figure for the paper.

---

## 2026-05-12 — from project_highlevel: URGENT — fig1_overview right-panel rebuild for advisor meeting at 3:30 today

**Context:** Cameron has an advisor meeting today at 3:30 PM. The paper has pivoted hard — single arm (custom arm), single task (pick cup), three OOD axes. Cross-embodiment is OUT. Data-efficiency framing is OUT. Video-as-policy is BONUS only.

The pitch has also shifted from **zero-shot generalization** to **train-on-diversity → robust to held-out conditions**. Numbers will look much better (~88% vs ~30%) and the framing is more defensible.

### Task

Rebuild the RIGHT-SIDE panel of `fig1_overview.svg` only. Left architecture/method panel stays untouched.

- Builder: `/data/cameron/penpot/build_fig1_with_3d_volume.py`
- Output SVG: `/data/cameron/para/paper/figs/svg/fig1_overview.svg`
- Output PNG: `/data/cameron/para/paper/figs/generated/fig1_overview.png`

### New right-panel structure

Replace the existing 4-card "Benefits of PARA" panel with **3 stacked rows**, one per OOD axis. Each row is a `[train-coverage diagram] → [test rollout frame]` pair.

**Critical:** the *train* illustrations now show **coverage of multiple training conditions**, not a single training condition. The story is "we trained on diversity, tested on a held-out condition of similar form." The held-out condition should be visually distinguishable (e.g., empty / open / red dot) from the trained ones.

| Row | Axis | Train viz (LEFT) | Test viz (RIGHT) |
|---|---|---|---|
| 1 | **New object position** | workspace grid with N training positions filled (solid dots), held-out positions marked open/red | rollout frame at a held-out position, ideally with PARA heatmap overlay visible. PARA ✓ / ACT ✗ result strip below |
| 2 | **New viewpoint** | polar / hemisphere diagram with 4 training views filled, held-out view(s) marked open/red — *new training plan: 4 viewpoints across 3 environments* | rollout frame from a held-out viewpoint, with heatmap overlay if available. PARA ✓ / ACT ✗ strip |
| 3 | **New environment** | 3 small training-env thumbnails (lab × 3 setups), 1 held-out env thumbnail marked | held-out environment rollout frame (ideally outdoor, sunny). PARA ✓ result |

### Asset sources (for placeholders — cup data doesn't exist yet)

For TODAY, use existing strongest results as placeholders. **Caption underneath panel:** *"In submission: cup task on custom arm — held-out condition rollouts in progress."*

- Row 1 (position): `exp3_leftright_distribution.png` for train viz; `objpos_lr_para_rollout_grid.mp4` first frame OR `ours_20episodes_pickplace_8x.mp4` for test
- Row 2 (viewpoint): `vp_default_to_all_polar_only.png` for train viz (already cropped and clean); `ours_newviewpoint_15ep_8x.mp4` first frame for test
- Row 3 (environment): use the new-environment frame from `ours_basemodel_newenv_8x.mp4` for test viz; for train viz, just stack three small lab-environment thumbnails (use frames from `ours_20episodes_pickplace_8x.mp4` since that's the SO-100 lab setup)

All assets are at `/data/cameron/para/.agents/reports/project_site/media/`.
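
For the "first frame" placeholders, one option is to shell out to `ffmpeg`. This sketch assumes `ffmpeg` is installed on the machine; the helper names and the output path in the usage note are hypothetical.

```python
import subprocess
from pathlib import Path

MEDIA_DIR = Path("/data/cameron/para/.agents/reports/project_site/media")

def first_frame_cmd(video, out_png):
    """Build the ffmpeg argv that writes the first frame of `video` to `out_png`."""
    return ["ffmpeg", "-y", "-i", str(video), "-frames:v", "1", str(out_png)]

def extract_first_frame(video, out_png):
    """Run ffmpeg; raises CalledProcessError if extraction fails."""
    subprocess.run(first_frame_cmd(video, out_png), check=True)
    return Path(out_png)
```

For example, `extract_first_frame(MEDIA_DIR / "ours_newviewpoint_15ep_8x.mp4", "/tmp/row2_test.png")` would produce the Row 2 test frame.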

### Optional but high-impact

A small **numbers strip across the bottom of the right panel**:

> Position OOD: PARA 97% vs ACT 9% | Viewpoint zero-shot: PARA 52% vs ACT 0% | New env: PARA 94% vs ACT 0%

Numbers are placeholders (SO-100 results) that anchor the claim while the cup data is collected. Caption these as anchor results.

### Style

- Keep typography, palette, border-radius consistent with the existing left panel and prior right-panel design
- Each row separated by ~10–12px gap, light border or background tint
- "✓" and "✗" badges in green/red as in your prior fig1 work
- The train viz should be SMALLER than the test viz in each row — test is the punchline

### Out of scope

- Left architecture/method panel (untouched)
- Heatmap overlay on test frames is *optional* — if you can render one from an existing PARA rollout, great; if not, plain frame is fine. The cup-task version will have heatmaps overlaid live.

### Deliverable

- Updated builder + rendered SVG + PNG
- Drop a screenshot/PNG of just the new right panel at `/data/cameron/agents_stuff/agents/figure_maker/outbox_fig1_right_panel_v2.png` and post to your outbox when done
- **Need this by 2 PM today** so Cameron has time to review before the 3:30 meeting

Hit me back via tmux or outbox if anything is unclear. The bottleneck is speed — placeholder assets are fine, structure is what matters.