# Query-MLP architecture figure — SPEC for figure_maker

**Ask:** build a publication-quality SVG, styled like the output of the existing
`build_volume_kv_diagram.py`, illustrating the new query-MLP PARA head architecture.
Cameron's pitch text for the figure:

> "RGB → DINO PCA → sparse 3D volume (uniform low-stride downsample) → point-sampled
> feature 'lifted under image' between F and the volume → positional encoding of height
> and time concat with feature → separately EEF feature produces spatial query via MLP
> → dot product with volume = heatmap → probability volume with heatmap colouring →
> argmax for 3D target location."

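To make the data flow in the pitch concrete, here is a minimal NumPy sketch of the lift, concat, dot-product, and argmax steps. It is an illustration under assumptions, not the model code: the volume is dense rather than sparse, the 64 query channels are split 32/16/16 across feature, height, and time, the positional encoding is the standard sin/cos form, and a random vector stands in for the Res-MLP query. `sinusoid` and `score_volume` are hypothetical names.

```python
import numpy as np

def sinusoid(pos, dim):
    """Standard sin/cos encoding of positions -> (len(pos), dim); dim must be even."""
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    ang = np.asarray(pos, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

def score_volume(F, q, t, n_z, dz, dt):
    """F: (H, W, C) feature map; q: (C + dz + dt,) query for timestep t.

    Returns the softmax probability volume (Z, H, W) and its argmax (z, y, x).
    """
    H, W, C = F.shape
    sin_z = sinusoid(np.arange(n_z), dz)   # (Z, dz), one code per height level
    sin_t = sinusoid([t], dt)[0]           # (dt,), shared by every voxel
    # "Lift" F under the image: every height level reuses F[y, x],
    # then concatenate the height and time encodings per voxel.
    V = np.concatenate([
        np.broadcast_to(F[None], (n_z, H, W, C)),
        np.broadcast_to(sin_z[:, None, None, :], (n_z, H, W, dz)),
        np.broadcast_to(sin_t, (n_z, H, W, dt)),
    ], axis=-1)
    logits = V @ q                         # dot product q . V -> (Z, H, W)
    probs = np.exp(logits - logits.max())  # numerically stable softmax over voxels
    probs /= probs.sum()
    target = np.unravel_index(np.argmax(probs), probs.shape)
    return probs, target

# Usage with random stand-ins (assumed shapes, not the real tensors).
rng = np.random.default_rng(0)
F_demo = rng.normal(size=(8, 8, 32))
q_demo = rng.normal(size=64)
probs_demo, target_demo = score_volume(F_demo, q_demo, t=6, n_z=4, dz=16, dt=16)
print("argmax voxel (z, y, x):", target_demo)
```

The figure only needs to convey these four stages in order; the channel split and the number of height levels are placeholders.
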
## Where the intermediates live

All pre-rendered PNGs and the raw NPZ are in:

`/data/cameron/para/paper/figs/data/query_arch/`

| File | What it shows | Native size |
|---|---|---|
| `rgb.png` | Input RGB frame (sample 414, izzy3 train) | 504×504 |
| `f_pca.png` | F feature map → 3-PCA → RGB | 56×56 |
| `f_pca_eef.png` | F PCA with a red bullseye at the current EEF pixel (start_pix) | ~800×800 |
| `feature_volume.png` | Sparse 3D scatter of voxels colored by F PCA at their (y, x) — the "feature volume" before scoring | ~1100×1100, transparent |
| `prob_volume.png` | Top-25 voxels by softmax probability at T*=6, coloured with plasma colormap — the "probability volume" | ~1100×1100, transparent |
| `prob_volume_argmax.png` | Same top-25 voxels, with the per-step argmax voxel highlighted (bright green with black outline) | ~1100×1100, transparent |
| `arch_overview.png` | Debug 2×3 composite for sanity (don't embed — just for reference) | — |
| `example.npz` | Raw tensors if you need additional renders (volume_logits, F, gt_pix, start_pix, argmax voxels, etc.) | — |

Sample chosen: izzy3 sample 414 (n_valid=50, trajectory span=250 px, T*=6). Ckpt:
`dino_query_izzy3_t50_pca1d_v0/latest.pth` (EEF+CLS query-MLP, 1D PCA rotation).

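If extra renders are needed, a quick way to see what `example.npz` contains is to list its arrays with shapes and dtypes. The sketch below uses only `numpy.load`; the helper name `summarize_npz` is hypothetical, and the exact key names in the file should be read from its output rather than assumed.

```python
import numpy as np

def summarize_npz(path):
    """Return {array_name: (shape, dtype_string)} for every array in an NPZ file."""
    with np.load(path) as data:
        return {
            name: (data[name].shape, str(data[name].dtype))
            for name in data.files
        }
```

Calling `summarize_npz("/data/cameron/para/paper/figs/data/query_arch/example.npz")` and printing the result should show which of the tensors listed in the table are present and at what resolution.
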
## Layout (proposed — feel free to refine)

A wide horizontal flow with one side branch for the EEF→MLP query path. Sketch:

```
                                                   ┌───────────────────┐
                                                   │ sin(t) AdaLN-Zero │
                                                   │ conditioning      │
                                                   └────────┬──────────┘
                                                            │
[rgb] ──► [DINO ViT-S/16+] ──► [F (DINO PCA)]               ▼
                                     │              ┌──────────────────┐
                                     │ ┌──────────► │  5-layer Res-MLP │ ──► q  (per-t query)
                                     │ │ EEF + CLS  │  (AdaLN on t)    │           │
                                     │ │            └──────────────────┘           │
                                     ▼ │ broadcast (z,t)                           │
                              [Feature volume V                                    │
                               sparse 3D, F-PCA                                    │
                               colors; + sin(z)                                    │
                               + sin(t) concat]                                    │
                                     │                                             │
                                     │   ◄─────────────── dot product q · V ───────┘
                                     ▼
                             [Probability volume
                              top-K voxels, plasma]
                                     │
                                     ▼
                              [argmax → (z*, y*, x*)]
                              one bright green voxel
```

### Panel placements (rough — please position on the canvas as you see fit):

1. **`rgb.png`** — far left, ~180×180 px.
2. **`f_pca.png`** (no EEF marker) — between DINO and the volume; ~180×180 px. Add a small
   "DINO ViT-S/16+ + 1×1 conv" arrow label between rgb and f_pca.
3. **`feature_volume.png`** — large central panel, ~360×360 px. Right of f_pca. Add a
   double-headed arrow / line from a few pixels in f_pca up into the volume to indicate
   "each voxel samples F at its (y, x)" (the "between F and the volume" hint).
4. **Side branch — EEF→MLP→query:**
   - **`f_pca_eef.png`** — small (~120×120) above the central volume, showing the EEF
     pixel sample location.
   - Then a small labelled box "5-layer Res-MLP w/ AdaLN-Zero(sin(t))" — figure_maker draws.
   - Then a labelled small box / icon "q  ∈ ℝ⁶⁴ (per-t query)".
5. **Dot product symbol** between query branch output and the feature volume — circle with
   a dot, or "q · V" text with arrow.
6. **`prob_volume.png`** — large, right of the dot product. ~360×360 px.
7. **`prob_volume_argmax.png`** — far right, ~360×360 px. Or stack vertically with
   prob_volume to show the argmax as a final step.
8. **Caption text under the rightmost panel**: "argmax → (z*, y*, x*): 3D target voxel"

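As a starting point for the main row, the placements above can be sketched as a single left-to-right strip of `<image>` elements. This is a rough sketch only: it uses the standard-library `xml.etree.ElementTree`, not whatever `build_volume_kv_diagram.py` uses. The gap, margin, and single-row ordering are assumptions, the side branch and dot-product symbol are left out, and plain `href` assumes an SVG 2 renderer (older tools may need `xlink:href`). `PANELS` and `build_row` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# File names and panel sizes come from the spec; everything else is assumed.
PANELS = [
    ("rgb.png", 180),
    ("f_pca.png", 180),
    ("feature_volume.png", 360),
    ("prob_volume.png", 360),
    ("prob_volume_argmax.png", 360),
]

def build_row(panels, gap=60, margin=40):
    """Lay square panels out left to right, vertically centred on one row."""
    height = max(size for _, size in panels) + 2 * margin
    width = sum(size for _, size in panels) + gap * (len(panels) - 1) + 2 * margin
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    x = margin
    for href, size in panels:
        ET.SubElement(svg, "image", {
            "href": href,
            "x": str(x),
            "y": str((height - size) // 2),  # centre smaller panels on the row
            "width": str(size),
            "height": str(size),
        })
        x += size + gap
    return svg

svg_text = ET.tostring(build_row(PANELS), encoding="unicode")
```

Arrows, labels, and the EEF→MLP branch would be added on top of this strip; the sketch only fixes the panel order so it can be checked before any styling work.
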
## Style notes (consistent with `volume_kv_method.svg`)

- White background, no grid lines around the 3D scatters (they're already transparent PNGs).
- Sans-serif font (Helvetica/Inter) for labels.
- Use Cameron's brand-y blue/teal for arrows (look at the existing figures for the exact hex).
- Section labels in small caps below each panel.
- Math: the dot-product formula
  `score(z, y, x) = ⟨q_F, F[y,x]⟩ + ⟨q_z, sin_z[z]⟩ + ⟨q_t, sin_t[t]⟩`
  somewhere in or near the dot-product step (small font, monospace OK).

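For the math label, the three-term formula is just the dot product of the query with the concatenated voxel feature, written block by block. The worked form below assumes the query splits as `q = [q_F; q_z; q_t]` to match the concat order in the layout sketch, and that the softmax runs over all voxels for a fixed t; neither detail is confirmed beyond the formula above.

```latex
\begin{aligned}
q &= [\,q_F;\ q_z;\ q_t\,], \qquad
V[z,y,x] = [\,F[y,x];\ \mathrm{sin}_z[z];\ \mathrm{sin}_t[t]\,] \\
\mathrm{score}(z,y,x) &= \langle q,\ V[z,y,x] \rangle
  = \langle q_F, F[y,x] \rangle
  + \langle q_z, \mathrm{sin}_z[z] \rangle
  + \langle q_t, \mathrm{sin}_t[t] \rangle \\
p(z,y,x) &= \frac{\exp\big(\mathrm{score}(z,y,x)\big)}
                 {\sum_{z',y',x'} \exp\big(\mathrm{score}(z',y',x')\big)} \\
(z^*, y^*, x^*) &= \arg\max_{z,y,x}\ p(z,y,x)
\end{aligned}
```

Under that assumption the last term is the same for every voxel at a given t, so it cancels in the softmax and does not move the argmax. If space is tight, the figure can show only the score line.
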
## Optional polish (do or skip — your call)

- **DINO patch tokens inset** between rgb and f_pca — small 28×28 grid stylised tile, just to
  hint that DINO outputs patch tokens that get refined into F.
- **Highlight the EEF pixel→F sampling**: a thin line from the red EEF marker in f_pca_eef
  down into the feature volume, into one of its voxels — visualises "feature gets sampled".
- **Multiple-timestep note**: the figure shows a single t for clarity. The caption could
  say "(shown for t=6; the volume head produces one probability volume per t=0…49)".

## Don't

- Don't embed `arch_overview.png` — that's only my debug composite.
- Don't redo the 3D renders from scratch (the PNGs are good; just compose).
- Don't show all 50 timesteps — pick one and note "per-t" in caption.

## Hand-off

Cameron explicitly said: ultrathink the layout, then hand to figure_maker for the SVG
composition. Output expected at `/data/cameron/para/paper/figs/svg/query_arch_method.svg`
(matching the volume_kv_method pattern), with the builder script at
`/data/cameron/penpot/build_query_arch_diagram.py`.

If anything in this spec is ambiguous, please ask before building — the worst outcome is
a beautifully composed figure with the wrong panel order.

— backbones
