# Aug-token 2v OOF — Results

**The first 2-view (2v) variant to match the 1-view (1v) baseline on average.**

## Setup

- Model: `DinoVolumeQuery2View` with `fusion_mode='aug_token'`
- Key change vs. prior 2v variants: one learned feature vector, `F_oof_token`, is added to the wrist camera's scoring space. The wrist scores in-frustum voxels with the standard `q_F_wrist · F_w_sampled[z,y,x]`; out-of-frustum (OOF) voxels get **zero** wrist contribution (BEV score only); the extra abstain token's logit is `q_F_wrist · F_oof_token`. The softmax runs over `Z*H*W + 1` entries, and the CE target is always one of the `Z*H*W` spatial voxels.
- Training: 20 epochs on vp_train (400 demos = 5 phi values × 8 theta values × 10 demos each)
- Loss: standard volume CE (over augmented Z*H*W+1) + gripper CE + 1D-PCA rotation CE
- Eval: 4 viewpoints × 10 episodes, `--teleport --zero_rotation --clean_scene`
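
The scoring rule in the "Key change" bullet can be sketched in plain Python. This is a minimal illustration, not the code in `DinoVolumeQuery2View`: the function names, the flattened-list layout, placing the abstain logit last, and the toy numbers are all assumptions, and the `z` and `t` terms that also enter the real logits are left out.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aug_token_logits(bev_logits, q_wrist, wrist_feats, in_frustum, f_oof_token):
    """Return Z*H*W + 1 logits: fused spatial logits, then the abstain logit.

    bev_logits:  per-voxel BEV scores, flattened to length Z*H*W
    wrist_feats: per-voxel sampled wrist features (ignored for OOF voxels)
    in_frustum:  per-voxel bool, True if the voxel falls in the wrist frustum
    """
    spatial = [
        bev + (dot(q_wrist, feat) if inside else 0.0)  # OOF voxels: BEV only
        for bev, feat, inside in zip(bev_logits, wrist_feats, in_frustum)
    ]
    abstain = dot(q_wrist, f_oof_token)  # the single extra "+1" entry
    return spatial + [abstain]  # assumption: abstain entry is stored last

def cross_entropy(logits, target):
    """Softmax cross-entropy using a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

# Toy grid of 4 flattened voxels; only voxels 0 and 1 are in the wrist frustum.
bev_logits = [0.5, 1.0, 2.0, 0.0]
q_wrist = [1.0, -1.0]
wrist_feats = [[0.2, 0.1], [1.5, 0.5], [9.0, 9.0], [9.0, 9.0]]
in_frustum = [True, True, False, False]
f_oof_token = [0.3, 0.0]

logits = aug_token_logits(bev_logits, q_wrist, wrist_feats, in_frustum, f_oof_token)
loss = cross_entropy(logits, target=1)  # target is always a spatial index < Z*H*W
```

The wrist features stored for OOF voxels are never read, so whatever the sampler returns outside the frustum cannot leak into the spatial logits.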

## Final results

| Cell | Aug-token | 1v baseline | Δ |
|---|---|---|---|
| (0, 0) default | 100% | 100% | tie |
| (14, 45) in-dist left | 90% | 70% | **+20pp** |
| (10, 180) OOD back | 60% | 100% | **-40pp** |
| (14, 225) OOD back-left | 100% | 80% | **+20pp** |
| **Average** | **87.5%** | **87.5%** | **tie** |

## Comparison across all variants tested

Per-cell entries are success rates in %.

| Model | (0,0) | (14,45) | (10,180) | (14,225) | Avg |
|---|---|---|---|---|---|
| **1v query-MLP** | 100 | 70 | 100 | 80 | **87.5%** |
| **aug-token 2v** | **100** | **90** | 60 | **100** | **87.5%** |
| DA3 fusion (1v + depth) | 100 | 60 | 100 | 80 | 85% |
| VLM full (InternVL trunk) | 80 | 70 | 90 | 80 | 80% |
| OOF mask 2v (per-T learned logit) | 100 | 50 | 60 | 100 | 77.5% |
| 2v max-fusion | 90 | 80 | 70 | 50 | 72.5% |
| 2v image-concat | 60 | 60 | 70 | 90 | 70% |
| 2v dual-frustum | 70 | 50 | 60 | 90 | 67.5% |
| 2v sum (original) | 60 | hung | 50 | 80 | ~63% |
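
The Avg column appears to be the unweighted mean of the four cells. The sketch below recomputes it from the table. Treating the `hung` run as a missing cell and averaging the remaining three is an assumption, though it does reproduce the ~63% shown for `2v sum`.

```python
# Per-cell success rates (%) copied from the table; None marks the hung run.
results = {
    "1v query-MLP": [100, 70, 100, 80],
    "aug-token 2v": [100, 90, 60, 100],
    "DA3 fusion (1v + depth)": [100, 60, 100, 80],
    "VLM full (InternVL trunk)": [80, 70, 90, 80],
    "OOF mask 2v (per-T learned logit)": [100, 50, 60, 100],
    "2v max-fusion": [90, 80, 70, 50],
    "2v image-concat": [60, 60, 70, 90],
    "2v dual-frustum": [70, 50, 60, 90],
    "2v sum (original)": [60, None, 50, 80],
}

def average(cells):
    """Mean over completed cells only (assumption: hung runs are skipped)."""
    done = [c for c in cells if c is not None]
    return sum(done) / len(done)

averages = {name: average(cells) for name, cells in results.items()}

# Print models from best to worst average.
for name, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{name:36s} {avg:5.1f}%")
```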

## Why aug-token works

Earlier 2v variants had an implicit bias: OOF voxels were scored as `bev + 0 + z + t`, while in-frustum voxels were scored as `bev + wrist_score + z + t`. The non-zero wrist score on in-frustum voxels pulled the argmax toward them whether or not the target was actually there.
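
A toy example of that bias, with made-up numbers and the `z` and `t` terms omitted: the BEV alone ranks the OOF target voxel first, but a positive wrist score on an in-frustum voxel flips the fused argmax.

```python
# Three flattened voxels; voxel 0 is the true target and is out of the wrist frustum.
bev = [2.0, 1.5, 0.0]
in_frustum = [False, True, True]
wrist_score = [0.0, 1.0, 0.8]  # illustrative positive scores on in-frustum voxels

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Sum fusion as in the earlier variants: OOF voxels receive bev + 0.
fused = [b + (w if inside else 0.0)
         for b, w, inside in zip(bev, wrist_score, in_frustum)]

bev_pick = argmax(bev)      # BEV alone picks the OOF target
fused_pick = argmax(fused)  # fusion moves the pick to an in-frustum voxel
print(bev_pick, fused_pick)
```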

Aug-token fixes this structurally:
1. OOF voxels get **zero** wrist contribution → no bias relative to BEV
2. Wrist's "I don't know" energy goes to the +1 abstain token → doesn't pollute spatial argmax
3. Same scoring mechanism (`q · feature`) — no new parameter types or heads, just one extra feature vector

The model learns when to put mass on the abstain token (when wrist isn't useful) vs. on a specific in-frustum voxel (when wrist confirms a spatial position).

## Caveat

Aug-token ties 1v on average but has a different per-viewpoint profile. It wins by 20pp at both (14, 45) and (14, 225), viewpoints where the BEV camera is rotated such that the wrist may provide complementary geometric information, but loses by 40pp at (10, 180), the back-flipped BEV viewpoint. The 1v baseline scores 100% there, so the BEV features alone appear sufficient, and the wrist's gripper-anchored view seems to hurt rather than help; the reason is not yet understood.

It is worth investigating why (10, 180) is the failure case; one possibility is that the eval-time camera projection for that viewpoint is degenerate. Still, average parity with 1v suggests the 2v formulation is no longer net-negative: the wrist appears to contribute when its frustum is informative.
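
With 10 episodes per cell, a single episode moves a cell by 10pp, so the per-cell gaps carry wide uncertainty. The sketch below computes approximate 95% Wilson score intervals from the success counts implied by the results table. The choice of interval and the overlap check are illustrative assumptions, not part of the eval script.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Successes out of 10 episodes per cell: (aug-token, 1v baseline).
cells = {
    "(0, 0)": (10, 10),
    "(14, 45)": (9, 7),
    "(10, 180)": (6, 10),
    "(14, 225)": (10, 8),
}

overlaps = {}
for name, (aug, base) in cells.items():
    a_lo, a_hi = wilson_interval(aug, 10)
    b_lo, b_hi = wilson_interval(base, 10)
    overlaps[name] = a_lo <= b_hi and b_lo <= a_hi
    print(f"{name:10s} aug [{a_lo:.2f}, {a_hi:.2f}]  "
          f"1v [{b_lo:.2f}, {b_hi:.2f}]  overlap={overlaps[name]}")
```

Under these assumptions the two models' intervals overlap in every cell, including (10, 180), so the per-viewpoint profile is suggestive rather than conclusive until more episodes are run.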

## Files

- Model: `/data/cameron/para/libero/model_dino_volume_query_2view.py` (`fusion_mode='aug_token'`)
- Trainer: `/data/cameron/para/libero/train_libero_2view.py` (`--fusion_mode aug_token`)
- Eval: `/data/cameron/para/libero/eval_libero_2view_ood.py` (auto-detects the augmented `Z*H*W + 1` output shape and slices off the abstain token before decoding)
- Ckpt: `/data/cameron/para/libero/checkpoints/libero_2view_augtoken_v0/latest.pth`
- Per-cell logs: `/data/cameron/para/libero/logs/augtoken_eval/`
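
The decoding step mentioned for the eval script can be sketched as follows. This is a hypothetical helper, not the code in `eval_libero_2view_ood.py`: it assumes the abstain logit is the last entry and that voxels are flattened in row-major `(z, y, x)` order.

```python
def decode_voxel(logits, Z, H, W):
    """Argmax over spatial entries only, returned as (z, y, x)."""
    n = Z * H * W
    if len(logits) == n + 1:
        logits = logits[:n]  # augmented output: drop the trailing abstain logit
    elif len(logits) != n:
        raise ValueError(f"expected {n} or {n + 1} logits, got {len(logits)}")
    flat = max(range(n), key=logits.__getitem__)
    z, rem = divmod(flat, H * W)  # assumption: row-major (z, y, x) flattening
    y, x = divmod(rem, W)
    return z, y, x

# Toy 2x2x3 grid: the abstain logit is the largest entry but is never decoded.
Z, H, W = 2, 2, 3
toy_logits = [0.0] * (Z * H * W) + [9.0]
toy_logits[7] = 5.0
voxel = decode_voxel(toy_logits, Z, H, W)
```

Slicing before the argmax means a high abstain logit can never be returned as a spatial prediction, which matches the CE target always being a spatial voxel.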