# Phase B (DA3 fusion) + Phase C (VLM trunk) Results

## Consolidated comparison table (libero_spatial t0, 4 viewpoints × 10 episodes each)

| Model | (0, 0) | (14, 45) | (10, 180) | (14, 225) | **Avg** |
|---|---|---|---|---|---|
| **1v query-MLP (baseline)** | **100%** | **70%** | **100%** | **80%** | **87.5%** |
| DA3 fusion (Exp B, 2 conv layers) | 100% | 60% | 100% | 80% | 85% |
| VLM full fine-tune (Exp C) | 80% | 70% | 90% | 80% | 80% |
| VLM vision-only fine-tune (Exp C) | (hung) | 40% | 90% | 80% | ~70-75% |
| 2v max-fusion | 90% | 80% | 70% | 50% | 72.5% |
| 2v image-concat | 60% | 60% | 70% | 90% | 70% |
| 2v dual-frustum | 70% | 50% | 60% | 90% | 67.5% |
| 2v sum-fusion | 60% | (hung) | 50% | 80% | ~63% |
| 2v POE | 70% | (partial) | (partial) | 40% | ~50%* |

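\* Incomplete run. Cells marked (hung) or (partial) did not finish all 10 episodes, so averages prefixed with ~ are approximate.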
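The averages in the table can be reproduced with a short script. The sketch below is illustrative and not part of the eval code: the `RESULTS` dictionary and `average_completed` helper are hypothetical names, the numbers are copied by hand from the table, and a hung cell is represented as `None` and excluded from the mean. The 2v POE row is left out because its partial cells have no recorded values here.

```python
from statistics import mean

# Per-viewpoint success rates (%) copied by hand from the table above.
# None marks a cell that hung; it is excluded from the average.
RESULTS = {
    "1v query-MLP (baseline)": [100, 70, 100, 80],
    "DA3 fusion": [100, 60, 100, 80],
    "VLM full fine-tune": [80, 70, 90, 80],
    "VLM vision-only fine-tune": [None, 40, 90, 80],
    "2v max-fusion": [90, 80, 70, 50],
    "2v image-concat": [60, 60, 70, 90],
    "2v dual-frustum": [70, 50, 60, 90],
    "2v sum-fusion": [60, None, 50, 80],
}


def average_completed(cells):
    """Return (mean success rate, number of completed cells)."""
    done = [c for c in cells if c is not None]
    return mean(done), len(done)


for name, cells in RESULTS.items():
    avg, n_done = average_completed(cells)
    print(f"{name}: {avg:.1f}% over {n_done}/{len(cells)} viewpoints")
```

Averaging only over completed cells is one possible convention. It makes rows with hung cells not directly comparable to fully evaluated rows, which is why those averages are marked approximate.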

## Findings

**1v query-MLP with DINOv3-S/16+ is the winner.** None of the architectural enhancements we tried (adding a second view from the wrist camera, fusing geometric features from DA3-LARGE, or swapping the trunk for a vision-language model) reliably improves on the simple 1v baseline for libero_spatial task 0.

Specific results:
- **DA3 fusion**: tied with the 1v baseline (85% vs 87.5%, a one-episode difference over 40 episodes, so within noise). Depth-aware features don't add useful information for this task, likely because the planar table geometry in the BEV leaves little depth ambiguity that 1v can't already resolve from RGB.
- **VLM full fine-tune**: 80% (−7.5pp). Vision encoder and LLM were both fine-tuned end-to-end, but the language-conditioned features didn't beat DINO. One possible reason is that the task prompt was static (one task only), so the language conditioning had nothing to discriminate.
- **VLM vision-only fine-tune**: ~70-75% (about −15pp, with the (0, 0) cell hung). Worse than the full fine-tune, which suggests the frozen LLM doesn't usefully condition the vision features.
- **All 2-view variants**: ~50-72.5% on average, all below the 1v baseline. Net negative.
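To put the "within noise" claim on firmer footing, the sketch below computes a Wilson score interval for the pooled success counts. This is an illustration, not part of the eval scripts: `wilson_interval` is a hypothetical helper, the counts are derived from the table (for example, 87.5% of 40 episodes is 35 successes), and pooling the four viewpoints assumes episodes are independent with a shared success rate, which is a simplification.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Pooled counts over 4 viewpoints x 10 episodes = 40 episodes per model,
# derived from the table percentages (an assumption of this sketch).
POOLED = [("1v baseline", 35), ("DA3 fusion", 34), ("VLM full fine-tune", 32)]

for name, successes in POOLED:
    low, high = wilson_interval(successes, 40)
    print(f"{name}: {successes}/40, 95% CI [{low:.1%}, {high:.1%}]")
```

With only 40 episodes per model, each interval is roughly 20 to 25 percentage points wide and the three intervals overlap heavily, so gaps of 2.5pp to 7.5pp cannot be distinguished from sampling noise at this sample size.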

## Implications

**For libero_spatial t0 with sufficient training variation, RGB-only DINO is enough.** The task is simple enough that geometric or language priors don't help, and additional modalities can introduce noise.

This is a consistent negative result across three categories of enhancement (second view, depth, VLM). The implication: to beat 1v query-MLP, we'd need either (a) a harder benchmark where BEV alone is insufficient (occlusion-heavy scenes, manipulation under shelves, cluttered desks), or (b) a fundamentally different architecture (e.g. an action-chunking diffusion head, or a learned planning module on top).

## Files

- Models: `/data/cameron/para/libero/model_dino_da3_fusion.py`, `model_vlm_query.py`
- Trainers: `train_libero_da3fusion.py`, `train_libero_vlm.py`
- Evals: `eval_libero_da3fusion_ood.py`, `eval_libero_vlm_ood.py`
- Checkpoints: `checkpoints/libero_da3fusion_v0/`, `libero_vlm_full_v0/`, `libero_vlm_vision_v0/`
- Per-cell logs: `logs/da3_eval/`, `logs/vlm_full_eval/`, `logs/vlm_vision_eval/`