# DINO Video Model

The teacher (a frozen DINO) provides intermediate patch features for all 8 frames as regression targets. The student (a fully fine-tuned DINO) takes only the first frame and, with the same low-res self-attention setup, predicts DINO features for all frames at 16×16 patch resolution. Training minimizes MSE against the teacher features.
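A minimal sketch of this training step, with linear layers standing in for the actual DINO ViTs (all names, dims, and the patchify logic here are illustrative, not the repo's API):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: B clips, T=8 frames, 3x256x256 RGB.
# A 256x256 frame at patch size 16 yields 16x16 = 256 patch tokens.
B, T, C, H, W = 2, 8, 3, 256, 256
N_PATCHES, DIM = 16 * 16, 384          # e.g. a ViT-S/16 feature dim
PATCH_DIM = C * 16 * 16

# Stand-ins for the real models (hypothetical, not the repo's modules).
teacher = torch.nn.Linear(PATCH_DIM, DIM)        # frozen feature extractor
student_head = torch.nn.Linear(PATCH_DIM, T * DIM)  # predicts all T frames
for p in teacher.parameters():
    p.requires_grad_(False)

clip = torch.randn(B, T, C, H, W)

# Patchify each frame into 16x16 non-overlapping patches.
patches = clip.unfold(3, 16, 16).unfold(4, 16, 16)
patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, T, N_PATCHES, PATCH_DIM)

# Teacher: patch features for every frame are the regression targets.
with torch.no_grad():
    targets = teacher(patches)                   # (B, T, 256, DIM)

# Student: sees only the first frame, predicts features for all T frames.
first = patches[:, 0]                            # (B, 256, PATCH_DIM)
pred = student_head(first).reshape(B, N_PATCHES, T, DIM).permute(0, 2, 1, 3)

loss = F.mse_loss(pred, targets)
loss.backward()
```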

- **Data**: Same DROID dataset, ~4 fps extraction, 8 frames per clip, 256×256.
- **Wandb**: Only when `--log_wandb` is set.
- **Vis**: DINO PCA visualizations (GT vs pred) when logging; PCA is fit on GT and applied to both for comparable colors.
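The shared-PCA visualization step can be sketched as follows, assuming patch features arrive as `(num_patches, dim)` arrays (the function name and SVD-based PCA are a sketch, not the repo's implementation):

```python
import numpy as np

def pca_rgb(gt_feats, pred_feats):
    """Fit a 3-component PCA on ground-truth patch features only, then
    project BOTH GT and predicted features with the same mean, basis, and
    normalization range, so the two RGB maps have comparable colors."""
    mean = gt_feats.mean(axis=0)
    # PCA basis from the SVD of the centered GT features.
    _, _, vt = np.linalg.svd(gt_feats - mean, full_matrices=False)
    basis = vt[:3].T                              # (dim, 3) projection

    gt_rgb = (gt_feats - mean) @ basis
    # Normalization range also comes from GT, so colors stay aligned.
    lo, hi = gt_rgb.min(axis=0), gt_rgb.max(axis=0)

    def to_unit(x):
        return np.clip((x - lo) / (hi - lo + 1e-8), 0.0, 1.0)

    return to_unit(gt_rgb), to_unit((pred_feats - mean) @ basis)
```

Reshaping each output back to the 16×16 patch grid gives one RGB image per frame.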

## Run

**From repo root** (vidgen):

```bash
cd /data/cameron/vidgen
python -m dino_vid_model.train --keygrip ../keygrip --data-root /data/weiduoyuan/droid_raw/1.0.1 --log_wandb
```

Or **from inside** `dino_vid_model`: `python train.py ...` or `python -m train ...` (same args).

Omit `--log_wandb` to train without wandb. Use `--profile 50` to print a timing breakdown (data, transfer, teacher, student, backward) every 50 steps.

### Fast loading (recommended for large batches)

Pre-extract clips once so training only does `torch.load` instead of decoding MP4s:

```bash
python -m dino_vid_model.precache_clips --data-root /data/weiduoyuan/droid_raw/1.0.1 --cache-dir /path/to/cache --clips-per-video 10 --workers 16
```
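A minimal sketch of what the cached loading path might look like. It assumes each cached clip is a `(T, C, H, W)` uint8 tensor saved to its own `.pt` file; the class name and on-disk layout are illustrative, not necessarily what `precache_clips` produces:

```python
import glob
import os

import torch
from torch.utils.data import Dataset

class CachedClipDataset(Dataset):
    """Serves pre-extracted clips via torch.load, skipping MP4 decoding.

    Assumes one .pt file per clip holding a (T, C, H, W) uint8 tensor;
    this layout is a sketch, not the repo's exact cache format.
    """

    def __init__(self, cache_dir):
        self.paths = sorted(glob.glob(os.path.join(cache_dir, "*.pt")))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        clip = torch.load(self.paths[idx])   # cheap compared to video decode
        return clip.float() / 255.0          # normalize uint8 to [0, 1]
```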

Then train from cache:

```bash
python -m dino_vid_model.train --keygrip ../keygrip --cache-dir /path/to/cache --batch-size 24 --log_wandb
```
