# PARA — Pixel-Aligned Robot Actions

Robot policy learning that predicts end-effector actions as pixel-aligned heatmaps rather than regressing them in a global coordinate frame. The model reasons in image space — where task-relevant cues already live — then lifts predictions into 3D via a height constraint.

## Core Idea

Standard policy architectures regress end-effector motion in a global coordinate frame, forcing the network to jointly solve correspondence, geometry, and control. PARA instead:

1. **Predicts a dense heatmap volume** over the image: each pixel holds logits over `N_HEIGHT_BINS` height buckets along its ray
2. **Decodes the 3D action** by taking the argmax pixel (2D location) and argmax height bin, then unprojecting to a 3D world point
3. **Predicts gripper/rotation** by indexing patch features at the GT EEF pixel (teacher forcing in training) or the predicted pixel (at inference), then passing them through an MLP
4. **Trains with cross-entropy** over the discretized volume, gripper bins, and Euler-angle rotation bins
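The decode step (2) can be sketched in plain NumPy. The bin range, constant names, and the ray helper are assumptions for illustration; the actual geometry lives in `libero/utils.py`:

```python
import numpy as np

# Assumed constants: the number of height bins and the world-z range they cover.
N_HEIGHT_BINS = 64
HEIGHT_MIN, HEIGHT_MAX = 0.0, 1.0

def decode_action(volume_logits, cam_origin, pixel_to_world_ray):
    """Decode a 3D world point from one timestep's (N_HEIGHT_BINS, H, W) logits.

    pixel_to_world_ray(u, v) is a hypothetical helper returning the world-frame
    ray through pixel (u, v); cam_origin is the camera position in world frame.
    """
    bins, H, W = volume_logits.shape
    # Joint argmax over (height bin, row v, col u).
    b, v, u = np.unravel_index(np.argmax(volume_logits), (bins, H, W))
    # Map the winning bin index to a metric height (bin center).
    z = HEIGHT_MIN + (b + 0.5) * (HEIGHT_MAX - HEIGHT_MIN) / bins
    d = pixel_to_world_ray(u, v)
    # Height constraint: intersect the pixel ray with the plane world-z = z.
    t = (z - cam_origin[2]) / d[2]
    return cam_origin + t * d
```

Intersecting the pixel ray with a horizontal plane at the decoded height is what lets a 2D heatmap plus a height bin recover a full 3D point.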

This keeps the prediction target spatially grounded in the image, improving viewpoint robustness and making generalization across camera placements more natural.
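The cross-entropy target over the discretized volume (step 4) amounts to flattening the (height bin, pixel) triple into a single class index per timestep. A minimal sketch, with assumed shape constants:

```python
import torch
import torch.nn.functional as F

def volume_ce_loss(volume_logits, gt_uv, gt_bin):
    """volume_logits: (B, N_WINDOW, N_HEIGHT_BINS, H, W) spatial logits.
    gt_uv: (B, N_WINDOW, 2) GT EEF pixel (u, v); gt_bin: (B, N_WINDOW) height bin.
    """
    B, T, K, H, W = volume_logits.shape
    flat_logits = volume_logits.view(B * T, K * H * W)
    # Flatten (bin, v, u) into one class index per timestep, matching the
    # (K, H, W) memory order of the logits.
    target = (gt_bin * H * W + gt_uv[..., 1] * W + gt_uv[..., 0]).view(B * T)
    return F.cross_entropy(flat_logits, target)
```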

## Architecture

- **Backbone:** DINOv3 ViT-S/16 (custom variant, local weights)
- **Feature pipeline:** 28×28 patch features → bilinear upsample to 64×64 → 3× Conv2d(3×3)+GELU → per-pixel features
- **Volume head:** 1×1 conv → `(B, N_WINDOW, N_HEIGHT_BINS, 64, 64)` spatial logits
- **Gripper/Rotation heads:** index features at query pixel → LayerNorm → MLP (Linear+GELU+Linear)
  - Gripper: `(B, N_WINDOW, N_GRIPPER_BINS)` — discretized open/close
  - Rotation: `(B, N_WINDOW, 3, N_ROT_BINS)` — discretized Euler angles (xyz)
- **Start keypoint conditioning:** learnable embedding injected at current EEF patch token
- **Window prediction:** jointly predicts next `N_WINDOW=6` timesteps
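The head structure above can be sketched as a standalone module. This is a sketch, not the actual `TrajectoryHeatmapPredictor`: the DINOv3 backbone is omitted in favor of precomputed 28×28 patch features, and the bin counts are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed hyperparameters; N_WINDOW matches the README, the rest are guesses.
N_WINDOW, N_HEIGHT_BINS = 6, 64
N_GRIPPER_BINS, N_ROT_BINS = 2, 36
D = 384  # DINOv3 ViT-S feature dim

class HeatmapHeads(nn.Module):
    def __init__(self):
        super().__init__()
        # 3x Conv2d(3x3)+GELU over the upsampled feature map.
        self.convs = nn.Sequential(*[
            layer for _ in range(3)
            for layer in (nn.Conv2d(D, D, 3, padding=1), nn.GELU())
        ])
        self.volume_head = nn.Conv2d(D, N_WINDOW * N_HEIGHT_BINS, 1)
        self.gripper_head = nn.Sequential(
            nn.LayerNorm(D), nn.Linear(D, D), nn.GELU(),
            nn.Linear(D, N_WINDOW * N_GRIPPER_BINS))
        self.rot_head = nn.Sequential(
            nn.LayerNorm(D), nn.Linear(D, D), nn.GELU(),
            nn.Linear(D, N_WINDOW * 3 * N_ROT_BINS))

    def forward(self, patch_feats, query_uv):
        # patch_feats: (B, D, 28, 28); query_uv: (B, 2) pixel in the 64x64 grid.
        x = F.interpolate(patch_feats, size=(64, 64), mode="bilinear",
                          align_corners=False)
        x = self.convs(x)                                    # (B, D, 64, 64)
        B = x.shape[0]
        volume = self.volume_head(x).view(B, N_WINDOW, N_HEIGHT_BINS, 64, 64)
        # Index per-pixel features at the query location (GT pixel in training,
        # predicted pixel at inference).
        feat = x[torch.arange(B), :, query_uv[:, 1], query_uv[:, 0]]  # (B, D)
        gripper = self.gripper_head(feat).view(B, N_WINDOW, N_GRIPPER_BINS)
        rot = self.rot_head(feat).view(B, N_WINDOW, 3, N_ROT_BINS)
        return volume, gripper, rot
```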

## Repo Structure

```
para/
├── README.md
├── CLAUDE.md                  ← project context for Claude Code
├── libero/
│   ├── CLAUDE.md              ← libero subproject context + server setup
│   ├── model.py               ← TrajectoryHeatmapPredictor (DINOv3 + heads)
│   ├── train.py               ← main training script
│   ├── data.py                ← LIBERO HDF5 dataloader + projection utilities
│   ├── eval.py                ← closed-loop LIBERO rollout evaluation
│   ├── utils.py               ← 3D unprojection geometry
│   ├── debug_libero_projection.py  ← sanity check: project GT EEF onto camera image
│   └── reconstruct_from_tuple_ik_libero.py
├── training/                  ← (planned) general BC training
├── panda_streaming/           ← (planned) real Panda robot streaming
├── video_training/            ← (planned) video pretraining pipeline
└── vlm/                       ← (planned) VLM integration
```

## Subprojects

### `libero/` — LIBERO Simulation Training

Train and prototype PARA on the [LIBERO](https://libero-project.github.io/) simulation benchmark.

**Quick start (Mac/MPS):**
```bash
python libero/train.py \
  --benchmark libero_spatial --task_id 0 --camera agentview \
  --max_demos 10 --batch_size 2 --epochs 1000 \
  --run_name para_libero_t0 --wandb_mode online

# Overfit sanity check
python libero/train.py ... --overfit_one_sample --epochs 200 --vis_every_steps 10

# Closed-loop eval
python libero/eval.py \
  --checkpoint libero/checkpoints/para_libero_t0/best.pth \
  --benchmark libero_spatial --task_id 0 --n_episodes 10
```

**W&B logging:**
- `train_step/loss`, `train_step/volume_loss`, `train_step/gripper_loss`, `train_step/rotation_loss`
- `train/loss`, `val/loss`, `val/pixel_error`, `val/height_error`, `val/gripper_error`
- `vis/train_strip`, `vis/val_strip` — per-timestep strips with EEF projection overlays

## Progress

### ✅ Completed

- **Core model** (`libero/model.py`): DINOv3 backbone → 64×64 conv feature pyramid → volume head + gripper/rotation MLP heads
- **LIBERO dataloader** (`libero/data.py`): loads HDF5 demos, renders via `OffScreenRenderEnv`, projects GT EEF to pixel coords; gripper supervision from `actions[:,6]` (not qpos)
- **Training pipeline** (`libero/train.py`): CE volume + gripper + rotation losses, teacher forcing, AdamW, checkpointing, W&B
- **Eval script** (`libero/eval.py`): closed-loop LIBERO rollout, open-loop window execution (N_WINDOW=6 steps), OSC_POSE delta actions, success rate logging
- **Rotation prediction**: Euler-angle discretization, CE loss, delta rotation for OSC control
- **Verified supervision**: debug tool + overfit sanity check confirm EEF projection and heatmap targets are correct
- **Bug fixes:** vertical flip convention, height range stats, gripper qpos bug (use actions not qpos mean), camera matrix mismatch (projection vs extrinsic), OSC action scaling
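As a sketch of the Euler-angle discretization used for rotation supervision (the bin count and per-axis range here are assumptions):

```python
import numpy as np

N_ROT_BINS = 36                     # assumed
ROT_MIN, ROT_MAX = -np.pi, np.pi    # assumed per-axis range

def euler_to_bins(euler_xyz):
    """Discretize (..., 3) Euler angles into per-axis CE class targets."""
    frac = (np.asarray(euler_xyz) - ROT_MIN) / (ROT_MAX - ROT_MIN)
    return np.clip((frac * N_ROT_BINS).astype(int), 0, N_ROT_BINS - 1)

def bins_to_euler(bins):
    """Map bin indices back to bin-center angles for execution."""
    return ROT_MIN + (np.asarray(bins) + 0.5) * (ROT_MAX - ROT_MIN) / N_ROT_BINS
```

Round-tripping through the bins loses at most half a bin width per axis, which bounds the rotation quantization error fed to the OSC controller.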

### 🔄 In Progress

- Full CUDA training run on the lab server: `libero_spatial` task 0, all 50 demos

## TODO

### Next

- [ ] **Multi-level DINOv3 features (UNet-style):** extract intermediate block features and fuse with skip connections for finer spatial resolution
- [ ] Evaluate success rate across all 10 LIBERO-Spatial tasks
- [ ] Train on `libero_goal`, `libero_object`, `libero_100`

### Longer-term

- [ ] Real robot: port to Panda streaming pipeline (`panda_streaming/`)
- [ ] Video pretraining: leverage internet robot video to pretrain heatmap volume representation
- [ ] VLM integration: language-conditioned policy
- [ ] Viewpoint robustness: train on one camera, eval on others
- [ ] Multi-camera fusion

## Device / Environment

```python
device = torch.device(
    "cuda" if torch.cuda.is_available() else
    "mps"  if torch.backends.mps.is_available() else
    "cpu"
)
```

- **Mac (local):** MPS — debugging, small runs
- **Lab server (USC GVL):** CUDA — full training runs (RTX 6000 Ada, 49GB VRAM)
