# PARA — Pixel-Aligned Robot Actions

## What This Project Is

PARA reformulates end-effector robot action prediction as a pixel-aligned objective. Instead of the standard global-image → global-descriptor → action regression pipeline, PARA predicts a **dense heatmap volume** over the image: for each pixel, the model predicts logits over a set of height buckets along that pixel's camera ray.

**Why:** Standard policy architectures force the model to implicitly solve correspondence, geometry, and control in an unstructured output space. PARA lets the model reason in the image frame, where task-relevant cues already live, then lifts predictions into 3D via height.

## Core Formulation

- **2D localization:** the model predicts a heatmap over (H, W) per timestep → argmax gives the (u, v) pixel
- **Height prediction:** per-pixel logits over `N_HEIGHT_BINS` height buckets (world-frame Z) → argmax gives the height bin
- **3D recovery:** given (u, v) and the predicted height, recover the 3D point by intersecting the pixel's camera ray (from the intrinsics) with the plane at that height
- **Gripper/rotation:** predicted by indexing the feature map at the GT pixel (teacher forcing during training) or the argmax pixel (at inference)
- **Loss:** cross-entropy on the 2D pixel location (over the flattened H×W grid) + cross-entropy on height bins + cross-entropy on gripper bins
- **Window:** the next `N_WINDOW=12` timesteps are predicted jointly
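The 3D recovery step can be sketched as a ray–plane intersection. This is a minimal illustration, not the reference implementation: it assumes "normalized intrinsics" means intrinsics divided by image size, and that the camera pose is a world-to-camera 4×4 transform (both per the Data Format section); the function name and signature are hypothetical.

```python
import numpy as np

def recover_3d_point(u, v, z_world, K_norm, world_to_cam, img_size=448):
    """Lift pixel (u, v) plus a predicted world-frame height z_world into a
    3D world point by intersecting the pixel's camera ray with the plane
    Z = z_world. Sketch only; conventions are assumptions, not confirmed."""
    # Scale normalized intrinsics to pixel units (assumed normalization).
    K_px = K_norm.copy()
    K_px[:2] *= img_size

    # Ray direction in the camera frame through pixel (u, v).
    ray_cam = np.linalg.inv(K_px) @ np.array([u, v, 1.0])

    # Camera-to-world: invert the world-to-camera transform.
    cam_to_world = np.linalg.inv(world_to_cam)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]

    # Ray in world frame: origin t, direction R @ ray_cam.
    d = R @ ray_cam
    # Solve t_z + s * d_z = z_world for the scale s along the ray.
    s = (z_world - t[2]) / d[2]
    return t + s * d
```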

## Architecture

- **Backbone:** DINOv3 ViT-S/16 (custom variant, local weights)
  - Weights: `/Users/cameronsmith/Projects/robotics_testing/random/dinov3/`
  - Embed dim: read from `model.dino.embed_dim`
- **Heads:** 1×1 conv on patch features, bilinearly upsampled to (H, W)
  - Volume head: `(B, N_WINDOW * N_HEIGHT_BINS, H, W)`
  - Gripper head: `(B, N_WINDOW * N_GRIPPER_BINS, H, W)`
- **Start keypoint conditioning:** learnable embedding added to the patch token at the current EEF pixel location
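The head structure above can be sketched roughly as follows. This is illustrative, not the reference implementation: the class/argument names are hypothetical, the embed dim default assumes ViT-S, and the output size is parameterized only so the sketch stays cheap to run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_WINDOW, N_HEIGHT_BINS, N_GRIPPER_BINS = 12, 32, 32
IMG_SIZE = 448  # DINO patch size 16 -> 28x28 patch grid

class HeatmapHeads(nn.Module):
    """Sketch of the prediction heads: 1x1 convs on the DINO patch-feature
    grid, bilinearly upsampled to image resolution (hypothetical names)."""

    def __init__(self, embed_dim=384):  # 384 assumes ViT-S; read from model.dino.embed_dim in practice
        super().__init__()
        self.volume_head = nn.Conv2d(embed_dim, N_WINDOW * N_HEIGHT_BINS, kernel_size=1)
        self.gripper_head = nn.Conv2d(embed_dim, N_WINDOW * N_GRIPPER_BINS, kernel_size=1)

    def forward(self, patch_feats, out_size=IMG_SIZE):
        # patch_feats: (B, embed_dim, H/16, W/16) patch-feature grid
        vol = self.volume_head(patch_feats)
        grip = self.gripper_head(patch_feats)
        vol = F.interpolate(vol, size=(out_size, out_size), mode="bilinear", align_corners=False)
        grip = F.interpolate(grip, size=(out_size, out_size), mode="bilinear", align_corners=False)
        B = vol.shape[0]
        # Unflatten the joint (timestep x bin) channel axis.
        vol = vol.view(B, N_WINDOW, N_HEIGHT_BINS, out_size, out_size)
        grip = grip.view(B, N_WINDOW, N_GRIPPER_BINS, out_size, out_size)
        return vol, grip
```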

## Key Hyperparameters

| Parameter | Value |
|---|---|
| Image size | 448 × 448 |
| N_WINDOW | 12 timesteps |
| N_HEIGHT_BINS | 32 |
| N_GRIPPER_BINS | 32 |
| Gripper range | [-0.2, 0.8] |
| DINO patch size | 16 |
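Given the gripper range and bin count above, the continuous↔bin mapping can be sketched as uniform binning. The uniform scheme and bin-center decoding are assumptions (the exact discretization is not confirmed from the reference code); function names are hypothetical.

```python
N_GRIPPER_BINS = 32
GRIPPER_RANGE = (-0.2, 0.8)  # from the hyperparameter table

def value_to_bin(x, lo=GRIPPER_RANGE[0], hi=GRIPPER_RANGE[1], n=N_GRIPPER_BINS):
    """Map a continuous value to a bin index in [0, n-1], assuming uniform bins."""
    idx = int((x - lo) / (hi - lo) * n)
    return min(max(idx, 0), n - 1)  # clamp values at or beyond the range edges

def bin_to_value(idx, lo=GRIPPER_RANGE[0], hi=GRIPPER_RANGE[1], n=N_GRIPPER_BINS):
    """Map a bin index back to its bin-center value."""
    return lo + (idx + 0.5) * (hi - lo) / n
```

The same scheme would apply to the 32 height bins, with the height range substituted for the gripper range.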

## Device / Environment

```python
import torch
device = torch.device(
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
```

- **Mac (local):** MPS — good for debugging, small batch sizes
- **Lab server:** CUDA — full training runs

## Data Paths

| Location | Path |
|---|---|
| Mac — scratch data | `scratch/` (relative to script) or `/Users/cameronsmith/Projects/robotics_testing/3dkeygrip/volume_dino_tracks/scratch/` |
| Lab server — data | TBD — update this when confirmed |
| DINO weights (Mac) | `/Users/cameronsmith/Projects/robotics_testing/random/dinov3/` |
| DINO weights (server) | TBD — update this when confirmed |

## Data Format

Episodes are directories under `dataset_root/`, each named `episode_NNN/`. Per frame `NNNNNN`:
- `NNNNNN.png` — RGB image
- `NNNNNN_gripper_pose.npy` — (4,4) gripper world pose
- `NNNNNN_camera_pose.npy` — (4,4) world-to-camera transform
- `NNNNNN_cam_K.npy` — (3,3) normalized camera intrinsics
- `NNNNNN.npy` — joint state vector (last element = gripper value)
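A per-frame loader following this naming scheme might look like the sketch below. The function name and returned keys are illustrative (not the reference `data.py`); image decoding is left to the caller.

```python
from pathlib import Path
import numpy as np

def load_frame(episode_dir, frame_idx):
    """Load one frame's arrays from an episode directory, following the
    NNNNNN-prefixed naming scheme above (hypothetical helper)."""
    stem = Path(episode_dir) / f"{frame_idx:06d}"
    joints = np.load(f"{stem}.npy")  # joint state vector
    return {
        "image_path": f"{stem}.png",                          # RGB image (decode separately)
        "gripper_pose": np.load(f"{stem}_gripper_pose.npy"),  # (4, 4) gripper world pose
        "camera_pose": np.load(f"{stem}_camera_pose.npy"),    # (4, 4) world-to-camera
        "cam_K": np.load(f"{stem}_cam_K.npy"),                # (3, 3) normalized intrinsics
        "joints": joints,
        "gripper_value": float(joints[-1]),                   # last element = gripper value
    }
```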

## Repo Structure

```
para/
├── CLAUDE.md                  ← you are here: global project context
├── configs/
│   └── base.yaml              ← shared hyperparameters
├── training/
│   ├── CLAUDE.md              ← general BC training context
│   ├── model.py
│   ├── train.py
│   ├── data.py
│   └── utils.py
├── libero/
│   ├── CLAUDE.md              ← Libero benchmark context
│   └── ...
├── panda_streaming/
│   ├── CLAUDE.md              ← real Panda robot + streaming context
│   └── ...
├── video_training/
│   ├── CLAUDE.md              ← video pretraining pipeline context
│   └── ...
└── vlm/
    ├── CLAUDE.md              ← VLM integration context
    └── ...
```

## Reference Implementation

The original prototype lives at:
`/Users/cameronsmith/Projects/robotics_testing/3dkeygrip/volume_dino_tracks/`

Key files: `model.py`, `train.py`, `data.py`, `utils.py`

## Current Status

- [ ] First training task TBD — awaiting instructions
