# PARA — Pixel-Aligned Robot Actions

## What This Project Is

PARA reformulates end-effector robot action prediction as a pixel-aligned objective. Instead of the standard global-image → global-descriptor → action regression pipeline, PARA predicts a **dense heatmap volume** over the image: each pixel predicts logits over a set of height buckets along that pixel's ray.

**Why:** Standard policy architectures force the model to implicitly solve correspondence, geometry, and control in an unstructured output space. PARA lets the model reason in the image frame, where task-relevant cues already live, then lifts predictions into 3D via height.

## Core Formulation

- **2D localization:** model predicts a heatmap over (H, W) per timestep → argmax gives (u, v) pixel
- **Height prediction:** per-pixel logits over `N_HEIGHT_BINS` height buckets (world-frame Z) → argmax gives height bin
- **3D recovery:** given (u, v) and predicted height, recover 3D point via camera intrinsics + height constraint
- **Gripper/rotation:** predicted by indexing the feature map at the GT pixel (teacher forcing in train) or argmax pixel (inference)
- **Loss:** cross-entropy on 2D pixel location (flattened H×W) + cross-entropy on height bins + cross-entropy on gripper bins
- **Window:** predicts next `N_WINDOW=12` timesteps jointly
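The 3D-recovery step above can be sketched as a ray/plane intersection. `pixel_height_to_world` is a hypothetical helper (not from the repo); it assumes the per-frame intrinsics have been scaled to pixel units and that the stored pose is the world-to-camera transform described in the data format section.

```python
import numpy as np

def pixel_height_to_world(u, v, z_world, K, T_world2cam):
    """Lift a pixel (u, v) plus a predicted world-frame height z_world to a
    3D point. Hypothetical sketch: K is (3, 3) pixel-unit intrinsics,
    T_world2cam is the (4, 4) world-to-camera transform stored per frame."""
    # Back-project the pixel to a ray direction in the camera frame.
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])

    # Invert world->camera to get the camera center and rotation in world frame.
    T_cam2world = np.linalg.inv(T_world2cam)
    R_wc, c_w = T_cam2world[:3, :3], T_cam2world[:3, 3]

    # Ray in world coordinates: p(t) = c_w + t * (R_wc @ d_cam).
    d_world = R_wc @ d_cam

    # Solve for t where the ray crosses the horizontal plane Z = z_world.
    t = (z_world - c_w[2]) / d_world[2]
    return c_w + t * d_world
```

The height bucket acts as the plane constraint that makes the otherwise under-determined back-projection unique.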

## Architecture

- **Backbone:** DINOv3 ViT-S/16 (custom variant, local weights)
  - Weights: `/Users/cameronsmith/Projects/robotics_testing/random/dinov3/`
  - Embed dim: read from `model.dino.embed_dim`
- **Heads:** 1×1 conv on patch features, bilinearly upsampled to (H, W)
  - Volume head: `(B, N_WINDOW * N_HEIGHT_BINS, H, W)`
  - Gripper head: `(B, N_WINDOW * N_GRIPPER_BINS, H, W)`
- **Start keypoint conditioning:** learnable embedding added to the patch token at the current EEF pixel location
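The volume head's shape flow can be sketched as follows. This is a minimal stand-in, not the repo's exact module: it assumes a ViT-S embed dim of 384 and a 28×28 patch grid for a 448×448 image at patch size 16.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_WINDOW, N_HEIGHT_BINS, IMG = 12, 32, 448  # values from the table below

class VolumeHead(nn.Module):
    """Minimal sketch of the dense heatmap-volume head (assumed shapes)."""
    def __init__(self, embed_dim):
        super().__init__()
        # 1x1 conv on patch features, one channel per (timestep, height bin).
        self.proj = nn.Conv2d(embed_dim, N_WINDOW * N_HEIGHT_BINS, kernel_size=1)

    def forward(self, patch_feats):
        # patch_feats: (B, D, H/16, W/16) DINO patch grid.
        logits = self.proj(patch_feats)                       # (B, T*Z, h, w)
        logits = F.interpolate(logits, size=(IMG, IMG),
                               mode="bilinear", align_corners=False)
        B = logits.shape[0]
        return logits.view(B, N_WINDOW, N_HEIGHT_BINS, IMG, IMG)

# Decoding timestep 0: pick the pixel by max over the height-bin axis,
# then read the height bin at that pixel.
vol = VolumeHead(384)(torch.randn(1, 384, 28, 28))  # (1, 12, 32, 448, 448)
heat = vol[0, 0].max(dim=0).values                  # (448, 448) per-pixel max logit
idx = heat.flatten().argmax()
u, v = idx % IMG, idx // IMG                        # column, row
z_bin = vol[0, 0, :, v, u].argmax()
```

The gripper head follows the same pattern with `N_GRIPPER_BINS` channels per timestep, read out at the GT pixel (training) or argmax pixel (inference).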

## Key Hyperparameters

| Parameter | Value |
|---|---|
| Image size | 448 × 448 |
| N_WINDOW | 12 timesteps |
| N_HEIGHT_BINS | 32 |
| N_GRIPPER_BINS | 32 |
| Gripper range | [-0.2, 0.8] |
| DINO patch size | 16 |
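As an illustration of the discretization the table implies, a uniform bucketing of the gripper range would look like the sketch below (the train code may clip or bucket differently — treat these helpers as assumptions):

```python
import numpy as np

G_MIN, G_MAX, N_GRIPPER_BINS = -0.2, 0.8, 32  # values from the table above

def gripper_to_bin(g):
    """Map a continuous gripper value into one of the 32 class bins
    (assumed uniform bucketing with clipping at the range ends)."""
    frac = (np.clip(g, G_MIN, G_MAX) - G_MIN) / (G_MAX - G_MIN)
    return min(int(frac * N_GRIPPER_BINS), N_GRIPPER_BINS - 1)

def bin_to_gripper(b):
    """Inverse: bin index back to the bin-center gripper value."""
    return G_MIN + (b + 0.5) * (G_MAX - G_MIN) / N_GRIPPER_BINS
```

Height bins would follow the same scheme over whatever Z range the dataset covers.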

## Device / Environment

```python
import torch
device = torch.device(
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
```

- **Mac (local):** MPS — good for debugging, small batch sizes
- **Lab server:** CUDA — full training runs

## Data Paths

| Location | Path |
|---|---|
| Mac — scratch data | `scratch/` (relative to script) or `/Users/cameronsmith/Projects/robotics_testing/3dkeygrip/volume_dino_tracks/scratch/` |
| Lab server — data | TBD — update this when confirmed |
| DINO weights (Mac) | `/Users/cameronsmith/Projects/robotics_testing/random/dinov3/` |
| DINO weights (server) | TBD — update this when confirmed |

## Data Format

Episodes are directories under `dataset_root/`, each named `episode_NNN/`. Per frame `NNNNNN`:
- `NNNNNN.png` — RGB image
- `NNNNNN_gripper_pose.npy` — (4,4) gripper world pose
- `NNNNNN_camera_pose.npy` — (4,4) world-to-camera transform
- `NNNNNN_cam_K.npy` — (3,3) normalized camera intrinsics
- `NNNNNN.npy` — joint state vector (last element = gripper value)

## Repo Structure

```
para/
├── CLAUDE.md                  ← you are here: global project context
├── configs/
│   └── base.yaml              ← shared hyperparameters
├── training/
│   ├── CLAUDE.md              ← general BC training context
│   ├── model.py
│   ├── train.py
│   ├── data.py
│   └── utils.py
├── libero/
│   ├── CLAUDE.md              ← Libero benchmark context
│   └── ...
├── panda_streaming/
│   ├── CLAUDE.md              ← real Panda robot + streaming context
│   └── ...
├── video_training/
│   ├── CLAUDE.md              ← video pretraining pipeline context
│   └── ...
└── vlm/
    ├── CLAUDE.md              ← VLM integration context
    └── ...
```

## Reference Implementation

The original prototype lives at:
`/Users/cameronsmith/Projects/robotics_testing/3dkeygrip/volume_dino_tracks/`

Key files: `model.py`, `train.py`, `data.py`, `utils.py`

## Current Status

- [ ] First training task TBD — awaiting instructions

## Key Experiments

### Basic Experiments (OOD Object Positions and Viewpoints)

Compare against a global-regression baseline on (a) generalization to novel object positions and (b) generalization to novel viewpoints. Also measure data efficiency: PARA should reach comparable success rates from fewer demonstrations.

### Video as Policy with PARA

Compare the PARA pixel-aligned head against a global regression head on a joint video/action policy. Hypothesis: the pixel-aligned formulation is substantially more data-efficient when learning the joint policy.

### Large-Scale Pretraining

Pretrain on the large-scale DROID dataset.