# System overview — PARA on a real Panda

## Why we're doing this

Standard robot policies try to predict 7-DoF EEF poses (or joint angles)
directly from a global image descriptor. That puts the burden of solving
correspondence, geometry, and control on a single regression head.

PARA (**Pixel-Aligned Robot Actions**) flips this. Instead of regressing
a pose in one shot, the model answers two structured questions:

1. **In the image, which pixel does the EEF go to?** (a 2D heatmap →
   `(u, v)`).
2. **At that pixel, how high above the table is it?** (per-pixel
   discretized height → height bin).

Together with the camera intrinsics and extrinsics, `(u, v)` and `h` pin
down a single 3D world point: the point on that pixel's camera ray at
height `h`. From that 3D point we run inverse kinematics on the Panda to
recover joint targets, and we publish those over rosbridge to the robot.
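As a sketch of that lifting step, the snippet below intersects the pixel's camera ray with the horizontal plane at height `h`. It is a minimal illustration, not the repo's code: the function name, the toy `K`, and the toy camera pose are made up, and it assumes `T_cam_world` is a 4x4 matrix that maps camera-frame points into the world frame, that world +z points up, and that `h` has already been converted from a height bin to metres above a table at `z = table_z`.

```python
import numpy as np

def unproject_pixel_at_height(u, v, h, K, T_cam_world, table_z=0.0):
    """Lift pixel (u, v) to the world point on its camera ray at height h.

    Assumptions (not confirmed by the repo): T_cam_world maps camera-frame
    points into the world frame, world +z is up, the table is z = table_z.
    """
    # Back-project the pixel to a ray direction in the camera frame.
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Rotate the ray into the world frame; the camera centre is the translation.
    R, origin = T_cam_world[:3, :3], T_cam_world[:3, 3]
    ray_world = R @ ray_cam
    if abs(ray_world[2]) < 1e-9:
        raise ValueError("ray is parallel to the table plane")
    # Solve origin_z + s * ray_z = table_z + h for the ray parameter s.
    s = (table_z + h - origin[2]) / ray_world[2]
    if s <= 0:
        raise ValueError("requested height lies behind the camera")
    return origin + s * ray_world

# Toy setup: camera 1 m above the world origin, looking straight down.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_world = np.array([[1.0, 0.0, 0.0, 0.0],
                        [0.0, -1.0, 0.0, 0.0],
                        [0.0, 0.0, -1.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
point = unproject_pixel_at_height(380.0, 240.0, 0.2, K, T_cam_world)
```

If the repo stores `T_cam_world` in the opposite direction (world to camera), invert it before calling.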

Gripper open/close and end-effector rotation are predicted by indexing
the same DINOv3 patch feature map at the chosen pixel — small MLP heads
on top of a 384-dim feature vector.
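To make the indexing concrete, here is a NumPy sketch of "look up the patch feature under the chosen pixel, then run a small head on it". The 16-pixel patch size, the 14x14 feature grid, the hidden width, and the random weights are all assumptions for illustration; the real heads live in the model file and may differ.

```python
import numpy as np

PATCH = 16      # assumed ViT patch size in pixels
FEAT_DIM = 384  # feature width stated in the text above

def feature_at_pixel(feat_map, u, v, patch=PATCH):
    """Return the patch feature covering pixel (u, v); feat_map is (H_p, W_p, C)."""
    row = int(np.clip(v // patch, 0, feat_map.shape[0] - 1))
    col = int(np.clip(u // patch, 0, feat_map.shape[1] - 1))
    return feat_map[row, col]

def mlp_head(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU; stands in for a gripper or rotation head."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
# Stand-in for the backbone output on a 224x224 image with 16 px patches.
feat_map = rng.standard_normal((14, 14, FEAT_DIM))
w1, b1 = rng.standard_normal((FEAT_DIM, 64)) * 0.05, np.zeros(64)
w2, b2 = rng.standard_normal((64, 1)) * 0.05, np.zeros(1)

feat = feature_at_pixel(feat_map, u=100, v=37)  # falls in patch row 2, col 6
gripper_logit = mlp_head(feat, w1, b1, w2, b2)
gripper_closed = bool(gripper_logit[0] > 0.0)
```

The point of the design is that the heads never see a global descriptor: they only see the feature at the location the heatmap picked.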

This pixel-aligned framing is the "ours" we keep mentioning in PARA
papers; the real-Panda experiments in this repo are about validating that
the same idea works outside simulation.

## How the pieces fit together

```
                                  ┌──────────────────┐
                                  │   RealSense RGB  │
                                  └────────┬─────────┘
                                           │
                                           ▼
                                ┌──────────────────────┐
                                │   PARA model         │
                                │  (DINOv3 + heatmap   │
                                │   volume + heads)    │
                                └────────┬─────────────┘
                                         │  (u, v, h, gripper, rot) × N_WINDOW
                                         ▼
                                ┌──────────────────────┐
                                │  Unproject (u, v, h) │   ← needs K (intrinsics) and
                                │  to 3D world point   │     T_cam_world (hand-eye calib)
                                └────────┬─────────────┘
                                         │  (x, y, z)_world × N_WINDOW
                                         ▼
                                ┌──────────────────────┐
                                │  MuJoCo damped IK    │
                                │  (fixed EEF rotation │
                                │   for now)           │
                                └────────┬─────────────┘
                                         │  q1..q7 × N_WINDOW
                                         ▼
                                ┌──────────────────────┐
                                │   rosbridge publish  │
                                │   /gello/joint_states│
                                └────────┬─────────────┘
                                         ▼
                                       Panda
```
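The IK box in the diagram is a damped least-squares solve. The sketch below shows that update rule on a hypothetical planar 3-joint arm rather than the Panda, so it runs without MuJoCo; the link lengths, damping value, and function names are illustrative assumptions, and only position is tracked (matching the "fixed EEF rotation" note in the diagram).

```python
import numpy as np

LINKS = np.array([0.4, 0.3, 0.2])  # link lengths of a hypothetical planar arm

def fk(q):
    """End-effector (x, y) of the planar arm for joint angles q."""
    angles = np.cumsum(q)
    return np.array([np.sum(LINKS * np.cos(angles)),
                     np.sum(LINKS * np.sin(angles))])

def jacobian(q):
    """2x3 position Jacobian of fk with respect to q."""
    angles = np.cumsum(q)
    J = np.zeros((2, len(q)))
    for j in range(len(q)):
        # Joint j rotates every link from j onward.
        J[0, j] = -np.sum(LINKS[j:] * np.sin(angles[j:]))
        J[1, j] = np.sum(LINKS[j:] * np.cos(angles[j:]))
    return J

def damped_ik(target, q0, damping=0.05, iters=200, tol=1e-6):
    """Iterate dq = J^T (J J^T + damping^2 I)^-1 err until the error is small."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        err = target - fk(q)
        if np.linalg.norm(err) < tol:
            break
        J = jacobian(q)
        # The damping term keeps the solve well-behaved near singularities.
        q += J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)
    return q

q_start = np.array([0.0, 1.0, 1.0])
target = np.array([0.45, 0.5])
q_sol = damped_ik(target, q_start)
```

On the real arm the same update would use the 7-joint Jacobian from the simulator model, and each target in the window can be warm-started from the previous solution.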

## Why hand-eye calibration is the linchpin

`(u, v, h)` is meaningless to the robot. It needs to become a 3D world
point, which requires `K` and `T_cam_world`. `K` depends only on the camera,
so a standard intrinsic calibration from a few images of a calibration target
gives it to you. `T_cam_world` you cannot get that way: it depends on where
the camera is bolted relative to the robot's base, and that changes every
time the rig is re-mounted.

Hand-eye calibration solves it cheaply: you mount an ArUco board on the
robot's gripper, drive the arm through a handful of poses, and from the
board detections at each pose you jointly solve for:

- `T_cam_world`  — where the camera is in the robot's world frame.
- `T_hand_board` — where the board sits on the gripper (lets you sanity
  check; the joint solver doesn't need this a priori).

`hand_eye_calib/calibrate.py` is the reference implementation. It does
the standard Tsai initialization, then a nonlinear refinement on both
unknowns simultaneously, with no privileged sim information used.
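The constraint that ties the two unknowns together can be sketched as a residual: at every arm pose, the board pose the camera detects should match the board pose predicted by chaining the camera pose, the forward-kinematics hand pose, and the board mount. The code below only evaluates that residual on synthetic data; it is not the solver in `calibrate.py`. `T_world_hand` (the hand pose from forward kinematics) and all helper names are hypothetical, and `T_cam_world` is assumed to map camera-frame points into the world frame.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def make_T(R, t):
    """Assemble a 4x4 rigid transform from a rotation and a translation."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def handeye_residual(T_cam_world, T_hand_board, hand_poses, board_detections):
    """Mean mismatch between detected and predicted board poses in the camera frame."""
    errs = []
    for T_world_hand, T_cam_board in zip(hand_poses, board_detections):
        predicted = np.linalg.inv(T_cam_world) @ T_world_hand @ T_hand_board
        # If the guess is right, inv(detected) @ predicted is the identity.
        errs.append(np.linalg.norm(np.linalg.inv(T_cam_board) @ predicted - np.eye(4)))
    return float(np.mean(errs))

# Synthetic ground truth: a camera pose, a board mount, and five arm poses.
true_cam = make_T(rot_x(np.pi) @ rot_z(0.3), [0.5, 0.1, 1.0])
true_board = make_T(rot_z(0.2), [0.0, 0.0, 0.05])
hand_poses = [make_T(rot_z(0.4 * i) @ rot_x(0.1 * i), [0.4 + 0.05 * i, 0.0, 0.3])
              for i in range(5)]
detections = [np.linalg.inv(true_cam) @ T @ true_board for T in hand_poses]

err_true = handeye_residual(true_cam, true_board, hand_poses, detections)

# Shift the camera guess by 2 cm and the residual becomes clearly non-zero.
wrong_cam = true_cam.copy()
wrong_cam[:3, 3] += [0.02, 0.0, 0.0]
err_wrong = handeye_residual(wrong_cam, true_board, hand_poses, detections)
```

A joint refinement would minimize a residual of this kind over both `T_cam_world` and `T_hand_board`, starting from the closed-form initialization.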

If `T_cam_world` is wrong, **everything else here is silently wrong**:
the projected EEF pixel that you train against will drift across the
dataset, and at deploy time the model's "go to that pixel" will lift to
the wrong 3D point. Visualize the calibration result before you record a
dataset with it.
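One cheap way to do that check is to project a world point you trust (for example the EEF position from forward kinematics) into the image and compare it with where the gripper actually appears. The sketch below shows the projection and a pixel error; the function name, the toy `K` and pose, and the "observed" pixel are made-up values, and `T_cam_world` is again assumed to map camera-frame points into the world frame.

```python
import numpy as np

def project_world_point(p_world, K, T_cam_world):
    """Project a 3D world point to pixel coordinates (u, v)."""
    # Bring the point into the camera frame (inverse of the camera pose).
    p_cam = np.linalg.inv(T_cam_world) @ np.append(p_world, 1.0)
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    uvw = K @ p_cam[:3]
    return uvw[:2] / uvw[2]

# Toy setup: camera 1 m above the world origin, looking straight down.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_world = np.array([[1.0, 0.0, 0.0, 0.0],
                        [0.0, -1.0, 0.0, 0.0],
                        [0.0, 0.0, -1.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])

eef_world = np.array([0.08, 0.0, 0.2])   # e.g. from forward kinematics
uv = project_world_point(eef_world, K, T_cam_world)

observed_uv = np.array([383.0, 244.0])   # hypothetical pixel picked by eye
reproj_error_px = float(np.linalg.norm(uv - observed_uv))
```

Drawing `uv` on a few frames across the workspace is usually enough to spot a bad calibration: a good one keeps the marker on the gripper everywhere, a bad one drifts.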

## What's currently calibrated vs not

`panda_streaming/data_panda_para.py` ships a hardcoded `T_CAM_WORLD` and
`CAM_K` that worked for one historical capture session. **Do not assume
they're right for your camera/mount.** Stage 1 of `TASKS.md` explicitly
asks you to re-run hand-eye and replace these values (or load them from a
JSON the dataset reads at init).
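If you go the JSON route, a loader along these lines keeps bad values from slipping in silently. This is a sketch under assumptions: the file name, the key names `"K"` and `"T_cam_world"`, and the function name are hypothetical and not something the repo defines.

```python
import json
import os
import tempfile

import numpy as np

def load_calibration(path):
    """Load K (3x3) and T_cam_world (4x4) from JSON and sanity-check them."""
    with open(path) as f:
        data = json.load(f)
    K = np.asarray(data["K"], dtype=float)
    T_cam_world = np.asarray(data["T_cam_world"], dtype=float)
    if K.shape != (3, 3) or T_cam_world.shape != (4, 4):
        raise ValueError("expected a 3x3 K and a 4x4 T_cam_world")
    # A rigid transform has an orthonormal, right-handed rotation block
    # and a bottom row of [0, 0, 0, 1].
    R = T_cam_world[:3, :3]
    is_rotation = np.allclose(R @ R.T, np.eye(3), atol=1e-4) and np.linalg.det(R) > 0
    if not is_rotation or not np.allclose(T_cam_world[3], [0.0, 0.0, 0.0, 1.0]):
        raise ValueError("T_cam_world is not a rigid transform")
    return K, T_cam_world

# Round trip through a temporary file to show the expected layout.
calib = {
    "K": [[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]],
    "T_cam_world": np.eye(4).tolist(),
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "calibration.json")
    with open(path, "w") as f:
        json.dump(calib, f)
    K, T_cam_world = load_calibration(path)
```

Storing the calibration next to each recorded dataset also makes it obvious which capture session a given `T_cam_world` belongs to.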

## Why `N_WINDOW` matters

The model predicts 6 future timesteps jointly (`N_WINDOW = 6` in
`para/model.py`; some places have 12 — check the constant where you are).
At deploy time we execute the whole window open-loop before re-querying.
This trades a little reactivity for a lot of stability — the heatmap
predictions smooth out across the window, and IK has time to plan a
graceful path. If you want closed-loop behavior, either drop `N_WINDOW`
to 1 or keep the window and re-query the model after every executed step.
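The open-loop versus closed-loop trade-off can be sketched as a loop with one knob: how many targets from each predicted window you execute before re-querying. Everything here is illustrative. `predict_window` is a stub standing in for model plus unprojection plus IK, the joint names are assumed, and the message body assumes the topic carries a `sensor_msgs/JointState`-style message; only the `op` / `topic` / `msg` envelope comes from the rosbridge protocol.

```python
import json

N_WINDOW = 6
JOINT_NAMES = [f"panda_joint{i}" for i in range(1, 8)]  # assumed joint names

def joint_state_publish_msg(q, topic="/gello/joint_states"):
    """Build a rosbridge 'publish' operation carrying seven joint targets."""
    return json.dumps({
        "op": "publish",
        "topic": topic,
        "msg": {"name": JOINT_NAMES, "position": [float(x) for x in q]},
    })

def run_episode(predict_window, send, n_steps, execute_per_query):
    """Send n_steps targets, re-querying after execute_per_query of them."""
    if execute_per_query < 1:
        raise ValueError("execute_per_query must be at least 1")
    sent = queries = 0
    while sent < n_steps:
        window = predict_window()  # list of joint-target vectors
        if not window:
            raise ValueError("predict_window returned no targets")
        queries += 1
        for q in window[:execute_per_query]:
            if sent == n_steps:
                break
            send(joint_state_publish_msg(q))
            sent += 1
    return queries

def fake_predict_window():
    """Stub: returns N_WINDOW dummy joint vectors instead of running the model."""
    return [[0.1 * k] * 7 for k in range(N_WINDOW)]

open_loop_msgs, closed_loop_msgs = [], []
# Open-loop: run the whole window before asking the model again.
queries_open = run_episode(fake_predict_window, open_loop_msgs.append, 12, N_WINDOW)
# Closed-loop: execute only the first target of each window.
queries_closed = run_episode(fake_predict_window, closed_loop_msgs.append, 12, 1)
```

In a real deployment `send` would be the websocket client's send method, and the closed-loop setting costs one model query per step instead of one per window.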

## Where the LIBERO version lives

The simulation counterpart of this code (the LIBERO benchmark) lives at
`/data/cameron/para/libero/` in the parent repo. The model file we use
here is a copy of `/data/cameron/para/libero/model.py`. If you change
the model, decide whether the change should propagate to LIBERO and
keep them in sync; otherwise diverge them deliberately.