# TASKS — Onboarding roadmap for Runhao

Goal: end with a Panda picking up a bowl from a PARA policy *you* trained
on data *you* collected, with a camera you calibrated yourself. The four
stages below are sequential: each one produces an artifact (a saved image
or video) that shows the stage works and that the earlier stages still hold.

> **Read [`docs/always_visualize.md`](docs/always_visualize.md) first.** Every
> stage below ends in "save a visualization." That's not optional.

---

## Stage 1 — Render the robot mask onto the live RealSense image

**Output:** a side-by-side PNG (or short MP4) showing
`real RGB | MuJoCo robot render | overlay`. The overlay should align with
the real arm's silhouette to within a few pixels.

**Why this comes first:** you can't validate calibration, recordings, IK,
or the policy without first being able to project the robot into the
image. Once the overlay aligns, almost everything downstream that's wrong
will be visible *immediately*.
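
"Projecting the robot into the image" comes down to a pinhole projection with the calibrated extrinsics and intrinsics. The sketch below is illustrative only: it assumes `T_cam_world` is a 4×4 transform that takes world-frame points into the camera frame (check which direction `calibrate.py` actually returns), assumes no lens distortion, and uses placeholder values for `K_demo` and `T_demo`.

```python
import numpy as np

def project_points(points_world, T_cam_world, K):
    """Project Nx3 world-frame points to Nx2 pixels with a pinhole model.

    Assumption: T_cam_world maps world coordinates into the camera frame
    (OpenCV convention, +z forward). Lens distortion is ignored.
    """
    pts = np.asarray(points_world, dtype=float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # Nx4 homogeneous points
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]         # Nx3 in the camera frame
    uvw = (K @ pts_cam.T).T                            # Nx3 before the divide
    return uvw[:, :2] / uvw[:, 2:3], pts_cam[:, 2]     # pixels, depth along +z

# Placeholder calibration, not real values.
K_demo = np.array([[600.0, 0.0, 320.0],
                   [0.0, 600.0, 240.0],
                   [0.0, 0.0, 1.0]])
T_demo = np.eye(4)
T_demo[:3, 3] = [0.0, 0.0, 1.0]  # world origin sits 1 m in front of the camera

pixels, depth = project_points([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]], T_demo, K_demo)
print(pixels, depth)
```

If the overlay appears mirrored or far off-screen, the first thing to try is inverting the transform, since the two conventions (`world → camera` vs. camera pose in world) are easy to mix up.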

**What to do:**

1. Run hand-eye calibration to get a current `T_cam_world` and `CAM_K`.
   - Mount the 4×2 ArUco board on the Panda hand (uses
     `ExoConfigs/panda_exo_handeye_4x2.py`).
   - Use `panda_streaming/hand_eye_calib/command_calib_poses.py` to drive
     the arm through a set of calibration poses.
   - Use `panda_streaming/hand_eye_calib/calibrate.py` as a reference
     for the joint solver (see its `solve_hand_eye` function — it does the
     unprivileged-info version: PnP per frame → joint nonlinear solve for
     `T_cam_world` and `T_hand_board`).
   - **Visualize:** save `aruco_reproject_check.png` (markers reprojected
     on the image) and `mujoco_overlay_check.png` (MuJoCo render of the
     calibrated camera vs the real frame). The `hand_eye_calib/` folder
     already has examples of what these should look like.

2. Plug the calibrated `T_cam_world` and `CAM_K` into a streaming script
   that:
   - Subscribes to `/joint_states` over rosbridge.
   - Sets the MuJoCo Panda to those joints.
   - Renders a segmentation mask from the calibrated camera pose.
   - Alpha-blends the mask onto the live RealSense frame.

   Start from `panda_streaming/stream_panda_with_cam.py` — it already does
   the live-feed + MuJoCo overlay loop using ArUco-based pose estimation.
   Replace the per-frame ArUco pose with your fixed calibrated extrinsics
   for a cleaner overlay.

3. **Acceptance:** save a 5-second MP4 of the overlay while the arm
   moves through several poses (free-hand or via teleop). The mask should
   track the arm. Drop it in `panda_streaming/checkpoints/stage1_mask_overlay.mp4`
   and reference it from `panda_streaming/CLAUDE.md`.
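
The blending and side-by-side steps can be sketched with plain NumPy. This is a minimal sketch, not the code in `stream_panda_with_cam.py`: it assumes the MuJoCo segmentation render has already been reduced to an `H×W` array that is nonzero on robot pixels, and `blend_mask` and `side_by_side` are hypothetical helper names.

```python
import numpy as np

def blend_mask(frame_rgb, mask, color=(0, 255, 0), alpha=0.5):
    """Alpha-blend a solid color onto frame_rgb wherever mask is nonzero."""
    frame = frame_rgb.astype(np.float32)
    tint = np.asarray(color, dtype=np.float32)
    on = mask.astype(bool)
    out = frame.copy()
    out[on] = (1.0 - alpha) * frame[on] + alpha * tint
    return np.clip(out, 0, 255).astype(np.uint8)

def side_by_side(real_rgb, render_rgb, overlay_rgb):
    """Concatenate three same-sized images horizontally: real | render | overlay."""
    return np.concatenate([real_rgb, render_rgb, overlay_rgb], axis=1)

# Tiny synthetic stand-ins for a camera frame and a robot mask.
frame = np.full((4, 6, 3), 100, dtype=np.uint8)
mask = np.zeros((4, 6), dtype=np.uint8)
mask[1:3, 2:4] = 1

overlay = blend_mask(frame, mask)
render = np.zeros_like(frame)
render[mask.astype(bool)] = 255      # white silhouette as the "render" panel
strip = side_by_side(frame, render, overlay)
print(strip.shape)
```

Keeping `alpha` around 0.5 leaves the real arm visible under the tint, which makes a few pixels of misalignment easier to spot than an opaque mask.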

---

## Stage 2 — Collect a dataset and visualize masks + keypoints

**Output:** a directory of episodes under `/data/cameron/panda_data/<run>/`,
plus a visualization MP4 per episode showing the recorded RGB with:

- the **robot mask** overlay from MuJoCo (proves calibration is still
  good across the whole recording),
- the **EEF keypoint** projected onto the image (white dot),
- the **ground projection** + height line (cyan ring + yellow line — see
  `train_panda_para.build_wandb_strip` for the convention).
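
The geometry behind the keypoint, ground ring, and height line can be sketched as below. This is not the code in `train_panda_para.build_wandb_strip`; it assumes world `z` points up, the table is the plane `z = table_z`, and `T_cam_world` maps world points into the camera frame. The calibration values are placeholders (a camera 1 m above the origin, looking straight down).

```python
import numpy as np

def project(point_world, T_cam_world, K):
    """Pinhole projection of one 3D world point to a (u, v) pixel."""
    p_cam = T_cam_world[:3, :3] @ point_world + T_cam_world[:3, 3]
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def eef_overlay_geometry(eef_world, T_cam_world, K, table_z=0.0):
    """Return the pixels needed to draw the EEF dot, ground ring, and height line."""
    eef_world = np.asarray(eef_world, dtype=float)
    ground_world = eef_world.copy()
    ground_world[2] = table_z                  # drop the EEF straight down to the table
    return {
        "eef_px": project(eef_world, T_cam_world, K),        # white dot
        "ground_px": project(ground_world, T_cam_world, K),  # cyan ring center
        "height_m": float(eef_world[2] - table_z),           # length of the yellow line in 3D
    }

# Placeholder calibration: camera at world (0, 0, 1), looking down the -z axis.
K_demo = np.array([[600.0, 0.0, 320.0],
                   [0.0, 600.0, 240.0],
                   [0.0, 0.0, 1.0]])
T_demo = np.eye(4)
T_demo[:3, :3] = np.diag([1.0, -1.0, -1.0])
T_demo[:3, 3] = [0.0, 0.0, 1.0]

geom = eef_overlay_geometry([0.2, 0.0, 0.5], T_demo, K_demo)
print(geom)
```

The height line is then the 2D segment from `ground_px` to `eef_px`; it collapses to a point when the gripper touches the table.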

**What to do:**

1. Record a session of bowl pick-up demos:
   ```bash
   cd panda_streaming
   python simple_dataset_record_panda.py <stream_host> <stream_port>
   ```
   Aim for ~20 demos at first. Vary the bowl position; keep the camera
   fixed.

2. Parse the raw recording into episodes:
   ```bash
   python parse_video_into_episodes_panda.py -i scratch/rgb_joints_capture_panda_<TS>
   ```
   Mark start/end frames per demo. Output goes to `parsed_<run>/`.

3. Pre-cache 448×448 images and add to the data viewer (Cameron's habit:
   masks → 448 cache → data viewer; see Cameron's memory note on the
   panda dataset pipeline). The training data loader expects:
   - `cached_448/<frame>.jpg`
   - `<frame>.npy` (joint states, 7 arm + 1 gripper)
   - `episodes.json` (start/end frame indices per episode)
   - calibrated `T_CAM_WORLD` + `CAM_K` (currently hardcoded in
     `data_panda_para.py` — replace with your calibration).

4. Use `vis_dataset_gt.py` (or a small script of your own) to render the
   visualization MP4 per episode. Confirm by eye that:
   - the robot mask sits on the arm in *every* frame,
   - the EEF keypoint sits on the gripper tip,
   - the height line shortens as the gripper approaches the table.

5. **Acceptance:** check in one example MP4 to
   `panda_streaming/checkpoints/stage2_episode_overlay.mp4` (small clip,
   not the whole dataset).
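
When keypoints are drawn on the cached 448×448 images, `CAM_K` has to be rescaled to match. The sketch below assumes the cache is a plain resize of the full frame (no crop or padding; check how the caching script actually produces the images) and ignores half-pixel offset conventions. `scale_intrinsics` is a hypothetical helper and the 1920×1080 intrinsics are placeholders.

```python
import numpy as np

def scale_intrinsics(K, src_wh, dst_wh):
    """Rescale a 3x3 pinhole matrix for an image resized from src_wh to dst_wh.

    Assumption: plain resize, so pixel coordinates scale linearly per axis.
    """
    sx = dst_wh[0] / src_wh[0]
    sy = dst_wh[1] / src_wh[1]
    return np.diag([sx, sy, 1.0]) @ K

# Placeholder full-resolution intrinsics, not a real calibration.
K_full = np.array([[1000.0, 0.0, 960.0],
                   [0.0, 1000.0, 540.0],
                   [0.0, 0.0, 1.0]])
K_448 = scale_intrinsics(K_full, src_wh=(1920, 1080), dst_wh=(448, 448))
print(K_448)
```

With a non-square source frame the two focal lengths end up different after the resize, which is expected: the cached image is stretched, and the intrinsics have to describe that stretch.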

---

## Stage 3 — Reproduce a recorded trajectory via IK

**Output:** a side-by-side comparison for one recorded episode. From the
calibrated camera, render both:

- the recorded MuJoCo state (from `/joint_states`), and
- the IK-recovered state (from EEF 3D world poses → damped least-squares
  IK in MuJoCo).

Then verify that:

- the **3D EEF keypoints** match (sub-mm mean error),
- the **2D projected pixels** match (sub-pixel mean error).

**Why this comes before training:** at deploy time we run IK from the
model's predicted 3D point. If IK from the *ground-truth* 3D point can't
reproduce the trajectory, the whole pipeline is broken and the model
won't help. This stage isolates the kinematics chain.
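
For intuition, damped least-squares IK repeatedly applies the update `dq = Jᵀ (J Jᵀ + λ² I)⁻¹ e`, where `e` is the remaining EEF position error. The sketch below runs that update on a toy planar 2-link arm. It is not the repo's `mujoco_ik`: the link lengths, damping value, and function names are all hypothetical, and a real Panda solve also has to handle orientation and joint limits.

```python
import numpy as np

LINKS = np.array([0.4, 0.3])  # hypothetical link lengths in metres

def fk(q):
    """Forward kinematics of a planar 2-link arm: joint angles -> (x, y)."""
    a1, a12 = q[0], q[0] + q[1]
    return np.array([LINKS[0] * np.cos(a1) + LINKS[1] * np.cos(a12),
                     LINKS[0] * np.sin(a1) + LINKS[1] * np.sin(a12)])

def jacobian(q):
    """Analytic 2x2 position Jacobian of fk with respect to q."""
    a1, a12 = q[0], q[0] + q[1]
    return np.array([
        [-LINKS[0] * np.sin(a1) - LINKS[1] * np.sin(a12), -LINKS[1] * np.sin(a12)],
        [LINKS[0] * np.cos(a1) + LINKS[1] * np.cos(a12), LINKS[1] * np.cos(a12)],
    ])

def dls_ik(target, q0, damping=0.05, iters=200, tol=1e-6):
    """Iterate damped least-squares steps until the position error is below tol."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        err = target - fk(q)
        if np.linalg.norm(err) < tol:
            break
        J = jacobian(q)
        # dq = J^T (J J^T + lambda^2 I)^-1 err
        q += J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)
    return q

q_gt = np.array([0.6, -0.9])          # "recorded" joints
target = fk(q_gt)                      # ground-truth EEF position from FK
q_ik = dls_ik(target, q0=[0.3, -0.5])  # recover joints from the EEF position
err_mm = 1000.0 * np.linalg.norm(fk(q_ik) - target)
print(f"recovery error: {err_mm:.6f} mm")
```

Note that `q_ik` need not equal `q_gt` joint for joint, since IK can have several solutions; what Stage 3 checks is that the EEF positions and their projections agree.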

**What to do:**

1. Use `panda_streaming/test_ik_recovery.py` as your starting point. It
   already loads the dataset, FKs the recorded joints to get GT EEF poses,
   then runs `mujoco_ik` with `FIXED_EEF_ROT` to recover joint states from
   the GT poses, and compares.

2. Render two MuJoCo views per frame:
   - "GT" using recorded joints,
   - "IK" using IK-recovered joints from the recorded EEF pose.

   Then project the EEF from each view and overlay both on the recorded RGB.

3. Compute and log:
   - 3D position error: `||IK_eef_pos - GT_eef_pos||` per frame, mean + max.
   - 2D pixel error: `||proj(IK_eef) - proj(GT_eef)||` at IMG_W×IMG_H and
     at the cached 448×448 resolution.

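4. **Acceptance:** save `panda_streaming/checkpoints/stage3_ik_recovery.png`
   showing the comparison plus error stats. Mean 3D error should be
   sub-mm; mean 2D error should be sub-pixel at 1080p. If not, go back
   instead of moving forward: check the IK setup first (the IK-vs-GT error
   does not depend on the camera), then recheck the Stage 1 calibration if
   the projected EEF does not sit on the gripper in the recorded RGB.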
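
The error statistics from step 3 can be computed along these lines. It is a minimal sketch that assumes positions are in metres and that per-frame arrays are stacked as `N×3` (3D) and `N×2` (pixels); `recovery_errors` is a hypothetical helper and the arrays below are synthetic.

```python
import numpy as np

def recovery_errors(gt_xyz, ik_xyz, gt_px, ik_px):
    """Per-frame IK-vs-GT errors, summarized as mean and max."""
    pos_err_mm = 1000.0 * np.linalg.norm(ik_xyz - gt_xyz, axis=1)  # metres -> mm
    px_err = np.linalg.norm(ik_px - gt_px, axis=1)
    return {
        "pos_mean_mm": float(pos_err_mm.mean()),
        "pos_max_mm": float(pos_err_mm.max()),
        "px_mean": float(px_err.mean()),
        "px_max": float(px_err.max()),
    }

# Synthetic three-frame example.
gt_xyz = np.zeros((3, 3))
ik_xyz = np.array([[0.0003, 0.0, 0.0],
                   [0.0, 0.0004, 0.0],
                   [0.0, 0.0, 0.0]])
gt_px = np.full((3, 2), 100.0)
ik_px = np.array([[100.3, 100.4],
                  [100.0, 100.0],
                  [100.0, 100.5]])

stats = recovery_errors(gt_xyz, ik_xyz, gt_px, ik_px)
print(stats)
```

Reporting the max alongside the mean matters here: a single frame where IK jumps to a different solution can hide inside a good average.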

---

## Stage 4 — Train PARA and pick up the bowl

**Output:** a video of the Panda picking up a bowl using
`deploy_ik_sequence.py` driving from a PARA checkpoint you trained.

**What to do:**

1. **Train.** From `panda_streaming/`:
   ```bash
   CUDA_VISIBLE_DEVICES=<free_gpu> MUJOCO_GL=egl \
   DINO_REPO_DIR=/data/cameron/keygrip/dinov3 \
   DINO_WEIGHTS_PATH=/data/cameron/.cache/torch/hub/checkpoints/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth \
   python train_panda_para.py \
     --data_dir /data/cameron/panda_data/<your_run> \
     --run_name panda_para_bowl_v1 \
     --epochs 500 --batch_size 4
   ```
   Watch wandb. The `vis/train_strip` and `vis/val_strip` images show
   per-timestep heatmaps + GT/pred keypoints — once those align with the
   gripper consistently across the val set, you're ready to deploy.

2. **Deploy.** From `panda_streaming/`:
   ```bash
   MUJOCO_GL=egl python deploy_ik_sequence.py \
     --checkpoint checkpoints/panda_para_bowl_v1/best.pth \
     --bowl_pose <wherever you put it>
   ```
   `deploy_ik_sequence.py` runs the model on the current frame, gets a
   12-step (`N_WINDOW`) trajectory of `(u, v, h)` + gripper, unprojects to
   3D, runs IK to joint states, and publishes them to
   `/gello/joint_states`.

3. **Acceptance:** an MP4 of the Panda actually picking up the bowl, plus
   a wandb run URL. Drop the MP4 at
   `panda_streaming/checkpoints/stage4_bowl_pickup.mp4`.
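
The "unprojects to 3D" step in deployment can be pictured as a ray-plane intersection. The sketch below is one plausible reading, not the code in `deploy_ik_sequence.py`: it assumes `h` is the EEF height above the world plane `z = 0`, world `z` points up, and `T_cam_world` maps world points into the camera frame. The calibration values are placeholders (a camera 1 m above the origin, looking straight down).

```python
import numpy as np

def unproject_uvh(u, v, h, T_cam_world, K):
    """Return the world point on the pixel ray through (u, v) whose world z equals h."""
    R, t = T_cam_world[:3, :3], T_cam_world[:3, 3]
    cam_center = -R.T @ t                               # camera origin in the world frame
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction in the camera frame
    ray_world = R.T @ ray_cam                           # same ray in the world frame
    if abs(ray_world[2]) < 1e-9:
        raise ValueError("pixel ray is parallel to the height plane")
    s = (h - cam_center[2]) / ray_world[2]              # scale where the ray hits z == h
    if s <= 0:
        raise ValueError("intersection is behind the camera")
    return cam_center + s * ray_world

# Placeholder calibration, not real values.
K_demo = np.array([[600.0, 0.0, 320.0],
                   [0.0, 600.0, 240.0],
                   [0.0, 0.0, 1.0]])
T_demo = np.eye(4)
T_demo[:3, :3] = np.diag([1.0, -1.0, -1.0])
T_demo[:3, 3] = [0.0, 0.0, 1.0]

point_world = unproject_uvh(560.0, 240.0, 0.5, T_demo, K_demo)
print(point_world)
```

A useful round-trip check before deploying is to project a recorded EEF position to `(u, v)`, take its height as `h`, and confirm that unprojecting returns the original 3D point.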

---

## How to report progress on each stage

Use the fleet `Report` helper (see
`/data/cameron/agents_stuff/shared/REPORT_FORMAT.md`). Every report needs:

1. Summary
2. Training data (sample images/video)
3. Test setup (sample frames showing what's different)
4. Results (table + at least one success and one failure clip)
5. Analysis (failure modes)
6. Next steps & concerns
7. Reproducibility (exact command)

Drop reports under `/data/cameron/para/.agents/reports/<your_agent>/`.