# Experiment Log — Augmentation for Viewpoint & Spatial Generalization

## Overview

This document logs all experiments run for the CSCI 567 project investigating whether data augmentations can improve ACT's robustness to viewpoint and spatial changes in a LIBERO pick-and-place task.

**Task:** Pick up a black bowl and place it on a plate.  
**Model:** ACT (Action Chunking with Transformers) — DINOv2 or ResNet-18 backbone → global feature → MLP → 3D position prediction.  
**Training data:** 64 demos at default viewpoint (unless noted), 10-30 min training.  
**Eval:** 8x8 rotation grid (theta/phi) and/or 5x5 translation grid (dx/dy), 1-3 episodes per viewpoint.

---

## Key Code Changes

### `data.py`
- Added `augment` parameter to `CachedTrajectoryDataset.__init__()`: supports `"none"`, `"perspective"`, `"crop"`, `"all"`
- Augmentation applied 50% of the time (`np.random.random() < 0.5`)
- Added `aug_matrix` to batch output for visualization
- Crop mode: pure translation (±80px shift, no resize) via `warpPerspective`
- Perspective mode: random H+V perspective warp (±0.15 strength)
- All mode: rotation (±10°) + shear (±0.10) + perspective (±0.10) + crop (±35px, 85% frac)
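
The crop mode and the 50% gate described above can be sketched as follows. This is a minimal illustration, not the code in `data.py`: `sample_aug_matrix`, `max_shift_px`, and the use of a seeded `numpy` generator (instead of `np.random.random()`) are assumptions made for the example. The returned 3x3 matrix plays the role of `aug_matrix`; in the real pipeline it would be passed to `warpPerspective` to warp the image.

```python
import numpy as np

def sample_aug_matrix(mode, rng, p=0.5, max_shift_px=80.0):
    """Return a 3x3 homography for one sample (hypothetical helper).

    The identity matrix means "no augmentation", so the same matrix can be
    stored in the batch whether or not the sample was augmented.
    """
    H = np.eye(3)
    if mode == "none" or rng.random() >= p:
        return H  # clean image, matching what the model sees at eval time
    if mode == "crop":
        # Pure translation: shift image content by (tx, ty) pixels, no resize.
        tx, ty = rng.uniform(-max_shift_px, max_shift_px, size=2)
        H[0, 2], H[1, 2] = tx, ty
        return H
    raise NotImplementedError(f"mode {mode!r} is not covered by this sketch")

rng = np.random.default_rng(0)
mats = [sample_aug_matrix("crop", rng) for _ in range(1000)]
# Fraction of samples that received a non-identity matrix (expected near 0.5).
frac_augmented = np.mean([not np.allclose(m, np.eye(3)) for m in mats])
```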

### `model_act.py`
- Added `drop_start_kp` flag: zeros out `start_keypoint_2d` input to remove viewpoint-dependent 2D pixel conditioning
- Added `backbone` parameter: `"dino"` (DINOv2 ViT-S/16) or `"resnet"` (ResNet-18 pretrained)
- Both saved in checkpoint and auto-loaded at eval time

### `train.py`
- Added `--augment`, `--drop_start_kp`, `--freeze_backbone`, `--backbone` args
- ACT-specific wandb visualization: GT and predicted 3D→2D keypoints with aug-aware projection
- `num_workers` reduced to 4 (from 16) to avoid shared memory exhaustion

### `eval.py` / `eval_multistage.py`
- Added `--cam_dx`, `--cam_dy`, `--cam_dz` for camera translation
- Added `_translate_camera()` function
- Auto-loads `drop_start_kp` and `backbone` from checkpoint
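
One plausible shape for a camera-translation helper is sketched below. The log does not show how `_translate_camera()` is implemented, so everything here is an assumption: the function name `translate_camera`, the camera-to-world rotation convention, and the choice to apply the offsets along the camera's own axes rather than the world axes.

```python
import numpy as np

def translate_camera(cam_pos, cam_rot, dx=0.0, dy=0.0, dz=0.0):
    """Shift a camera by (dx, dy, dz) metres along its local axes (sketch).

    cam_pos: (3,) camera position in world coordinates.
    cam_rot: (3, 3) camera-to-world rotation; its columns are assumed to be
             the camera's local x, y, z axes expressed in world coordinates.
    """
    cam_pos = np.asarray(cam_pos, dtype=float)
    cam_rot = np.asarray(cam_rot, dtype=float)
    # Rotate the local offset into the world frame, then add it to the position.
    offset_world = cam_rot @ np.array([dx, dy, dz], dtype=float)
    return cam_pos + offset_world

# Example: with an identity orientation the local and world axes coincide.
new_pos = translate_camera([1.0, 2.0, 3.0], np.eye(3), dx=0.075, dy=-0.05)
```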

### New scripts
- `eval_full_grid.py` — 8x8 rotation grid eval (theta/phi)
- `eval_translation_grid.py` — 5x5 translation grid eval (dx/dy)
- `eval_translation_grid_fast.py` — 3x3 fast translation eval (1 episode)
- `eval_translation_multistage_5x5.py` — 5x5 translation with miss/grasp/place scoring
- `eval_translation_multistage_fast.py` — 3x3 fast multistage
- `generate_ood_translation.py` — generate training data at translated camera positions
- `gen_rollout_grid.py` — generate 5x5 video grids from eval results
- `viz_augmentations.py` — render augmentation parameter sweep grid

---

## Bug Fixes

### 1. Keypoint conditioning mismatch (`start_keypoint_2d`)
**Problem:** ACT takes `start_keypoint_2d` (2D pixel position of EEF) as input. When image augmentation is applied, the EEF moves to a different pixel location, but `start_keypoint_2d` is not updated. The model receives contradictory information.  
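**Fix:** Added `--drop_start_kp` flag to zero out this input. Overall rotation-grid success improved by 4 percentage points (20% → 24%), though the two runs also differ in training time (10 vs. 20 min).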
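
The mismatch can be illustrated with a small sketch. The matrix and keypoint values are made up for the example, and `warp_keypoint` is a hypothetical helper; it assumes the forward-map convention in which image content at pixel `p` moves to `H @ p`.

```python
import numpy as np

def warp_keypoint(kp_xy, H):
    """Map a 2D pixel keypoint through a 3x3 homography (hypothetical helper)."""
    x, y = kp_xy
    u, v, w = H @ np.array([x, y, 1.0])
    return np.array([u / w, v / w])

# Illustrative augmentation: shift image content 40 px right and 25 px up.
H = np.array([[1.0, 0.0, 40.0],
              [0.0, 1.0, -25.0],
              [0.0, 0.0, 1.0]])
start_kp = np.array([120.0, 96.0])  # EEF pixel in the original image

# Where the EEF actually appears in the augmented image.
kp_in_augmented_image = warp_keypoint(start_kp, H)

# Feeding the stale keypoint alongside the warped image is off by this much.
mismatch_px = np.linalg.norm(kp_in_augmented_image - start_kp)

# Dropping the keypoint input replaces it with zeros, so no stale value is fed.
dropped_kp = np.zeros_like(start_kp)
```

Transforming the keypoint with the same matrix is the alternative to dropping it; the log only reports results for the drop variant.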

### 2. Visualization projection bug
**Problem:** Visualization projected 3D keypoints using `cam_extrinsic` + `cam_K_norm`, which gave wrong pixel coordinates. The correct projection matrix is `world_to_cam`.  
**Fix:** Changed visualization to use `world_to_camera` (full K*[R|t] projection matrix). Also added `aug_matrix` transform so keypoints track augmented image content.
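
A minimal sketch of this projection, assuming a 3x4 matrix of the form K[R|t] followed by the 3x3 augmentation matrix. The function name `project_points` and all numeric values (intrinsics, pose, point) are illustrative assumptions, not taken from the project code.

```python
import numpy as np

def project_points(P, pts_world, aug_matrix=None):
    """Project (N, 3) world points to pixels with a 3x4 projection matrix.

    If aug_matrix (3x3) is given, it is applied in homogeneous pixel space
    so the projected keypoints follow the augmented image content.
    """
    pts_world = np.asarray(pts_world, dtype=float)
    homog = np.hstack([pts_world, np.ones((len(pts_world), 1))])  # (N, 4)
    uvw = homog @ P.T                                             # (N, 3)
    if aug_matrix is not None:
        uvw = uvw @ aug_matrix.T
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

# Illustrative camera: focal length 100 px, principal point (64, 64),
# camera frame aligned with the world frame.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
P = K @ np.hstack([R, t[:, None]])  # full K[R|t] matrix

aug = np.array([[1.0, 0.0, 10.0],
                [0.0, 1.0, 5.0],
                [0.0, 0.0, 1.0]])   # translation-only augmentation

point = [[0.1, -0.2, 2.0]]
uv_clean = project_points(P, point)
uv_aug = project_points(P, point, aug_matrix=aug)
```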

### 3. Train/eval distribution mismatch
**Problem:** Augmentation applied 100% of the time during training → model never sees clean images → poor performance at eval (no augmentation).  
**Fix:** Changed to 50% augmentation probability.

---

## Experiment Results

### A. Rotation Grid (theta/phi) — Default Viewpoint Training

All models trained on 64 demos at default viewpoint, evaluated on 8x8 rotation grid (3 episodes each).

| Experiment | Backbone | Aug | Time | theta=0 | Overall | Notes |
|---|---|---|---|---|---|---|
| `act_baseline_ours` | DINOv2 | none | 10m | 58% | **20%** | Baseline with keypoint |
| `act_defvp_noaug_nokp` | DINOv2 | none | 20m | 62% | **24%** | Best single-VP (no kp) |
| `act_perspective_aug_v2` | DINOv2 | perspective | 10m | 54% | **21%** | Perspective only |
| `act_defvp_crop50_nokp` | DINOv2 | crop 50% | 20m | 88% | **16%** | Great in-dist, poor OOD |
| `act_defvp_all50_nokp` | DINOv2 | all 50% | 20m | 33% | **10%** | All augs too aggressive |
| `act_resnet_noaug_30m` | ResNet | none | 30m | 54% | **18%** | ResNet baseline |
| `act_resnet_crop50_30m` | ResNet | crop 50% | 30m | 50% | **12%** | Crop still hurts on rotation |

### B. Rotation Grid — All-Viewpoint Training (640 demos, DINOv2)

| Experiment | Aug | Time | theta=0 | Overall | Notes |
|---|---|---|---|---|---|
| `act_allvp_noaug` | none | 10m | 92% | **72%** | Best rotation result |
| `act_allvp_persp` | perspective | 10m | 29% | **26%** | Aug hurts with real data |
| `act_allvp_persp_long` | perspective | 3hr | 8% | **6%** | Overfitting |
| `act_allvp_crop_20m` | crop | 20m | 12% | **12%** | Crop also hurts |

### C. Rotation Grid — Multi-VP Translation Training + Eval on Rotation

Models trained on 50 translation demos (5 positions), evaluated on rotation grid.

| Experiment | Aug | Overall | theta=7.1 | theta=10.7 | theta=21.4 |
|---|---|---|---|---|---|
| `act_resnet_multivp_crop_rot` | crop 50% | **19%** | 33% | 8% | 0% |
| `act_resnet_multivp_allaug_rot` | all 50% | **11%** | **37%** | **17%** | **8%** |
| `act_resnet_multivp_noaug_rot` | none | **6%** | 12% | 4% | 0% |

Note: All-aug is best at mid-range rotation (theta 7-21) despite training only on translation data, even though crop scores higher overall (19% vs. 11%).

### D. Translation Grid (dx/dy) — Single-Viewpoint Training

5x5 grid (±15cm horizontal, ±10cm vertical), fixed object position. Binary eval.

| Experiment | Backbone | Aug | Overall |
|---|---|---|---|
| `trans_noaug_fixed` | DINOv2 | none | 8% |
| `trans_crop50_fixed` | DINOv2 | crop 50% | 3% |
| `trans_resnet_bigcrop` | ResNet | big crop (65%) | **13%** |
| `trans_resnet_translate` | ResNet | pure translate ±80px | 5% |
| `trans_resnet_noaug_fixed` | ResNet | none | 8% |
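
The 5x5 offset grid used for these evals can be enumerated as below. This is a sketch: `symmetric_offsets` is a hypothetical helper, and how each `(dx, dy)` pair is passed to the eval scripts (for example via `--cam_dx` / `--cam_dy`) is an assumption.

```python
def symmetric_offsets(half_range, n):
    """Return n evenly spaced offsets in metres from -half_range to +half_range."""
    step = 2 * half_range / (n - 1)
    # Round to suppress floating-point noise such as 0.05000000000000002.
    return [round(-half_range + i * step, 6) for i in range(n)]

dx_values = symmetric_offsets(0.15, 5)  # horizontal, ±15 cm
dy_values = symmetric_offsets(0.10, 5)  # vertical, ±10 cm

# One (dx, dy) camera offset per grid cell, 25 cells in total.
grid = [(dx, dy) for dx in dx_values for dy in dy_values]
```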

### E. Translation Grid — Multi-Stage Scoring (miss/grasp/place)

5x5 grid, 1 episode each. PLACE = full success, GRASP = bowl lifted but not placed, MISS = no grasp. Grasp+ = Place + Grasp (the bowl was at least grasped).

| Experiment | Place | Grasp | Miss | Grasp+ |
|---|---|---|---|---|
| **Single-VP, No Aug** | 3 | 4 | 18 | 7 |
| **Single-VP, Big Crop** | 0 | 5 | 20 | 5 |
| **Single-VP, Crop→NoAug Curriculum** | **2** | **9** | 14 | **11** |
| **Multi-VP + Crop** | **2** | **8** | 15 | **10** |
| **Multi-VP + All Aug** | 0 | 5 | 20 | 5 |
| **Multi-VP + No Aug** | 0 | 2 | 23 | 2 |

### F. Multi-Stage Translation Grids (best models)

**Crop→NoAug Curriculum (11 grasp+):**
```
dx\dy   -0.100  -0.050   0.000   0.050   0.100
-0.150   miss    miss    GRASP   miss    miss
-0.075   GRASP   PLACE   GRASP   GRASP   miss
 0.000   GRASP   GRASP   PLACE   GRASP   GRASP
 0.075   miss    miss    miss    GRASP   miss
 0.150   miss    miss    miss    miss    miss
```

**Multi-VP + Crop (10 grasp+):**
```
dx\dy   -0.100  -0.050   0.000   0.050   0.100
-0.150   miss    miss    GRASP   miss    miss
-0.075   PLACE   PLACE   GRASP   GRASP   miss
 0.000   GRASP   GRASP   miss    miss    miss
 0.075   miss    GRASP   miss    miss    miss
 0.150   miss    miss    GRASP   miss    GRASP
```

**Single-VP No Aug (7 grasp+):**
```
dx\dy   -0.100  -0.050   0.000   0.050   0.100
-0.150   miss    miss    miss    miss    miss
-0.075   GRASP   PLACE   miss    GRASP   miss
 0.000   miss    GRASP   PLACE   PLACE   GRASP
 0.075   miss    miss    miss    miss    miss
 0.150   miss    miss    miss    miss    miss
```
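
The summary counts can be recomputed from a grid dump like the ones above. The sketch below parses the Crop→NoAug curriculum grid; `tally_grid` is a hypothetical helper, and grasp+ is taken to mean place plus grasp.

```python
from collections import Counter

# Raw string so the backslash in the "dx\dy" header is kept literally.
CURRICULUM_GRID = r"""
dx\dy   -0.100  -0.050   0.000   0.050   0.100
-0.150   miss    miss    GRASP   miss    miss
-0.075   GRASP   PLACE   GRASP   GRASP   miss
 0.000   GRASP   GRASP   PLACE   GRASP   GRASP
 0.075   miss    miss    miss    GRASP   miss
 0.150   miss    miss    miss    miss    miss
"""

def tally_grid(text):
    """Count place/grasp/miss outcomes in a whitespace-separated grid dump."""
    rows = [line.split() for line in text.strip().splitlines()]
    # Skip the header row and the leading dx label in each data row.
    outcomes = [cell.lower() for row in rows[1:] for cell in row[1:]]
    counts = Counter(outcomes)
    return {
        "place": counts["place"],
        "grasp": counts["grasp"],
        "miss": counts["miss"],
        "grasp_plus": counts["place"] + counts["grasp"],
    }

summary = tally_grid(CURRICULUM_GRID)
print(summary)  # {'place': 2, 'grasp': 9, 'miss': 14, 'grasp_plus': 11}
```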

---

## Datasets

| Dataset | Location | Demos | Viewpoints | Description |
|---|---|---|---|---|
| Default VP | `/data/libero/ood_objpos_v3_splits/exp4_n64_train` | 64 | 1 (default) | Varied object positions |
| All Rotation VP | `/data/libero/ood_viewpoint_v3` | 640 | 64 (8x8 theta/phi) | Spherical cap grid |
| Translation VP | `/data/libero/ood_translation_v1` | 50 | 5 (center + 4 corners) | Camera translated ±10cm H, ±7.5cm V |

---

## Key Takeaways

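1. **Real multi-view data >> augmentation.** Training on 640 demos spanning 64 rotation viewpoints gave 72% overall, while the best single-viewpoint model reached 24% (no augmentation, no keypoint) and the best augmented single-viewpoint model reached 21% (perspective).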

2. **Crop augmentation helps spatial (translation) generalization** when paired with real multi-view data (10 vs 2 grasp+), but hurts rotation generalization when used alone.

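3. **Curriculum learning works:** Crop→NoAug training achieves the best single-viewpoint translation robustness by grasp+ count (11 vs. 7 without augmentation), although full placements did not improve (2 vs. 3) and each grid cell is a single episode.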

4. **Each augmentation helps its own distribution.** Crop helps with translations. Perspective/shear helps with rotations. Neither transfers well to the other type.

5. **Val loss does not predict sim success.** Longer training with lower val loss often performs worse in actual sim evaluation.

6. **Removing 2D keypoint conditioning (`--drop_start_kp`) improves OOD generalization** (20% → 24% overall on the rotation grid) and avoids the keypoint/image mismatch that augmentation introduces.
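
The Crop→NoAug curriculum from takeaway 3 could be scheduled along these lines. The log does not record when the switch happened or how it was implemented, so `augment_mode_for` and `switch_frac` are hypothetical, and the 50% switch point is only a placeholder.

```python
def augment_mode_for(elapsed_min, total_min, switch_frac=0.5):
    """Pick the augmentation mode for the current point in training (sketch).

    Uses crop augmentation for the first switch_frac of the time budget,
    then clean images for the remainder.
    """
    if not 0.0 < switch_frac < 1.0:
        raise ValueError("switch_frac must be strictly between 0 and 1")
    return "crop" if elapsed_min < switch_frac * total_min else "none"

# Example: a 30-minute budget with the placeholder switch at the halfway mark.
schedule = [(m, augment_mode_for(m, 30)) for m in (0, 10, 15, 25)]
```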

---

## Reproduction

```bash
cd /data/cameron/567_augmentation_viewpoint_project
export PYTHONPATH=/data/cameron/LIBERO:$PYTHONPATH
export DINO_REPO_DIR=/data/cameron/keygrip/dinov3
export DINO_WEIGHTS_PATH=/data/cameron/keygrip/dinov3/weights/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth
export LIBERO_DATA_PATH=/data/libero

# Train (example: ResNet + crop aug on translation data); replace X with a GPU index
CUDA_VISIBLE_DEVICES=X python train.py --model_type act \
    --run_name my_experiment \
    --benchmark libero_spatial --task_id 0 \
    --cache_root /data/libero/ood_translation_v1 \
    --batch_size 8 --lr 1e-4 --max_minutes 30 \
    --skip_rotation --vis_every_steps 100 \
    --augment crop --drop_start_kp --backbone resnet \
    --wandb_project 567_viewpoint --wandb_mode online

# Eval rotation grid
python eval_full_grid.py checkpoints/my_experiment/best.pth my_experiment_rot GPU_ID

# Eval translation grid (multi-stage)
python eval_translation_multistage_5x5.py checkpoints/my_experiment/best.pth my_experiment_trans GPU_ID

# Generate data at translated viewpoints
python generate_ood_translation.py --demos_per_view 10 --out_root /data/libero/ood_translation_v1
```
