# ACT Viewpoint Generalization with Augmentations

## Project Goal

Train an ACT (Action Chunking with Transformers) model on robot manipulation data from one or more camera viewpoints, then evaluate generalization to unseen viewpoints. The key research question: **can data augmentations improve ACT's robustness to viewpoint changes?**

Baseline (no augmentations): ACT trained at the default viewpoint drops from **56%** success to **28%** when evaluated across the full 8x8 viewpoint grid. Your goal is to improve on 28% using augmentations.

---

## 1. Setup — Copy Required Files

Copy the relevant training/eval code into your own project folder. **Do not modify the originals.**

```bash
# Copy the files you'll need
cp /data/cameron/para_normalized_losses/libero/model_act.py /data/cameron/567_augmentation_viewpoint_project/
cp /data/cameron/para_normalized_losses/libero/train.py /data/cameron/567_augmentation_viewpoint_project/
cp /data/cameron/para_normalized_losses/libero/eval.py /data/cameron/567_augmentation_viewpoint_project/
cp /data/cameron/para_normalized_losses/libero/data.py /data/cameron/567_augmentation_viewpoint_project/
cp /data/cameron/para_normalized_losses/libero/model.py /data/cameron/567_augmentation_viewpoint_project/
cp /data/cameron/para_normalized_losses/libero/utils.py /data/cameron/567_augmentation_viewpoint_project/

# You'll also need these for creating splits and visualizations
cp /data/cameron/para_normalized_losses/libero/create_viewpoint_splits.py /data/cameron/567_augmentation_viewpoint_project/
cp /data/cameron/para_normalized_losses/libero/generate_ood_viewpoint.py /data/cameron/567_augmentation_viewpoint_project/
```

### Environment

```bash
# Always set these before running anything
export PYTHONPATH=/data/cameron/LIBERO:$PYTHONPATH
export DINO_REPO_DIR=/data/cameron/keygrip/dinov3
export DINO_WEIGHTS_PATH=/data/cameron/keygrip/dinov3/weights/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth
export LIBERO_DATA_PATH=/data/libero
```

---

## 2. Datasets

### Object Position Dataset (default viewpoint, varied positions)

This is the main training dataset — 256 demos at the default camera viewpoint with object positions spanning a 39cm x 60cm grid across the table.

```
Location: /data/libero/ood_objpos_v3/libero_spatial/task_0/
Structure: 256 demos (demo_0 through demo_255)
Grid:      16x16, dx=[-0.40, -0.01], dy=[-0.30, +0.30]
Viewpoint: Default agentview camera (fixed)
Scene:     Clean (no distractors, no furniture)
```

### Viewpoint Dataset (varied viewpoints + varied positions)

640 demos across 64 viewpoints (8x8 spherical cap grid) with random object positions.

```
Location: /data/libero/ood_viewpoint_v3/libero_spatial/task_0/
Structure: 640 demos (demo_0 through demo_639)
Grid:      8x8 viewpoints (θ: 0° to 25°, φ: 0° to 315°)
           10 demos per viewpoint, each with a random object position
Positions: dx ∈ [-0.40, -0.01], dy ∈ [-0.30, +0.30] (randomly sampled)
Scene:     Clean
```

### Per-Demo File Structure

Each demo directory contains:
```
demo_NNN/
├── frames/           # 448x448 RGB images (000000.png, 000001.png, ...)
│                     # ~32 frames per trajectory
├── eef_pos.npy       # (T, 3) float32 — EEF world-frame XYZ positions
├── eef_quat.npy      # (T, 4) float32 — EEF quaternion orientation
├── gripper.npy       # (T,) float32 — gripper state (-1=open, +1=closed)
├── pix_uv.npy        # (T, 2) float32 — EEF pixel projection [col, row]
├── cam_extrinsic.npy # (4, 4) float32 — camera extrinsic matrix
├── cam_K_norm.npy    # (3, 3) float32 — normalized camera intrinsics
├── world_to_cam.npy  # (4, 4) float32 — world-to-camera transform
├── base_z.npy        # scalar float32 — table height (0.912)
├── actions.npy       # (T, 7) float32 — zeros (not used, servo replay)
└── meta.npz          # viewpoint dataset only: vi, di, theta_idx, phi_idx, dx, dy
```
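
A minimal sketch for loading one demo's arrays under this layout (illustrative; the actual training loader lives in `data.py`):

```python
import numpy as np
from pathlib import Path

def load_demo(demo_dir):
    """Load one demo's per-timestep arrays; frames stay on disk as paths."""
    d = Path(demo_dir)
    traj = {k: np.load(d / f"{k}.npy")
            for k in ["eef_pos", "eef_quat", "gripper", "pix_uv"]}
    traj["frame_paths"] = sorted((d / "frames").glob("*.png"))
    # All per-timestep arrays share the same trajectory length T
    T = traj["eef_pos"].shape[0]
    assert all(traj[k].shape[0] == T
               for k in ["eef_quat", "gripper", "pix_uv"])
    return traj
```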

### Viewpoint Grid Metadata

```python
import numpy as np
meta = np.load("/data/libero/ood_viewpoint_v3/libero_spatial/task_0/viewpoint_meta.npz")
print(meta['thetas_deg'])   # [ 0.   3.6  7.1 10.7 14.3 17.9 21.4 25. ]
print(meta['phis_deg'])     # [  0.  45.  90. 135. 180. 225. 270. 315.]
print(meta['n_views'])      # 8
print(meta['demos_per_view'])  # 10
# demo_idx = viewpoint_index * demos_per_view + demo_within_viewpoint
# viewpoint_index = theta_idx * n_views + phi_idx
```
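
The two index formulas can be wrapped in a small helper when sanity-checking splits (illustrative, not part of the existing code):

```python
def demo_to_view(demo_idx, n_views=8, demos_per_view=10):
    """Invert demo_idx -> (theta_idx, phi_idx, demo_within_viewpoint)."""
    vi, di = divmod(demo_idx, demos_per_view)       # viewpoint index, demo within it
    theta_idx, phi_idx = divmod(vi, n_views)        # grid coordinates
    return theta_idx, phi_idx, di

# demo_523 -> theta_idx 6 (θ≈21.4°), phi_idx 4 (φ=180°), demo 3 at that view
```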

---

## 3. Existing Checkpoint (Baseline — No Augmentations)

ACT model trained on 64 object positions at the default viewpoint:

```
Checkpoint: /data/cameron/para_normalized_losses/libero/checkpoints/act_v2_exp4_n64/best.pth
Training:   64 demos from /data/libero/ood_objpos_v3_splits/exp4_n64_train/
Viewpoint:  Default agentview only
Positions:  64 positions from 8x8 evenly-spaced grid
Duration:   10 minutes
```

**Baseline results** (evaluated across full 8x8 viewpoint grid):
- Overall: **28%** (53/192, 3 episodes per viewpoint)
- θ=0° (training viewpoint): 67%
- θ=3.6°: 50%
- θ=7.1°: 50%
- θ=10.7°: 21%
- θ=14.3°: 12%
- θ=17.9°: 12%
- θ=21.4°: 4%
- θ=25.0°: 6%
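
As a quick consistency check, averaging the per-θ rows recovers the overall number:

```python
per_theta = [67, 50, 50, 21, 12, 12, 4, 6]   # % success at each θ row above
overall = sum(per_theta) / len(per_theta)     # 27.75, rounds to 28%
```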

---

## 4. Training

### Basic Training Command

```bash
cd /data/cameron/567_augmentation_viewpoint_project

CUDA_VISIBLE_DEVICES=X python train.py \
    --model_type act \
    --run_name act_my_experiment \
    --benchmark libero_spatial \
    --task_id 0 \
    --cache_root /data/libero/ood_objpos_v3 \
    --batch_size 8 \
    --lr 1e-4 \
    --epochs 9999 \
    --max_minutes 10 \
    --skip_rotation \
    --vis_every_steps 0 \
    --wandb_project 567_viewpoint \
    --wandb_mode online
```

Key arguments:
- `--cache_root`: path to dataset (the `libero_spatial/task_0/` structure must be inside)
- `--max_minutes 10`: time-capped training (10 min is enough for these dataset sizes)
- `--skip_rotation`: disables rotation prediction (rotation is near-constant for this task)
- `--model_type act`: uses the ACT model (CLS token regression)

### Training on Viewpoint Subsets

Create splits using symlinks (see Section 7), then point `--cache_root` at the split:

```bash
# Example: train on left hemisphere viewpoints
CUDA_VISIBLE_DEVICES=X python train.py \
    --model_type act \
    --run_name act_left_hemi \
    --cache_root /data/libero/ood_viewpoint_v3_splits/vp_left_hemi_train \
    --batch_size 8 --lr 1e-4 --max_minutes 10 \
    --skip_rotation --vis_every_steps 0
```

### Where to Add Augmentations

The image preprocessing happens in `data.py` in `CachedTrajectoryDataset.__getitem__()`. The raw RGB frame is loaded around line 355:

```python
rgb = cv2.imread(str(frames_dir / f"{frame_idx:06d}.png"))
```

Add your augmentations here (random crops, color jitter, viewpoint simulation, etc.) before the frame is used for training. **Important**: if you augment the image, you may need to update `pix_uv` (the pixel projection) correspondingly if using spatial augmentations.

For the ACT model specifically, the augmentation should be applied to the image that gets fed to `model_act.py`'s forward pass. The key insight: ACT's targets are world-frame 3D positions (viewpoint-invariant), so image augmentations that simulate viewpoint changes should help without needing to change the targets.
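
For instance, any geometric augmentation must move `pix_uv` with the image. Below is a pure-numpy sketch of a random translation with a matching keypoint update (a hypothetical helper, not in `data.py`; a real pipeline might use crops or homographies instead):

```python
import numpy as np

def random_shift(rgb, pix_uv, rng, max_frac=0.08):
    """Translate the image by random (dx, dy) pixels, zero-padding the
    exposed border, and shift the EEF keypoint by the same offset.
    rgb: (H, W, 3) uint8; pix_uv: (2,) float32 [col, row] in pixels."""
    h, w = rgb.shape[:2]
    mx, my = int(w * max_frac), int(h * max_frac)
    dx = int(rng.integers(-mx, mx + 1))
    dy = int(rng.integers(-my, my + 1))
    out = np.zeros_like(rgb)
    # Copy the overlapping window between source and shifted destination.
    out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        rgb[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return out, pix_uv + np.array([dx, dy], dtype=np.float32)

# Usage sketch inside __getitem__, after the cv2.imread call:
#   rgb, kp = random_shift(rgb, kp, self.rng)
```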

---

## 5. Evaluation

### Basic Eval at Default Viewpoint

```bash
CUDA_VISIBLE_DEVICES=X python eval.py \
    --model_type act \
    --checkpoint checkpoints/act_my_experiment/best.pth \
    --benchmark libero_spatial \
    --task_id 0 \
    --n_episodes 10 \
    --teleport \
    --zero_rotation \
    --clean_scene \
    --max_steps 600 \
    --shift_dx 0.0509 \
    --shift_dy -0.2063 \
    --out_dir eval_output/my_experiment
```

**Critical eval arguments:**
- `--teleport`: closed-loop servo to predicted 3D targets (essential — open-loop fails)
- `--zero_rotation`: zeroes rotation deltas (rotation prediction is disabled)
- `--clean_scene`: removes furniture and distractor objects
- `--shift_dx 0.0509 --shift_dy -0.2063`: shifts objects to match training position (centering offset). **Always include these** for viewpoint experiments.
- `--max_steps 600`: episode timeout
- `--save_video`: add this flag to save mp4 videos of rollouts

### Eval at a Specific Viewpoint

Use `--cam_theta` and `--cam_phi` to reposition the camera:

```bash
# Evaluate at theta=15 degrees, phi=90 degrees
CUDA_VISIBLE_DEVICES=X python eval.py \
    --model_type act \
    --checkpoint checkpoints/act_my_experiment/best.pth \
    --benchmark libero_spatial --task_id 0 \
    --n_episodes 5 \
    --teleport --zero_rotation --clean_scene --max_steps 600 \
    --shift_dx 0.0509 --shift_dy -0.2063 \
    --cam_theta 15.0 --cam_phi 90.0 \
    --out_dir eval_output/theta15_phi90 \
    --save_video
```

### Eval Across Full Viewpoint Grid

Script to evaluate at all 64 viewpoints:

```python
import subprocess, numpy as np

n_views = 8
thetas = np.linspace(0, 25, n_views)
phis = np.linspace(0, 360*(1-1/n_views), n_views)
center_dx, center_dy = 0.0509, -0.2063
rng = np.random.RandomState(42)

checkpoint = "checkpoints/act_my_experiment/best.pth"
results = {}

for ti, theta in enumerate(thetas):
    for pi, phi in enumerate(phis):
        vi = ti * n_views + pi
        # Random object position for each viewpoint
        dx = rng.uniform(-0.40, -0.01)
        dy = rng.uniform(-0.30, 0.30)
        shift_dx = center_dx + dx
        shift_dy = center_dy + dy

        cmd = (f"python eval.py --model_type act --checkpoint {checkpoint} "
               f"--benchmark libero_spatial --task_id 0 --n_episodes 3 "
               f"--teleport --zero_rotation --clean_scene --max_steps 600 "
               f"--shift_dx {shift_dx} --shift_dy {shift_dy} "
               f"--cam_theta {theta} --cam_phi {phi} "
               f"--out_dir /tmp/eval_grid_{vi}")

        result = subprocess.run(cmd.split(), capture_output=True, text=True)
        for line in result.stdout.split('\n'):
            if 'Success Rate' in line:
                rate = float(line.split(':')[1].strip().replace('%','')) / 100
                results[vi] = {"theta": theta, "phi": phi, "rate": rate}
                print(f"  vi={vi} θ={theta:.1f}° φ={phi:.1f}° → {rate*100:.0f}%")

# Print grid
print("\nθ\\φ   " + "  ".join(f"{p:5.0f}" for p in phis))
for ti, theta in enumerate(thetas):
    row = f"{theta:5.1f}"
    for pi in range(n_views):
        vi = ti * n_views + pi
        if vi in results:
            row += f"  {results[vi]['rate']*100:4.0f}%"
        else:
            row += "    --"
    print(row)
```

---

## 6. ACT Model Architecture Details

**Input** (concatenated, fed to MLP):
- DINOv3 ViT-S+/16 CLS token: 384-dim (encodes full image; the `dinov3_vits16plus` weights from Setup)
- Start keypoint: 2-dim (current EEF pixel position, normalized to [0,1])
- Current EEF position: 3-dim (world-frame XYZ, normalized to [0,1] via dataset min/max)
- Current gripper state: 1-dim (normalized to [0,1])
- CLIP task embedding: 384-dim (projected from 512-dim)

**Output**:
- Position MLP: (B, 4, 3) — next 4 timesteps, 3D world-frame position (sigmoid → [0,1])
- Rotation MLP: (B, 4, 3) — sigmoid → [0,1], ignored (skip_rotation)
- Gripper MLP: (B, 4) — raw logits, thresholded at 0

**Prediction space**: World/robot frame (NOT camera space). Targets are absolute EEF (x, y, z) positions normalized to [0,1] using per-axis min/max from the training dataset. This means the targets are inherently viewpoint-invariant — the same 3D position has the same target regardless of camera angle.

**Position normalization**:
```python
MIN_POS, MAX_POS  # per-axis [x_min, y_min, z_min], [x_max, y_max, z_max]
# Computed from training data and saved in checkpoint
normalized = (world_pos - MIN_POS) / (MAX_POS - MIN_POS)  # → [0, 1]
```
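
At inference time, the sigmoid outputs must be mapped back to world coordinates by inverting this. A sketch with placeholder bounds (the real `MIN_POS`/`MAX_POS` are the ones saved in the checkpoint):

```python
import numpy as np

# Placeholder bounds for illustration; load the real ones from the checkpoint.
MIN_POS = np.array([-0.50, -0.50, 0.80], dtype=np.float32)
MAX_POS = np.array([0.50, 0.50, 1.40], dtype=np.float32)

def denormalize(pred):
    """Invert the training normalization: [0,1] network output -> world XYZ."""
    return pred * (MAX_POS - MIN_POS) + MIN_POS
```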

---

## 7. Creating Viewpoint Train/Test Splits

Splits are directories of symlinks pointing into the base dataset:

```python
import os, shutil
from pathlib import Path

data_root = Path("/data/libero/ood_viewpoint_v3")
splits_root = Path("/data/cameron/567_augmentation_viewpoint_project/splits")
splits_root.mkdir(exist_ok=True)

n_views = 8
demos_per_view = 10

def create_split(name, viewpoint_indices):
    split_dir = splits_root / f"{name}_train" / "libero_spatial" / "task_0"
    if split_dir.exists():
        shutil.rmtree(split_dir)
    split_dir.mkdir(parents=True)
    new_idx = 0
    for vi in sorted(viewpoint_indices):
        for di in range(demos_per_view):
            real_idx = vi * demos_per_view + di
            src = data_root / "libero_spatial" / "task_0" / f"demo_{real_idx}"
            dst = split_dir / f"demo_{new_idx}"
            os.symlink(str(src.resolve()), str(dst))
            new_idx += 1
    print(f"  {name}: {len(viewpoint_indices)} viewpoints, {new_idx} demos")

# Example splits:
# Left hemisphere: phi_idx 0-3
left = [vi for vi in range(64) if (vi % n_views) in [0,1,2,3]]
create_split("left_hemi", left)

# Inner hemisphere: theta_idx 0-3
inner = [vi for vi in range(64) if (vi // n_views) in [0,1,2,3]]
create_split("inner_hemi", inner)

# All viewpoints
create_split("all", list(range(64)))
```

---

## 8. Viewpoint Distribution Visualizations

Generate polar plot + image grid showing train vs test viewpoint distributions:

```python
import cv2, numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

n_views = 8
thetas = np.linspace(0, 25, n_views)
phis = np.linspace(0, 360*(1-1/n_views), n_views)

def make_polar_plot(train_vis, test_vis, title, out_path):
    """
    train_vis: list of viewpoint indices used for training
    test_vis:  list of viewpoint indices used for testing
    """
    fig, ax = plt.subplots(1, 1, figsize=(4, 4), subplot_kw=dict(projection='polar'))
    fig.patch.set_facecolor('#1a1a1a')
    ax.set_facecolor('#1a1a1a')

    train_labeled = test_labeled = False
    for vi in range(64):
        ti, pi = vi // n_views, vi % n_views
        theta, phi = thetas[ti], phis[pi]
        is_train = vi in train_vis
        color = '#66ff66' if is_train else '#6496ff'
        label = None
        if is_train and not train_labeled:
            label = f'Train (n={len(train_vis)})'; train_labeled = True
        elif not is_train and not test_labeled:
            label = f'Test (n={len(test_vis)})'; test_labeled = True
        ax.scatter(np.radians(phi), theta, c=color, s=100,
                  edgecolors='white', linewidths=0.5, label=label)

    ax.set_ylim(0, 30)
    ax.set_yticks([0, 5, 10, 15, 20, 25])
    ax.set_yticklabels(['0°','5°','10°','15°','20°','25°'], fontsize=8, color='white')
    ax.set_title(title, color='white', fontsize=11, pad=15)
    ax.tick_params(colors='white')
    ax.grid(True, alpha=0.3, color='gray')
    ax.legend(fontsize=7, facecolor='#2a2a2a', edgecolor='gray', labelcolor='white')
    plt.tight_layout()
    fig.savefig(out_path, dpi=150, bbox_inches='tight', facecolor='#1a1a1a')
    plt.close()
    print(f"Saved: {out_path}")

# Example: default → all
train = [vi for vi in range(64) if (vi // n_views) == 0]  # theta=0 only
test = [vi for vi in range(64) if (vi // n_views) != 0]
make_polar_plot(train, test, "Default → All Viewpoints", "default_to_all.png")
```

---

## 9. Quick Start — Reproduce Baseline

```bash
cd /data/cameron/567_augmentation_viewpoint_project

export PYTHONPATH=/data/cameron/LIBERO:$PYTHONPATH
export DINO_REPO_DIR=/data/cameron/keygrip/dinov3
export DINO_WEIGHTS_PATH=/data/cameron/keygrip/dinov3/weights/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth

# Step 1: Evaluate existing baseline checkpoint at a few viewpoints
for theta in 0 10 20; do
    CUDA_VISIBLE_DEVICES=4 python eval.py \
        --model_type act \
        --checkpoint /data/cameron/para_normalized_losses/libero/checkpoints/act_v2_exp4_n64/best.pth \
        --benchmark libero_spatial --task_id 0 --n_episodes 5 \
        --teleport --zero_rotation --clean_scene --max_steps 600 \
        --shift_dx 0.0509 --shift_dy -0.2063 \
        --cam_theta $theta --cam_phi 0 \
        --out_dir eval_baseline/theta${theta} --save_video
    echo "theta=$theta done"
done

# Step 2: Train your own ACT model (with your augmentations added to data.py)
CUDA_VISIBLE_DEVICES=4 python train.py \
    --model_type act \
    --run_name act_with_augmentations \
    --benchmark libero_spatial --task_id 0 \
    --cache_root /data/libero/ood_objpos_v3 \
    --batch_size 8 --lr 1e-4 --max_minutes 10 \
    --skip_rotation --vis_every_steps 0 \
    --wandb_project 567_viewpoint --wandb_mode online

# Step 3: Evaluate your model at OOD viewpoints
CUDA_VISIBLE_DEVICES=4 python eval.py \
    --model_type act \
    --checkpoint checkpoints/act_with_augmentations/best.pth \
    --benchmark libero_spatial --task_id 0 --n_episodes 5 \
    --teleport --zero_rotation --clean_scene --max_steps 600 \
    --shift_dx 0.0509 --shift_dy -0.2063 \
    --cam_theta 15 --cam_phi 90 \
    --out_dir eval_augmented/theta15_phi90 --save_video
```

---

## 10. Full Reproduction: Default → All Viewpoints Report

This section walks through reproducing the complete experiment end-to-end: train at default viewpoint, eval across the full 8x8 grid, and generate a report with tables, visualizations, and eval video grids.

### Step 1: Train (or use existing baseline)

```bash
cd /data/cameron/567_augmentation_viewpoint_project
export PYTHONPATH=/data/cameron/LIBERO:$PYTHONPATH
export DINO_REPO_DIR=/data/cameron/keygrip/dinov3
export DINO_WEIGHTS_PATH=/data/cameron/keygrip/dinov3/weights/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth

# Train on default viewpoint with 64 diverse object positions
CUDA_VISIBLE_DEVICES=4 python train.py \
    --model_type act \
    --run_name act_baseline \
    --benchmark libero_spatial --task_id 0 \
    --cache_root /data/libero/ood_objpos_v3_splits/exp4_n64_train \
    --batch_size 8 --lr 1e-4 --max_minutes 10 \
    --skip_rotation --vis_every_steps 0 \
    --wandb_project 567_viewpoint --wandb_mode online

# Or use existing: /data/cameron/para_normalized_losses/libero/checkpoints/act_v2_exp4_n64/best.pth
```

### Step 2: Eval across full viewpoint grid

```python
#!/usr/bin/env python3
"""eval_full_grid.py — Evaluate at all 64 viewpoints, save results + videos."""
import subprocess, numpy as np, json, sys
from pathlib import Path

# Config
CHECKPOINT = "checkpoints/act_baseline/best.pth"  # your checkpoint
EXPERIMENT_NAME = "act_baseline"
N_EPISODES = 3          # per viewpoint
GPU = 4
SAVE_VIDEOS = True

n_views = 8
thetas = np.linspace(0, 25, n_views)
phis = np.linspace(0, 360*(1-1/n_views), n_views)
center_dx, center_dy = 0.0509, -0.2063
dx_min, dx_max = -0.40, -0.01
dy_min, dy_max = -0.30, 0.30

rng = np.random.RandomState(42)
results = {}
out_root = Path(f"results/{EXPERIMENT_NAME}")
out_root.mkdir(parents=True, exist_ok=True)

for vi in range(64):
    ti, pi = vi // n_views, vi % n_views
    theta, phi = thetas[ti], phis[pi]

    # Random object position
    dx = rng.uniform(dx_min, dx_max)
    dy = rng.uniform(dy_min, dy_max)
    shift_dx = center_dx + dx
    shift_dy = center_dy + dy

    vid_flag = "--save_video" if SAVE_VIDEOS else ""
    cmd = (f"CUDA_VISIBLE_DEVICES={GPU} python eval.py --model_type act "
           f"--checkpoint {CHECKPOINT} "
           f"--benchmark libero_spatial --task_id 0 --n_episodes {N_EPISODES} "
           f"--teleport --zero_rotation --clean_scene --max_steps 600 "
           f"--shift_dx {shift_dx} --shift_dy {shift_dy} "
           f"--cam_theta {theta} --cam_phi {phi} "
           f"--out_dir {out_root}/vp_{vi} {vid_flag}")

    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    rate = 0
    for line in result.stdout.split('\n'):
        if 'Success Rate' in line:
            rate = float(line.split(':')[1].strip().replace('%','')) / 100

    results[vi] = {"theta": float(theta), "phi": float(phi),
                   "rate": rate, "ti": ti, "pi": pi}
    print(f"  [{vi+1}/64] θ={theta:.1f}° φ={phi:.1f}° → {rate*100:.0f}%")
    sys.stdout.flush()

# Save results JSON
with open(out_root / "grid_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Print table
print("\n=== Results Grid ===")
print("θ\\φ   " + "  ".join(f"{p:5.0f}" for p in phis))
total_s, total_n = 0, 0
per_theta = {}
for ti, theta in enumerate(thetas):
    row = f"{theta:5.1f}"
    rates = []
    for pi in range(n_views):
        vi = ti * n_views + pi
        rate = results[vi]["rate"]
        total_s += rate * N_EPISODES
        total_n += N_EPISODES
        rates.append(rate)
        row += f"  {rate*100:4.0f}%"
    per_theta[theta] = np.mean(rates)
    print(row)
print(f"\nOverall: {total_s/total_n*100:.0f}%")
print("Per-θ: " + "  ".join(f"{t:.0f}°={r*100:.0f}%" for t,r in per_theta.items()))
```

### Step 3: Generate train/test distribution visualization

```python
#!/usr/bin/env python3
"""gen_distribution_viz.py — Polar plot + sample frames for train vs test."""
import cv2, numpy as np, os, sys
sys.path.insert(0, "/data/cameron/LIBERO")
os.environ.setdefault("LIBERO_DATA_PATH", "/data/libero")
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from scipy.spatial.transform import Rotation as ScipyR
from libero.libero import benchmark as bm_lib, get_libero_path
from libero.libero.envs import OffScreenRenderEnv
import h5py

EXPERIMENT_NAME = "act_baseline"
n_views = 8
thetas = np.linspace(0, 25, n_views)
phis = np.linspace(0, 360*(1-1/n_views), n_views)

# --- Polar plot ---
train_vis = [vi for vi in range(64) if (vi // n_views) == 0]  # theta=0
test_vis = [vi for vi in range(64) if (vi // n_views) != 0]

fig, ax = plt.subplots(1, 1, figsize=(4, 4), subplot_kw=dict(projection='polar'))
fig.patch.set_facecolor('#1a1a1a')
ax.set_facecolor('#1a1a1a')
tl = fl = False
for vi in range(64):
    ti, pi = vi // n_views, vi % n_views
    is_train = vi in train_vis
    color = '#66ff66' if is_train else '#6496ff'
    label = None
    if is_train and not tl: label = f'Train (n={len(train_vis)})'; tl = True
    elif not is_train and not fl: label = f'Test (n={len(test_vis)})'; fl = True
    ax.scatter(np.radians(phis[pi]), thetas[ti], c=color, s=100,
              edgecolors='white', linewidths=0.5, label=label)
ax.set_ylim(0, 30)
ax.set_yticks([0, 5, 10, 15, 20, 25])
ax.set_yticklabels(['0°','5°','10°','15°','20°','25°'], fontsize=8, color='white')
ax.set_title('Default → All Viewpoints', color='white', fontsize=11, pad=15)
ax.tick_params(colors='white')
ax.grid(True, alpha=0.3, color='gray')
ax.legend(fontsize=7, facecolor='#2a2a2a', edgecolor='gray', labelcolor='white')
plt.tight_layout()
fig.savefig(f"results/{EXPERIMENT_NAME}/polar_plot.png", dpi=150,
            bbox_inches='tight', facecolor='#1a1a1a')
plt.close()
print("Saved polar plot")

# --- Sample frames from sim at train/test viewpoints ---
bench = bm_lib.get_benchmark_dict()["libero_spatial"]()
task = bench.get_task(0)
demo_path = os.path.join(get_libero_path("datasets"), bench.get_task_demonstration(0))
bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
with h5py.File(demo_path, "r") as f:
    init_state = f["data/demo_0/states"][0]

env = OffScreenRenderEnv(bddl_file_name=bddl_file, camera_heights=448, camera_widths=448,
                          camera_names=["agentview"])
env.seed(0); env.reset(); env.env.horizon = 100000
sim = env.env.sim

# Clean scene
for name in ["wooden_cabinet_1_main", "flat_stove_1_main"]:
    sim.model.body_pos[sim.model.body_name2id(name)] = np.array([0, 0, -5.0])
for dn in ["akita_black_bowl_2_main", "cookies_1_main", "glazed_rim_porcelain_ramekin_1_main"]:
    bid = sim.model.body_name2id(dn)
    for gid in range(sim.model.ngeom):
        if sim.model.geom_bodyid[gid] == bid:
            sim.model.geom_rgba[gid][3] = 0.0
sim.forward()

cam_id = sim.model.camera_name2id("agentview")
default_pos = sim.data.cam_xpos[cam_id].copy()
cam_xmat = sim.data.cam_xmat[cam_id].reshape(3, 3)
fwd = -cam_xmat[:, 2]
TABLE_Z = 0.90
t_hit = (TABLE_Z - default_pos[2]) / (fwd[2] + 1e-8)
look_at = default_pos + t_hit * fwd
radius = np.linalg.norm(default_pos - look_at)
default_dir = (default_pos - look_at) / radius
up = np.array([0, 0, 1.0])
right = np.cross(default_dir, up); right /= np.linalg.norm(right)
true_up = np.cross(right, default_dir)

bowl_i, plate_i = 10, 38
center_dx = -init_state[bowl_i]
center_dy = -init_state[bowl_i + 1]
DISTRACTOR_POS = np.array([10.0, 10.0, 0.9])
rng = np.random.RandomState(55)
thumb = 120

def capture(theta_deg, phi_deg, dx, dy):
    s = init_state.copy()
    s[bowl_i] += center_dx + dx; s[bowl_i+1] += center_dy + dy
    s[plate_i] += center_dx + dx; s[plate_i+1] += center_dy + dy
    for qps in [17, 24, 31]: s[qps:qps+3] = DISTRACTOR_POS
    env.set_init_state(s); sim.forward()
    env.env.timestep = 0; env.env.done = False
    theta, phi = np.radians(theta_deg), np.radians(phi_deg)
    offset = (np.sin(theta)*np.cos(phi)*right + np.sin(theta)*np.sin(phi)*true_up + np.cos(theta)*default_dir)
    new_pos = look_at + radius * offset
    f = look_at - new_pos; f /= (np.linalg.norm(f)+1e-12)
    cz = -f; uh = np.array([0.,0.,1.])
    if abs(np.dot(f,uh))>0.99: uh = np.array([0.,1.,0.])
    cx = np.cross(uh,cz); cx /= (np.linalg.norm(cx)+1e-12)
    cy = np.cross(cz,cx); R = np.stack([cx,cy,cz],axis=-1)
    q = ScipyR.from_matrix(R).as_quat()
    sim.model.cam_pos[cam_id] = new_pos
    sim.model.cam_quat[cam_id] = np.array([q[3],q[0],q[1],q[2]])
    sim.forward()
    for _ in range(3): env.step(np.zeros(7, dtype=np.float32))
    obs = env.env._get_observations()
    return cv2.cvtColor(np.flipud(obs["agentview_image"]).copy(), cv2.COLOR_RGB2BGR)

# Train frames (default viewpoint, varied positions)
train_frames = []
for _ in range(6):
    dx = rng.uniform(-0.40, -0.01); dy = rng.uniform(-0.30, 0.30)
    img = cv2.resize(capture(0, 0, dx, dy), (thumb, thumb))
    img[:2,:] = (100,255,100); img[-2:,:] = (100,255,100)
    img[:,:2] = (100,255,100); img[:,-2:] = (100,255,100)
    train_frames.append(img)

# Test frames (varied viewpoints, varied positions)
test_frames = []
for theta_d in [7, 14, 25]:
    for phi_d in [0, 180]:
        dx = rng.uniform(-0.40, -0.01); dy = rng.uniform(-0.30, 0.30)
        img = cv2.resize(capture(theta_d, phi_d, dx, dy), (thumb, thumb))
        cv2.putText(img, f"{theta_d},{phi_d}", (3,13), cv2.FONT_HERSHEY_SIMPLEX, 0.3, (255,255,255), 1)
        img[:2,:] = (100,150,255); img[-2:,:] = (100,150,255)
        img[:,:2] = (100,150,255); img[:,-2:] = (100,150,255)
        test_frames.append(img)

env.close()

# Combine into image
def stack(frames, cols=3):
    rows = []
    for r in range(0, len(frames), cols):
        row = frames[r:r+cols]
        while len(row) < cols: row.append(np.zeros((thumb,thumb,3), dtype=np.uint8))
        rows.append(np.concatenate(row, axis=1))
    return np.concatenate(rows, axis=0)

lbl_h = 22
def label(grid, text, color):
    l = np.zeros((lbl_h, grid.shape[1], 3), dtype=np.uint8)
    cv2.putText(l, text, (4, lbl_h-5), cv2.FONT_HERSHEY_SIMPLEX, 0.35, color, 1)
    return np.vstack([l, grid])

tp = label(stack(train_frames), "TRAIN: default view, random positions", (100,255,100))
ep = label(stack(test_frames), "TEST: all viewpoints, random positions", (100,150,255))
h = max(tp.shape[0], ep.shape[0])
if tp.shape[0]<h: tp = np.vstack([tp, np.zeros((h-tp.shape[0],tp.shape[1],3),dtype=np.uint8)])
if ep.shape[0]<h: ep = np.vstack([ep, np.zeros((h-ep.shape[0],ep.shape[1],3),dtype=np.uint8)])
polar = cv2.imread(f"results/{EXPERIMENT_NAME}/polar_plot.png")
ph = h; pw = int(polar.shape[1]*ph/polar.shape[0])
polar = cv2.resize(polar, (pw, ph))
sep = np.zeros((h, 8, 3), dtype=np.uint8)
combined = np.concatenate([polar, sep, tp, sep, ep], axis=1)
cv2.imwrite(f"results/{EXPERIMENT_NAME}/distribution_overview.png", combined)
print("Saved distribution overview")
```

### Step 4: Generate train data 5×5 image grid

This shows what the training data looks like — all from the default viewpoint with varied positions.

```python
#!/usr/bin/env python3
"""gen_train_grid.py — 5x5 grid of training data frames."""
import cv2, numpy as np
from pathlib import Path

EXPERIMENT_NAME = "act_baseline"
data_root = Path("/data/libero/ood_objpos_v3/libero_spatial/task_0")
thumb = 160
grid_size = 5

# Sample 25 training demos evenly
demo_indices = np.round(np.linspace(0, 255, grid_size * grid_size)).astype(int)

frames = []
for di in demo_indices:
    frames_dir = data_root / f"demo_{di}" / "frames"
    if frames_dir.exists():
        img = cv2.imread(str(sorted(frames_dir.glob("*.png"))[0]))
        img = cv2.resize(img, (thumb, thumb))
        img[:2,:] = (100,255,100); img[-2:,:] = (100,255,100)
        img[:,:2] = (100,255,100); img[:,-2:] = (100,255,100)
    else:
        img = np.zeros((thumb, thumb, 3), dtype=np.uint8)
    frames.append(img)

# Arrange in grid
rows = []
for r in range(grid_size):
    rows.append(np.concatenate(frames[r*grid_size:(r+1)*grid_size], axis=1))
grid = np.concatenate(rows, axis=0)

# Add label
lbl = np.zeros((25, grid.shape[1], 3), dtype=np.uint8)
cv2.putText(lbl, "Training data: default viewpoint, varied positions", (5, 18),
            cv2.FONT_HERSHEY_SIMPLEX, 0.4, (100, 255, 100), 1)
grid = np.vstack([lbl, grid])

out = f"results/{EXPERIMENT_NAME}/train_grid.png"
cv2.imwrite(out, grid)
print(f"Saved: {out}")
```

### Step 5: Generate train vs test side-by-side frame comparison

Shows 3×2 grids of train frames (default viewpoint) next to test frames (varied viewpoints).

```python
#!/usr/bin/env python3
"""gen_train_test_comparison.py — Side-by-side train vs test frame grids."""
import cv2, numpy as np
from pathlib import Path

EXPERIMENT_NAME = "act_baseline"
thumb = 140

# Load training frames (from objpos dataset, default viewpoint)
data_root = Path("/data/libero/ood_objpos_v3/libero_spatial/task_0")
train_frames = []
for di in [50, 128, 200]:  # 3 sample demos
    frames_dir = data_root / f"demo_{di}" / "frames"
    for fi in [0, 15]:  # start frame + mid frame
        img = cv2.imread(str(sorted(frames_dir.glob("*.png"))[fi]))
        img = cv2.resize(img, (thumb, thumb))
        img[:2,:] = (100,255,100); img[-2:,:] = (100,255,100)
        img[:,:2] = (100,255,100); img[:,-2:] = (100,255,100)
        train_frames.append(img)

# Load test frames (from eval results at different viewpoints)
test_frames = []
results_root = Path(f"results/{EXPERIMENT_NAME}")
for vi in [5, 28, 56]:  # 3 viewpoints at different angles
    eval_dir = results_root / f"vp_{vi}"
    vids = list(eval_dir.rglob("*.mp4")) if eval_dir.exists() else []
    if vids:
        cap = cv2.VideoCapture(str(vids[0]))
        for _ in range(2):  # 2 frames per video
            ret, frame = cap.read()
            if ret:
                frame = cv2.resize(frame, (thumb, thumb))
                frame[:2,:] = (100,150,255); frame[-2:,:] = (100,150,255)
                frame[:,:2] = (100,150,255); frame[:,-2:] = (100,150,255)
                test_frames.append(frame)
        cap.release()

# Build grids
def stack(frames, cols=2):
    rows = []
    for r in range(0, len(frames), cols):
        row = frames[r:r+cols]
        while len(row) < cols:
            row.append(np.zeros((thumb, thumb, 3), dtype=np.uint8))
        rows.append(np.concatenate(row, axis=1))
    return np.concatenate(rows, axis=0)

lbl_h = 22
def add_label(grid, text, color):
    lbl = np.zeros((lbl_h, grid.shape[1], 3), dtype=np.uint8)
    cv2.putText(lbl, text, (4, lbl_h-5), cv2.FONT_HERSHEY_SIMPLEX, 0.35, color, 1)
    return np.vstack([lbl, grid])

train_panel = add_label(stack(train_frames), "TRAIN: default viewpoint", (100,255,100))
test_panel = add_label(stack(test_frames), "TEST: varied viewpoints", (100,150,255))

# Match heights
h = max(train_panel.shape[0], test_panel.shape[0])
if train_panel.shape[0] < h:
    train_panel = np.vstack([train_panel, np.zeros((h-train_panel.shape[0], train_panel.shape[1], 3), dtype=np.uint8)])
if test_panel.shape[0] < h:
    test_panel = np.vstack([test_panel, np.zeros((h-test_panel.shape[0], test_panel.shape[1], 3), dtype=np.uint8)])

# Arrow divider
div = np.zeros((h, 30, 3), dtype=np.uint8)
cv2.arrowedLine(div, (3, h//2), (27, h//2), (200,200,200), 2, tipLength=0.4)

combined = np.concatenate([train_panel, div, test_panel], axis=1)
cv2.imwrite(f"results/{EXPERIMENT_NAME}/train_test_comparison.png", combined)
print(f"Saved: results/{EXPERIMENT_NAME}/train_test_comparison.png")
```

### Step 5: Generate eval rollout grid video with success/fail indicators

A 5×5 video grid of eval rollouts, with a green checkmark for success and a red X for failure,
rendered at higher resolution (160px per cell = 800×800 video).

```python
#!/usr/bin/env python3
"""gen_rollout_grid.py — 5x5 rollout grid video with success/fail markers."""
import cv2, numpy as np, os, json, glob
from pathlib import Path

EXPERIMENT_NAME = "act_baseline"
n_views = 8
grid_size = 5
thumb = 160     # higher resolution per cell
max_frames = 32

thetas = np.linspace(0, 25, n_views)                   # elevation angles (deg)
phis = np.linspace(0, 360*(1-1/n_views), n_views)      # azimuth angles (deg)
# Subsample 5 of the 8 theta/phi indices to form the 5x5 grid
theta_sample = np.round(np.linspace(0, n_views - 1, grid_size)).astype(int)
phi_sample = np.round(np.linspace(0, n_views - 1, grid_size)).astype(int)

def extract_frames(video_path, max_f=max_frames):
    cap = cv2.VideoCapture(str(video_path))
    frames = []
    while len(frames) < max_f:
        ret, frame = cap.read()
        if not ret: break
        frames.append(frame)
    cap.release()
    while len(frames) < max_f:
        frames.append(frames[-1] if frames else np.zeros((448,448,3), dtype=np.uint8))
    return frames

def get_success(eval_dir):
    """Check if eval succeeded from JSON or video filename."""
    jp = Path(eval_dir) / "eval_libero_spatial_task0.json"
    if jp.exists():
        d = json.load(open(jp))
        return d["success_rate"] > 0.5
    vids = glob.glob(f"{eval_dir}/videos/task_0/*.mp4")
    return any("success" in os.path.basename(v) for v in vids)

def draw_status(img, success):
    """Draw green checkmark or red X in upper right corner."""
    h, w = img.shape[:2]
    cx, cy = w - 22, 18
    if success:
        cv2.circle(img, (cx, cy), 14, (0, 120, 0), -1)
        pts = np.array([[cx-8, cy], [cx-2, cy+7], [cx+9, cy-7]], dtype=np.int32)
        cv2.polylines(img, [pts], False, (100, 255, 100), 3)
    else:
        cv2.circle(img, (cx, cy), 14, (0, 0, 120), -1)
        cv2.line(img, (cx-7, cy-7), (cx+7, cy+7), (100, 100, 255), 3)
        cv2.line(img, (cx+7, cy-7), (cx-7, cy+7), (100, 100, 255), 3)

results_root = Path(f"results/{EXPERIMENT_NAME}")
all_seqs = []
all_success = []

for ri, ti in enumerate(theta_sample):
    for ci, pi in enumerate(phi_sample):
        vi = ti * n_views + pi
        eval_dir = results_root / f"vp_{vi}"
        vids = list(eval_dir.rglob("*.mp4")) if eval_dir.exists() else []

        success = get_success(str(eval_dir))
        all_success.append(success)

        if vids:
            all_seqs.append(extract_frames(vids[0]))
        else:
            all_seqs.append([np.zeros((448,448,3), dtype=np.uint8)] * max_frames)

# Build grid video
video_frames = []
for t in range(max_frames):
    rows = []
    for r in range(grid_size):
        row_imgs = []
        for c in range(grid_size):
            idx = r * grid_size + c
            img = cv2.resize(all_seqs[idx][t], (thumb, thumb))
            draw_status(img, all_success[idx])
            # Viewpoint label on first frame
            if t == 0:
                ti_val = theta_sample[r]
                pi_val = phi_sample[c]
                label = f"{thetas[ti_val]:.0f},{phis[pi_val]:.0f}"
                cv2.putText(img, label, (4, 14), cv2.FONT_HERSHEY_SIMPLEX, 0.35, (255,255,255), 1)
            row_imgs.append(img)
        rows.append(np.concatenate(row_imgs, axis=1))
    video_frames.append(np.concatenate(rows, axis=0))

# Save as H.264 video
h, w = video_frames[0].shape[:2]
tmp = f"/tmp/grid_{EXPERIMENT_NAME}.mp4"
out = f"results/{EXPERIMENT_NAME}/rollout_grid.mp4"
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
writer = cv2.VideoWriter(tmp, fourcc, 5, (w, h))
for f in video_frames:
    writer.write(f)
writer.release()
os.system(f"ffmpeg -y -i {tmp} -c:v libx264 -preset ultrafast -crf 23 -movflags +faststart {out} 2>/dev/null && rm {tmp}")

n_success = sum(all_success)
print(f"Saved: {out} ({w}x{h}, {n_success}/{len(all_success)} success)")
```
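The 5×5 grid subsamples the 8×8 viewpoint sweep via `np.round(np.linspace(...))`, which is not an evenly spaced pick. A standalone check of which (θ, φ) indices and flat viewpoint ids `gen_rollout_grid.py` ends up selecting:

```python
import numpy as np

n_views, grid_size = 8, 5
thetas = np.linspace(0, 25, n_views)                     # elevation angles (deg)
phis = np.linspace(0, 360 * (1 - 1 / n_views), n_views)  # azimuth angles (deg)

# Same subsampling as gen_rollout_grid.py
theta_sample = np.round(np.linspace(0, n_views - 1, grid_size)).astype(int)
phi_sample = np.round(np.linspace(0, n_views - 1, grid_size)).astype(int)

print("theta indices:", theta_sample.tolist())  # [0, 2, 4, 5, 7]
print("phi indices:  ", phi_sample.tolist())

# Flat viewpoint ids (vi = theta_idx * n_views + phi_idx), row-major as in the grid
vis = [ti * n_views + pi for ti in theta_sample for pi in phi_sample]
print("viewpoint ids:", vis)
```

Note the gap pattern [0, 2, 4, 5, 7]: `np.round` uses round-half-to-even, so 1.75 → 2 and 3.5 → 4, which is why the sampled rows/columns are not perfectly uniform.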

### Step 6: Generate HTML report

```python
#!/usr/bin/env python3
"""gen_report.py — Generate HTML report from results."""
import json, numpy as np
from pathlib import Path

EXPERIMENT_NAME = "act_baseline"
results_dir = Path(f"results/{EXPERIMENT_NAME}")
results = json.load(open(results_dir / "grid_results.json"))

n_views = 8
thetas = np.linspace(0, 25, n_views)
phis = np.linspace(0, 360*(1-1/n_views), n_views)
N_EP = 3

# Per-theta averages
per_theta = {}
total_s, total_n = 0, 0
for ti, theta in enumerate(thetas):
    rates = []
    for pi in range(n_views):
        vi = str(ti * n_views + pi)
        if vi in results:
            rates.append(results[vi]["rate"])
            total_s += results[vi]["rate"] * N_EP
            total_n += N_EP
    per_theta[theta] = np.mean(rates) if rates else 0

overall = total_s / total_n if total_n > 0 else 0

# Build HTML
theta_headers = "".join(f"<th>{t:.1f}°</th>" for t in thetas)
theta_cells = "".join(f"<td>{per_theta[t]*100:.0f}%</td>" for t in thetas)

# Full 8x8 grid as HTML table
grid_rows = ""
for ti, theta in enumerate(thetas):
    cells = f"<td><strong>{theta:.1f}°</strong></td>"
    for pi in range(n_views):
        vi = str(ti * n_views + pi)
        if vi in results:
            rate = results[vi]["rate"] * 100
            # Color: green if >50%, yellow if >20% and <=50%, red if <=20%
            color = "#4a4" if rate > 50 else "#aa4" if rate > 20 else "#a44"
            cells += f'<td style="background:{color};color:white;text-align:center">{rate:.0f}%</td>'
        else:
            cells += "<td>--</td>"
    grid_rows += f"<tr>{cells}</tr>\n"

phi_headers = "".join(f"<th>{p:.0f}°</th>" for p in phis)

html = f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>{EXPERIMENT_NAME} — Viewpoint Generalization</title>
<style>
body {{ font-family: sans-serif; max-width: 1200px; margin: 0 auto; padding: 20px; background: #1a1a1a; color: #eee; }}
h1, h2, h3 {{ color: #8cf; }}
table {{ border-collapse: collapse; margin: 1em 0; }}
th, td {{ border: 1px solid #444; padding: 6px 12px; text-align: center; }}
th {{ background: #333; }}
img, video {{ max-width: 100%; margin: 1em 0; border-radius: 4px; }}
.metric {{ font-size: 2em; color: #8f8; font-weight: bold; }}
.caption {{ color: #aaa; font-size: 0.9em; margin-top: -0.5em; }}
</style>
</head>
<body>

<h1>{EXPERIMENT_NAME}</h1>
<p>ACT model trained at default viewpoint (θ=0°) with 64 diverse object positions.
Evaluated across the full 8×8 viewpoint grid with random object positions.</p>

<h2>Overall Result</h2>
<p class="metric">{overall*100:.0f}%</p>
<p>across {total_n} episodes at 64 viewpoints</p>

<h2>Train/Test Distribution</h2>
<img src="distribution_overview.png">
<p class="caption">Left: polar plot (green=train at θ=0°, blue=test). Middle: training frames (default view). Right: test frames (varied viewpoints).</p>

<h3>Training Data (5×5 Grid)</h3>
<img src="train_grid.png">
<p class="caption">25 training samples at default viewpoint with varied object positions.</p>

<h3>Train vs Test Comparison</h3>
<img src="train_test_comparison.png">
<p class="caption">Left: training frames (default view). Right: test frames (OOD viewpoints). Arrow indicates the distribution shift.</p>

<h2>Per-θ Breakdown</h2>
<table>
<tr><th>θ</th>{theta_headers}</tr>
<tr><td><strong>SR%</strong></td>{theta_cells}</tr>
</table>

<h2>Full 8×8 Grid</h2>
<table>
<tr><th>θ \\ φ</th>{phi_headers}</tr>
{grid_rows}
</table>

<h2>Eval Rollout Grid (5×5)</h2>
<video controls preload="metadata"><source src="rollout_grid.mp4" type="video/mp4"></video>
<p class="caption">5×5 grid of eval rollouts. Rows = θ (0° to 25°), columns = φ. Green check = success, red X = failure.</p>

<h2>Reproducibility</h2>
<pre><code>
# Train
CUDA_VISIBLE_DEVICES=4 python train.py --model_type act --run_name {EXPERIMENT_NAME} \\
    --benchmark libero_spatial --task_id 0 \\
    --cache_root /data/libero/ood_objpos_v3_splits/exp4_n64_train \\
    --batch_size 8 --lr 1e-4 --max_minutes 10 --skip_rotation

# Eval (per viewpoint)
CUDA_VISIBLE_DEVICES=4 python eval.py --model_type act \\
    --checkpoint checkpoints/{EXPERIMENT_NAME}/best.pth \\
    --benchmark libero_spatial --task_id 0 --n_episodes 3 \\
    --teleport --zero_rotation --clean_scene --max_steps 600 \\
    --shift_dx 0.0509 --shift_dy -0.2063 \\
    --cam_theta THETA --cam_phi PHI \\
    --out_dir results/{EXPERIMENT_NAME}/vp_VI --save_video
</code></pre>

</body>
</html>
"""

out_path = results_dir / "report.html"
out_path.write_text(html)
print(f"Report saved: {out_path}")
print(f"Open: file://{out_path.resolve()}")
```
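The headline number in the report is an episode-weighted mean over all evaluated viewpoints. A minimal standalone version of that aggregation, using a mock `grid_results.json`-style dict (viewpoint index string → `{"rate": ...}`, as the report script assumes):

```python
N_EP = 3  # episodes per viewpoint, matching the eval.py invocation

def overall_rate(results, n_ep=N_EP):
    """Episode-weighted mean success rate across all evaluated viewpoints."""
    total_s = sum(v["rate"] * n_ep for v in results.values())
    total_n = n_ep * len(results)
    return total_s / total_n if total_n else 0.0

# Mock results for three viewpoints
mock = {"0": {"rate": 1.0}, "1": {"rate": 2/3}, "2": {"rate": 0.0}}
print(f"{overall_rate(mock)*100:.0f}%")
```

Because every viewpoint gets the same `N_EP`, this reduces to the plain mean of per-viewpoint rates; the episode weighting only matters if some viewpoints were evaluated with a different episode count.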

### Putting It All Together

```bash
cd /data/cameron/567_augmentation_viewpoint_project

# 1. Train (or skip if using existing checkpoint)
CUDA_VISIBLE_DEVICES=4 python train.py --model_type act --run_name act_baseline ...

# 2. Eval full grid (takes ~30 min with 3 episodes per viewpoint)
python eval_full_grid.py

# 3. Generate all visualizations
python gen_distribution_viz.py       # polar plot + train/test sample frames
python gen_train_grid.py             # 5x5 training data grid
python gen_train_test_comparison.py  # side-by-side train vs test frames
python gen_rollout_grid.py           # 5x5 eval rollout video with checkmarks

# 4. Generate HTML report
python gen_report.py

# 5. View report
# Open results/act_baseline/report.html in a browser
```

Your results directory will contain:
```
results/act_baseline/
├── grid_results.json          # raw results per viewpoint
├── polar_plot.png             # viewpoint polar distribution
├── distribution_overview.png  # polar + train/test sample frames
├── train_grid.png             # 5x5 training data grid
├── train_test_comparison.png  # side-by-side train vs test frames
├── rollout_grid.mp4           # 5x5 eval rollout video with success/fail
├── report.html                # full HTML report
└── vp_0/ ... vp_63/           # per-viewpoint eval outputs + videos
```
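Before opening the report, it can help to confirm the run actually produced every artifact in the tree above. A small hypothetical checker (`missing_artifacts` is not part of the copied codebase; the filenames simply mirror the listing):

```python
from pathlib import Path

EXPECTED = [
    "grid_results.json", "polar_plot.png", "distribution_overview.png",
    "train_grid.png", "train_test_comparison.png", "rollout_grid.mp4",
    "report.html",
]

def missing_artifacts(results_dir, n_views=64):
    """Return the expected files/dirs absent from results_dir (empty list = all good)."""
    root = Path(results_dir)
    missing = [f for f in EXPECTED if not (root / f).exists()]
    missing += [f"vp_{i}/" for i in range(n_views) if not (root / f"vp_{i}").is_dir()]
    return missing

if __name__ == "__main__":
    gone = missing_artifacts("results/act_baseline")
    print("all artifacts present" if not gone else f"missing: {gone}")
```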

---

## 11. Tips

- **Always use `--teleport`** for eval — open-loop delta execution doesn't work well
- **Always use `--shift_dx 0.0509 --shift_dy -0.2063`** for viewpoint evals — this places the object at the position used during training
- **GPUs 4-9** are typically available on the lab server
- The DINOv3 backbone is trainable by default and fine-tunes during training
- Training for 10 minutes (`--max_minutes 10`) is sufficient for these dataset sizes
- Videos are saved as MPEG-4 by default; re-encode to H.264 for browser playback:
  ```bash
  ffmpeg -i input.mp4 -c:v libx264 -preset ultrafast -crf 23 -movflags +faststart output.mp4
  ```
- The checkpoint saves `min_pos`/`max_pos` normalization values — these are loaded automatically at eval time
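The last tip matters if you ever write a custom eval loop: actions must be de-normalized with the same statistics used at training time. A minimal sketch of the usual min-max scheme mapping to [-1, 1] (an assumption for illustration; check `data.py` for the exact convention this codebase uses):

```python
import numpy as np

def normalize(pos, min_pos, max_pos):
    """Map raw positions into [-1, 1] using training-set statistics."""
    return 2.0 * (pos - min_pos) / (max_pos - min_pos) - 1.0

def denormalize(norm, min_pos, max_pos):
    """Invert normalize(): map model outputs back to raw positions."""
    return (norm + 1.0) / 2.0 * (max_pos - min_pos) + min_pos

# Round-trip sanity check with made-up workspace bounds
min_pos, max_pos = np.array([-0.5, -0.5, 0.0]), np.array([0.5, 0.5, 1.0])
raw = np.array([0.0, 0.25, 0.5])
assert np.allclose(denormalize(normalize(raw, min_pos, max_pos), min_pos, max_pos), raw)
```

If eval uses `min_pos`/`max_pos` from a different checkpoint than the one being evaluated, actions will be silently scaled wrong, which typically shows up as a near-zero success rate at every viewpoint.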
