ACT Viewpoint Generalization

Comparing baseline (no augmentation) vs perspective augmentation, evaluated across all 64 viewpoints (8x8 spherical cap grid)

Train / Test Distribution

Left: polar plot showing train viewpoints (green, theta=0) vs test viewpoints (blue). Middle: sample training frames (default viewpoint, varied object positions). Right: sample test frames (varied viewpoints).

Train vs test viewpoint distribution overview

The model is trained only at the default camera angle (theta=0). At test time, the camera is moved across 64 positions on a spherical cap.
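The sketch below shows one way the 64 test viewpoints could be parameterized. It assumes 8 polar angles evenly spaced over [0, 25] deg and 8 azimuths at 45-deg steps, which reproduces the theta labels used in the tables. The axis convention (theta measured from +z) and the unit radius are arbitrary choices for illustration and are not taken from the eval code.

```python
import math

# Assumed grid: 8 polar angles evenly spaced over [0, 25] deg and 8 azimuths
# at 45-deg steps. This is inferred from the table labels, not from eval.py.
N_THETA, N_PHI, THETA_MAX_DEG = 8, 8, 25.0


def viewpoint_grid(n_theta=N_THETA, n_phi=N_PHI, theta_max=THETA_MAX_DEG):
    """Return (theta_deg, phi_deg) pairs in theta-major order."""
    thetas = [theta_max * i / (n_theta - 1) for i in range(n_theta)]
    phis = [360.0 * j / n_phi for j in range(n_phi)]
    return [(t, p) for t in thetas for p in phis]


def cap_offset(theta_deg, phi_deg, radius=1.0):
    """Camera position on a sphere of `radius`, with theta measured from the
    default viewing axis (+z here, an arbitrary choice) and phi around it."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (radius * math.sin(t) * math.cos(p),
            radius * math.sin(t) * math.sin(p),
            radius * math.cos(t))


grid = viewpoint_grid()
print(len(grid))                                  # 64
print(sorted({round(t, 1) for t, _ in grid}))     # [0.0, 3.6, 7.1, ..., 25.0]
```

Under this parameterization all eight theta = 0 cells share the same camera pose, so those cells differ only in object placement.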

Comparison: Baseline vs Perspective Augmentation

Augmentation: random perspective warp (horizontal ±0.15, vertical ±0.15) applied to all frames during training. The training data (64 object positions, default viewpoint) and the 10-minute training budget are the same as for the baseline.
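For illustration, here is a minimal NumPy sketch of a random perspective warp. It assumes that "H: ±0.15, V: ±0.15" means each image corner is displaced by up to 15% of the width horizontally and 15% of the height vertically. The augmentation code actually used in training is not shown here and may parameterize the warp differently (for example, torchvision's RandomPerspective uses a single distortion_scale). The function names are hypothetical.

```python
import numpy as np


def homography(src, dst):
    """3x3 matrix H (with H[2, 2] = 1) mapping 4 points src[i] -> dst[i]."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = np.linalg.solve(np.asarray(A, dtype=float), np.asarray(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def random_perspective(img, h_mag=0.15, v_mag=0.15, rng=None):
    """Randomly warp an (H, W) or (H, W, C) image.

    Assumption: each corner moves by up to h_mag * W horizontally and
    v_mag * H vertically. Sampling is nearest-neighbour; output pixels that
    map outside the source image are left at 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = img.shape[:2]
    corners = np.array([[0, 0], [W - 1, 0], [W - 1, H - 1], [0, H - 1]], dtype=float)
    jitter = rng.uniform(-1.0, 1.0, size=(4, 2)) * np.array([h_mag * W, v_mag * H])
    # Inverse map: output (jittered) corners -> source corners.
    H_inv = homography(corners + jitter, corners)
    ys, xs = np.mgrid[0:H, 0:W]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)]).astype(float)
    sx, sy, sw = H_inv @ pts
    sx = np.rint(sx / sw).astype(int)
    sy = np.rint(sy / sw).astype(int)
    valid = (sx >= 0) & (sx < W) & (sy >= 0) & (sy < H)
    out = np.zeros((H * W,) + img.shape[2:], dtype=img.dtype)
    out[valid] = img[sy[valid], sx[valid]]
    return out.reshape(img.shape)
```

The source does not say whether one warp is shared across the frames of a trajectory or drawn independently per frame. This sketch draws a new warp on every call.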

Overall

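Baseline (no aug): 26%
Perspective Aug: 21%
Delta: -5 pp (difference of the rounded rates; 41 vs 49 successes out of 192 episodes, i.e. -4.2 pp unrounded)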

Per-Theta Comparison

Theta (deg)   0.0   3.6   7.1   10.7  14.3  17.9  21.4  25.0
Baseline      67%   50%   46%   17%   12%   4%    8%    0%
Persp. Aug    54%   33%   38%   21%   21%   0%    0%    4%
Delta (pp)    -13   -17   -8    +4    +9    -4    -8    +4

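A positive delta means augmentation improved over the baseline, and a negative delta means it hurt. Augmentation helped slightly at mid-range theta (10.7-14.3 deg, +1 and +2 successful episodes out of 24) but hurt at near range (0-7.1 deg) and gave no net gain at far range (17.9-25.0 deg). Each theta has only 24 episodes (8 phi values x 3 episodes), so every per-theta delta corresponds to a change of only 1 to 4 episodes.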
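To gauge how much of the per-theta difference could be noise, the sketch below recovers the integer success counts behind the rounded percentages (24 episodes per theta) and attaches a rough standard error to each delta. The standard error uses the normal approximation for two independent binomial proportions, which is crude at these small counts. The helper names are illustrative.

```python
import math

# Per-theta success rates (percent) from the comparison table; each theta
# aggregates 8 phi values x 3 episodes = 24 episodes.
THETAS = [0.0, 3.6, 7.1, 10.7, 14.3, 17.9, 21.4, 25.0]
BASELINE = [67, 50, 46, 17, 12, 4, 8, 0]
PERSP_AUG = [54, 33, 38, 21, 21, 0, 0, 4]
N = 24


def to_successes(pct, n=N):
    """Recover the integer success count behind a rounded percentage."""
    return round(pct * n / 100)


def delta_with_se(p_base, p_aug, n=N):
    """Return (delta in episodes, delta in pp, SE in pp) for aug - base,
    using the normal approximation for two independent proportions."""
    k1, k2 = to_successes(p_base, n), to_successes(p_aug, n)
    q1, q2 = k1 / n, k2 / n
    se = math.sqrt(q1 * (1 - q1) / n + q2 * (1 - q2) / n)
    return k2 - k1, 100 * (q2 - q1), 100 * se


results = [delta_with_se(b, a) for b, a in zip(BASELINE, PERSP_AUG)]
for theta, (d_eps, d_pp, se_pp) in zip(THETAS, results):
    print(f"theta={theta:4.1f}  delta={d_eps:+d} episodes "
          f"({d_pp:+.1f} pp), SE ~ {se_pp:.1f} pp")
```

Under this approximation, every per-theta delta in the table lies within about two standard errors of zero.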

Full Grid — Perspective Augmentation

theta \ phi (deg)  0     45    90    135   180   225   270   315   Avg
0.0                67%   0%    100%  67%   0%    100%  33%   67%   54%
3.6                67%   0%    0%    0%    0%    67%   67%   67%   33%
7.1                100%  0%    0%    0%    67%   100%  33%   0%    38%
10.7               0%    0%    0%    0%    67%   67%   0%    33%   21%
14.3               100%  0%    0%    0%    33%   0%    0%    33%   21%
17.9               0%    0%    0%    0%    0%    0%    0%    0%    0%
21.4               0%    0%    0%    0%    0%    0%    0%    0%    0%
25.0               0%    0%    0%    0%    0%    0%    33%   0%    4%

Baseline: No Augmentation

Summary

Overall success rate: 26%
Total episodes: 192
Viewpoints: 64
Episodes per viewpoint: 3
Success at training viewpoint (theta = 0 deg): 67%
Experiment: ACT model trained on 64 object positions at the default viewpoint (theta = 0 deg). Checkpoint: act_v2_exp4_n64/best.pth. Evaluated with --teleport --zero_rotation --clean_scene --max_steps 600 and random object positions per viewpoint (seed=42).

Per-Theta Breakdown

theta (deg)   0.0   3.6   7.1   10.7  14.3  17.9  21.4  25.0
Success       67%   50%   46%   17%   12%   4%    8%    0%

Full 8x8 Viewpoint Grid

theta \ phi (deg)  0     45    90    135   180   225   270   315   Avg
0.0                100%  67%   33%   67%   67%   100%  33%   67%   67%
3.6                100%  0%    0%    100%  67%   100%  33%   0%    50%
7.1                100%  33%   33%   67%   67%   33%   33%   0%    46%
10.7               0%    33%   67%   0%    33%   0%    0%    0%    17%
14.3               0%    67%   0%    0%    33%   0%    0%    0%    12%
17.9               0%    0%    0%    0%    0%    33%   0%    0%    4%
21.4               67%   0%    0%    0%    0%    0%    0%    0%    8%
25.0               0%    0%    0%    0%    0%    0%    0%    0%    0%
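As a consistency check, the Avg columns and overall rates of both 8x8 grids can be recomputed from the individual cells. Each cell covers 3 episodes, so 33%, 67% and 100% correspond to 1, 2 and 3 successes. The sketch below copies the cell values from the two tables. The names are illustrative.

```python
# Cell success rates (percent, 3 episodes per cell) copied from the two 8x8
# tables; rows are theta = 0.0 ... 25.0 deg, columns phi = 0, 45, ..., 315 deg.
BASELINE_GRID = [
    [100, 67, 33, 67, 67, 100, 33, 67],
    [100, 0, 0, 100, 67, 100, 33, 0],
    [100, 33, 33, 67, 67, 33, 33, 0],
    [0, 33, 67, 0, 33, 0, 0, 0],
    [0, 67, 0, 0, 33, 0, 0, 0],
    [0, 0, 0, 0, 0, 33, 0, 0],
    [67, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
AUG_GRID = [
    [67, 0, 100, 67, 0, 100, 33, 67],
    [67, 0, 0, 0, 0, 67, 67, 67],
    [100, 0, 0, 0, 67, 100, 33, 0],
    [0, 0, 0, 0, 67, 67, 0, 33],
    [100, 0, 0, 0, 33, 0, 0, 33],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 33, 0],
]
EPISODES_PER_CELL = 3


def cell_successes(pct, n=EPISODES_PER_CELL):
    """Convert a rounded cell percentage back to a success count (33% -> 1)."""
    return round(pct * n / 100)


def summarize(grid, n=EPISODES_PER_CELL):
    """Return (per-theta success counts, total successes, total episodes)."""
    per_theta = [sum(cell_successes(p, n) for p in row) for row in grid]
    return per_theta, sum(per_theta), n * sum(len(row) for row in grid)


for name, grid in [("baseline", BASELINE_GRID), ("persp_aug", AUG_GRID)]:
    per_theta, k, total = summarize(grid)
    avgs = [f"{100 * c / (EPISODES_PER_CELL * 8):.0f}%" for c in per_theta]
    print(name, avgs, f"overall {k}/{total} = {100 * k / total:.1f}%")
```

This gives 49/192 (25.5%) for the baseline and 41/192 (21.4%) with augmentation. Both are consistent with the rounded 26% and 21% overall figures, and the recomputed per-theta averages match the Avg columns.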

Eval Rollout Grid (5x5)

Each cell shows one eval episode at that viewpoint. Border color indicates success rate.

Key Observations

Reproduction

Checkpoint

/data/cameron/para_normalized_losses/libero/checkpoints/act_v2_exp4_n64/best.pth

Eval Command (per viewpoint; THETA, PHI, SHIFT_DX, SHIFT_DY and VI are placeholders filled in for each viewpoint)

export PYTHONPATH=/data/cameron/LIBERO:$PYTHONPATH
export DINO_REPO_DIR=/data/cameron/keygrip/dinov3
export DINO_WEIGHTS_PATH=/data/cameron/keygrip/dinov3/weights/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth

CUDA_VISIBLE_DEVICES=4 python eval.py --model_type act \
    --checkpoint /data/cameron/para_normalized_losses/libero/checkpoints/act_v2_exp4_n64/best.pth \
    --benchmark libero_spatial --task_id 0 --n_episodes 3 \
    --teleport --zero_rotation --clean_scene --max_steps 600 \
    --shift_dx SHIFT_DX --shift_dy SHIFT_DY \
    --cam_theta THETA --cam_phi PHI \
    --out_dir results/act_baseline/vp_VI --save_video
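The sketch below shows how the per-viewpoint command could be expanded into all 64 invocations. It is not the logic of eval_full_grid.py, which is not shown here. It assumes that VI is a flat index (theta index x 8 + phi index) and that THETA is passed at the one-decimal precision shown in the tables. The SHIFT_DX/SHIFT_DY values are zero placeholders only, because the way the per-viewpoint object offsets are chosen is not specified. eval_command is a hypothetical helper.

```python
import shlex

CKPT = "/data/cameron/para_normalized_losses/libero/checkpoints/act_v2_exp4_n64/best.pth"


def eval_command(vi, theta, phi, shift_dx, shift_dy, out_root="results/act_baseline"):
    """Build the argv for one viewpoint, mirroring the eval command above.
    vi (flat viewpoint index) and the shift values are caller-supplied."""
    return [
        "python", "eval.py", "--model_type", "act",
        "--checkpoint", CKPT,
        "--benchmark", "libero_spatial", "--task_id", "0", "--n_episodes", "3",
        "--teleport", "--zero_rotation", "--clean_scene", "--max_steps", "600",
        "--shift_dx", str(shift_dx), "--shift_dy", str(shift_dy),
        "--cam_theta", str(theta), "--cam_phi", str(phi),
        "--out_dir", f"{out_root}/vp_{vi}", "--save_video",
    ]


# Assumed grid: theta rounded to one decimal as in the tables, phi in 45-deg steps.
thetas = [round(25.0 * i / 7, 1) for i in range(8)]
phis = [45 * j for j in range(8)]
cmds = [eval_command(i * 8 + j, t, p, 0.0, 0.0)  # 0.0 shifts are placeholders
        for i, t in enumerate(thetas) for j, p in enumerate(phis)]
print(shlex.join(cmds[9]))
```

To launch each command, pass it to subprocess.run with the exported PYTHONPATH, DINO_REPO_DIR, DINO_WEIGHTS_PATH and CUDA_VISIBLE_DEVICES variables in its env.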

Full Grid Script

python eval_full_grid.py   # evaluates all 64 viewpoints, saves results/act_baseline/grid_results.json