ACT Viewpoint Generalization

Comparing baseline (no augmentation) vs perspective augmentation, evaluated across all 64 viewpoints (8x8 spherical cap grid)

Train / Test Distribution

Left: polar plot showing train viewpoints (green, theta=0) vs test viewpoints (blue). Middle: sample training frames (default viewpoint, varied object positions). Right: sample test frames (varied viewpoints).

Train vs test viewpoint distribution overview

The model is trained only at the default camera angle (theta=0). At test time, the camera is moved across 64 positions on a spherical cap.

Comparison: Baseline vs Perspective Augmentation

Augmentation: Random perspective warp (H: ±0.15, V: ±0.15) applied to all frames during training. Same training data (64 object positions, default viewpoint), same 10-minute training budget.

Overall

Baseline (no aug)

26%

Perspective Aug

21%

Delta

-5%

Per-Theta Comparison

Theta	0.0	3.6	7.1	10.7	14.3	17.9	21.4	25.0
Baseline	67%	50%	46%	17%	12%	4%	8%	0%
Persp. Aug	54%	33%	38%	21%	21%	0%	0%	4%
Delta	-13%	-17%	-8%	+4%	+9%	-4%	-8%	+4%

Green = augmentation improved over baseline, red = augmentation hurt. The augmentation slightly helped at mid-range theta (10.7-14.3) but hurt at near-range (0-7.1) and didn't help at far-range (17.9-25.0).

Full Grid — Perspective Augmentation

theta \ phi	0	45	90	135	180	225	270	315	Avg
0.0	67%	0%	100%	67%	0%	100%	33%	67%	54%
3.6	67%	0%	0%	0%	0%	67%	67%	67%	33%
7.1	100%	0%	0%	0%	67%	100%	33%	0%	38%
10.7	0%	0%	0%	0%	67%	67%	0%	33%	21%
14.3	100%	0%	0%	0%	33%	0%	0%	33%	21%
17.9	0%	0%	0%	0%	0%	0%	0%	0%	0%
21.4	0%	0%	0%	0%	0%	0%	0%	0%	0%
25.0	0%	0%	0%	0%	0%	0%	33%	0%	4%

Baseline: No Augmentation

Summary

Overall Success Rate

26%

Total Episodes

192

Viewpoints

Episodes / Viewpoint

Training Viewpoint (0deg)

67%

Experiment: ACT model trained on 64 object positions at default viewpoint (theta=0deg). Checkpoint: act_v2_exp4_n64/best.pth. Evaluated with --teleport --zero_rotation --clean_scene --max_steps 600, random object positions per viewpoint (seed=42).

Per-Theta Breakdown

0.0deg

67%

3.6deg

50%

7.1deg

46%

10.7deg

17%

14.3deg

12%

17.9deg

21.4deg

25.0deg

Full 8x8 Viewpoint Grid

theta \ phi	0deg	45deg	90deg	135deg	180deg	225deg	270deg	315deg	Avg
0.0deg	100%	67%	33%	67%	67%	100%	33%	67%	67%
3.6deg	100%	0%	0%	100%	67%	100%	33%	0%	50%
7.1deg	100%	33%	33%	67%	67%	33%	33%	0%	46%
10.7deg	0%	33%	67%	0%	33%	0%	0%	0%	17%
14.3deg	0%	67%	0%	0%	33%	0%	0%	0%	12%
17.9deg	0%	0%	0%	0%	0%	33%	0%	0%	4%
21.4deg	67%	0%	0%	0%	0%	0%	0%	0%	8%
25.0deg	0%	0%	0%	0%	0%	0%	0%	0%	0%

Eval Rollout Grid (5x5)

Each cell shows one eval episode at that viewpoint. Border color indicates success rate.

Key Observations

Performance degrades steeply with increasing theta (elevation offset from training viewpoint)
At the training viewpoint (theta=0deg), the model achieves 67% averaged across all phi angles
By theta=10.7deg, performance drops to 17%
At theta>=17.9deg, the model is near-zero (0-8%)
Phi (azimuth) has high variance but no clear directional bias at low theta
Overall 26% across all 192 episodes (vs README reported 28% with different random seeds)

Reproduction

Checkpoint

/data/cameron/para_normalized_losses/libero/checkpoints/act_v2_exp4_n64/best.pth

Eval Command (per viewpoint)

export PYTHONPATH=/data/cameron/LIBERO:$PYTHONPATH
export DINO_REPO_DIR=/data/cameron/keygrip/dinov3
export DINO_WEIGHTS_PATH=/data/cameron/keygrip/dinov3/weights/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth

CUDA_VISIBLE_DEVICES=4 python eval.py --model_type act \
    --checkpoint /data/cameron/para_normalized_losses/libero/checkpoints/act_v2_exp4_n64/best.pth \
    --benchmark libero_spatial --task_id 0 --n_episodes 3 \
    --teleport --zero_rotation --clean_scene --max_steps 600 \
    --shift_dx SHIFT_DX --shift_dy SHIFT_DY \
    --cam_theta THETA --cam_phi PHI \
    --out_dir results/act_baseline/vp_VI --save_video

Full Grid Script

python eval_full_grid.py   # evaluates all 64 viewpoints, saves results/act_baseline/grid_results.json