Augmentation for Viewpoint & Spatial Generalization in Robot Manipulation

CSCI 567 Project Check-In — Can data augmentations improve ACT's robustness to viewpoint and spatial changes?

Contents
  1. Problem Setup & Motivation
  2. Augmentation Design
  3. Spatial Generalization (Camera Translation)
  4. Viewpoint Generalization (Camera Rotation)
  5. Multi-View Training Details
  6. Video Grid Rollouts
  7. Key Findings & Analysis
  8. Next Steps

1. Problem Setup & Motivation

ACT (Action Chunking with Transformers) is trained at a single default camera viewpoint with 64 diverse object positions. At test time, the camera may be moved (translated or rotated), and objects may be at new positions. Because the model predicts world-frame 3D positions, its targets are viewpoint-invariant; only the visual input changes when the camera moves.
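
To illustrate why a world-frame target is viewpoint-invariant, the sketch below expresses one fixed world point in the frame of a camera before and after a pure translation. The point, the camera positions, and the helper name `world_to_camera` are illustrative assumptions rather than values from this project, and rotation is left out for brevity.

```python
def world_to_camera(point_world, camera_position):
    """Express a world-frame point in the frame of a camera that is
    translated (but not rotated) relative to the world frame."""
    return tuple(p - c for p, c in zip(point_world, camera_position))

# Hypothetical numbers, in meters.
target_world = (0.40, 0.10, 0.05)   # regression target: identical for every view
default_cam = (0.00, 0.00, 1.00)
shifted_cam = (0.15, 0.00, 1.00)    # camera moved 15 cm along x

target_in_default = world_to_camera(target_world, default_cam)
target_in_shifted = world_to_camera(target_world, shifted_cam)

print("world-frame target:    ", target_world)
print("camera-frame (default):", target_in_default)
print("camera-frame (shifted):", target_in_shifted)
```

The world-frame tuple is the same in both cases, while the camera-frame coordinates differ by the camera offset. A camera-frame target would therefore have to be relabeled for every viewpoint.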

Train vs test viewpoint distribution

Left: polar plot of viewpoint grid (green=train at theta=0, blue=test). Middle: training frames. Right: test frames at varied viewpoints.

Architecture: DINOv2 ViT-S/14 (or ResNet-18) backbone → global feature → MLP → 3D position prediction. We identified and removed a problematic 2D keypoint conditioning input (start_keypoint_2d) that created a train/eval mismatch when augmentation was applied.

2. Augmentation Design

We explored augmentations that simulate viewpoint changes as 2D image transformations. When an augmentation is applied (with 50% probability), it uses the same parameters for every frame in a trajectory.
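
A minimal sketch of trajectory-consistent augmentation follows, assuming frames are stored as lists of pixel rows and using a random crop as the transform. The function names, the `max_shift` parameter, and the choice to sample once per trajectory are assumptions for illustration, not the project's actual pipeline.

```python
import random

def sample_crop_params(height, width, max_shift, rng):
    """Draw ONE crop offset and output size to reuse for a whole trajectory."""
    out_h, out_w = height - max_shift, width - max_shift
    top = rng.randint(0, max_shift)    # randint is inclusive on both ends
    left = rng.randint(0, max_shift)
    return top, left, out_h, out_w

def crop(frame, top, left, out_h, out_w):
    """Crop a frame stored as a list of rows."""
    return [row[left:left + out_w] for row in frame[top:top + out_h]]

def augment_trajectory(frames, p=0.5, max_shift=2, rng=None):
    """With probability p, apply the same sampled crop to every frame.
    A real pipeline would also resize crops back to the input resolution."""
    rng = rng or random.Random()
    if rng.random() >= p:
        return frames                  # trajectory left unaugmented
    height, width = len(frames[0]), len(frames[0][0])
    params = sample_crop_params(height, width, max_shift, rng)
    return [crop(frame, *params) for frame in frames]

# Tiny demo: 3 frames of 6x6 "pixels" that record their own (row, col).
demo = [[[(r, c) for c in range(6)] for r in range(6)] for _ in range(3)]
augmented = augment_trajectory(demo, p=1.0, rng=random.Random(0))
print({frame[0][0] for frame in augmented})  # one element: same offset in all frames
```

Sampling per trajectory rather than per frame keeps the apparent camera pose fixed within an episode, which matches how a real camera displacement would look.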

Augmentation range exploration

Each row is an augmentation type, columns span min-to-max parameter range.

Crop vs Real Camera Translation

A key finding: crop augmentation is a poor geometric proxy for real camera translation. Crops lack parallax and scale change.

Crop vs camera translation comparison

Left: crop augmentation (2D reframing). Right: real camera translation (3D parallax). The magnitudes don't match: crops simulate much smaller shifts than the eval grid tests.

Directional comparison

Per-direction comparison: camera translation (green) vs crop (blue). The crop preserves perspective while the camera translation creates real parallax.
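
The parallax gap can be made concrete with the pinhole camera model, in which a point at depth Z shifts by f·t/Z pixels when the camera translates by t parallel to the image plane. The sketch below uses a hypothetical focal length and hypothetical scene depths; none of these numbers come from the project's simulator.

```python
def pixel_shift(focal_px, translation_m, depth_m):
    """Image-plane shift (pixels) of a point at depth_m when a pinhole camera
    translates by translation_m parallel to the image plane."""
    return focal_px * translation_m / depth_m

FOCAL_PX = 300.0                                   # hypothetical focal length
DEPTHS_M = {"bowl": 0.6, "table edge": 0.9, "background": 2.0}  # hypothetical

for t in (0.075, 0.15):                            # 7.5 cm and 15 cm camera moves
    shifts = {name: round(pixel_shift(FOCAL_PX, t, z), 1)
              for name, z in DEPTHS_M.items()}
    print(f"camera move {t * 100:.1f} cm -> pixel shifts {shifts}")
```

Under these assumed numbers, near objects move several times further in the image than the background. A crop shifts every pixel by the same amount, so it cannot reproduce that depth-dependent motion.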

3. Spatial Generalization (Camera Translation)

We evaluate on a 5x5 grid of camera translations (±15cm horizontal, ±10cm vertical) with a fixed, centered object position. Scoring is multi-stage: PLACE = full task success, GRASP = bowl lifted but not placed, MISS = no grasp. Grasp+ counts the grid cells scored PLACE or GRASP, out of 25.
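
The evaluation grid and scoring rule can be sketched as below. The even spacing of grid points and the two boolean flags passed to `score_episode` are assumptions; the report states only the grid size, its extents, and the three outcome labels.

```python
def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

DX_CM = linspace(-15.0, 15.0, 5)   # horizontal camera offsets
DY_CM = linspace(-10.0, 10.0, 5)   # vertical camera offsets
GRID = [(dx, dy) for dx in DX_CM for dy in DY_CM]   # 25 cells, rows indexed by dx

def score_episode(bowl_lifted, bowl_placed):
    """Map hypothetical per-episode flags to the three outcome labels."""
    if bowl_placed:
        return "PLACE"
    if bowl_lifted:
        return "GRASP"
    return "MISS"

print(len(GRID), "cells; dx values:", DX_CM, "dy values:", DY_CM)
print(score_episode(bowl_lifted=True, bowl_placed=False))
```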

Single-Viewpoint Training Results

Model                     Place  Grasp  Miss  Grasp+  Notes
No Augmentation               3      4    18       7  Strong at center, drops off quickly
Crop (50%, aggressive)        0      5    20       5  Wider grasp spread, no placements
Crop → NoAug Curriculum       2      9    14      11  Best single-VP approach
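
The rows above are consistent with reading Grasp+ as Place + Grasp over the 25 grid cells. The short check below recomputes it from the reported counts; the dictionary layout and the helper name `grasp_plus` are just one convenient way to hold the numbers.

```python
GRID_CELLS = 25

# (place, grasp, miss) counts copied from the single-viewpoint results table.
SINGLE_VP_RESULTS = {
    "No Augmentation": (3, 4, 18),
    "Crop (50%, aggressive)": (0, 5, 20),
    "Crop → NoAug Curriculum": (2, 9, 14),
}

def grasp_plus(place, grasp):
    """Cells where the bowl was at least lifted."""
    return place + grasp

for name, (place, grasp, miss) in SINGLE_VP_RESULTS.items():
    # Every grid cell must receive exactly one label.
    assert place + grasp + miss == GRID_CELLS
    print(f"{name}: grasp+ = {grasp_plus(place, grasp)} / {GRID_CELLS}")
```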

Multi-Stage Translation Grids

No Augmentation (7 grasp+). Green=PLACE, Yellow=GRASP, Red=MISS.
Crop → NoAug Curriculum (11 grasp+). Widest coverage with 2 placements.

Multi-Viewpoint Translation Training

Training data generated at 5 camera positions (center + 4 corners at ±10cm, ±7.5cm), 10 demos each = 50 demos.
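
As a sketch of how this training set could be enumerated, the block below builds the five camera offsets and the resulting demo schedule. Treating the first coordinate as horizontal and the second as vertical is an assumption, since the report does not label the axes.

```python
from itertools import product

CENTER_CM = (0.0, 0.0)
# Four corners at (±10 cm, ±7.5 cm); axis order is assumed (horizontal, vertical).
CORNERS_CM = list(product((-10.0, 10.0), (-7.5, 7.5)))
TRAIN_VIEWPOINTS_CM = [CENTER_CM] + CORNERS_CM
DEMOS_PER_VIEWPOINT = 10

# One (viewpoint, demo index) entry per demonstration to collect.
schedule = [(viewpoint, demo_idx)
            for viewpoint in TRAIN_VIEWPOINTS_CM
            for demo_idx in range(DEMOS_PER_VIEWPOINT)]

print(TRAIN_VIEWPOINTS_CM)
print(len(schedule), "demos in total")
```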

Model                      Place  Grasp  Miss  Grasp+
Multi-VP + Crop (50%)          2      8    15      10
Multi-VP + All Aug (50%)       0      5    20       5
Multi-VP + No Aug              0      2    23       2

Key finding: Crop augmentation plus real multi-view data is the best multi-view configuration (10 grasp+), although it does not beat the single-viewpoint Crop → NoAug curriculum (11 grasp+). Multi-view data without augmentation performs far worse (2 grasp+), which suggests that augmentation helps bridge the gaps between discrete training viewpoints.

Multi-VP + Crop (10 grasp+). Best multi-view approach.
Multi-VP + All Aug (5 grasp+). Extra augmentations hurt vs crop-only.

4. Viewpoint Generalization (Camera Rotation)

Evaluated on the 8x8 theta/phi spherical cap grid (64 viewpoints, 3 episodes each). theta = elevation angle from default.
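
The grid can be reconstructed approximately as follows. The 0-25 degree theta range is the one stated elsewhere in this report; evenly spaced theta values and azimuths spread evenly over 360 degrees are assumptions.

```python
N_THETA, N_PHI, EPISODES_PER_VIEWPOINT = 8, 8, 3
THETA_MAX_DEG = 25.0

# Evenly spaced elevation offsets from the default view (assumed spacing).
thetas = [i * THETA_MAX_DEG / (N_THETA - 1) for i in range(N_THETA)]
# Azimuths assumed evenly spread over the full circle.
phis = [j * 360.0 / N_PHI for j in range(N_PHI)]

viewpoints = [(theta, phi) for theta in thetas for phi in phis]
total_episodes = len(viewpoints) * EPISODES_PER_VIEWPOINT

print([round(theta, 1) for theta in thetas])
print(len(viewpoints), "viewpoints,", total_episodes, "episodes")
```

With this spacing, the rounded theta values are 0.0, 3.6, 7.1, 10.7, 14.3, 17.9, 21.4, and 25.0, and each theta row contains 8 × 3 = 24 episodes.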

Models Trained on Translation Data, Evaluated on Rotation

Model \ theta (deg)   0.0   3.6   7.1   10.7  14.3  17.9  21.4  25.0  Overall
Multi-VP + Crop       63%   33%   33%   8%    8%    4%    0%    0%    19%
Multi-VP + All Aug    21%   4%    37%   17%   4%    0%    8%    0%    11%
Multi-VP + No Aug     12%   21%   12%   4%    0%    0%    0%    0%    6%

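The all-augmentation model is best at theta = 7.1, 10.7, and 21.4 despite being trained only on translation data, but crop-only is better at theta = 14.3 and 17.9, so the mid-range advantage is not consistent. Each theta row has only 24 episodes (8 azimuths × 3 episodes), so a single episode is worth about 4 percentage points. The perspective/shear components may simulate rotation-like changes.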
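
The per-theta comparison can be checked directly from the table. The sketch below finds the best model at each theta and recomputes the overall rate as the mean of the per-theta rates, which is valid here because every theta row has the same number of episodes. The helper names are illustrative.

```python
THETAS_DEG = [0.0, 3.6, 7.1, 10.7, 14.3, 17.9, 21.4, 25.0]

# Success percentages copied from the rotation-evaluation table.
SUCCESS_PCT = {
    "Multi-VP + Crop":    [63, 33, 33, 8, 8, 4, 0, 0],
    "Multi-VP + All Aug": [21, 4, 37, 17, 4, 0, 8, 0],
    "Multi-VP + No Aug":  [12, 21, 12, 4, 0, 0, 0, 0],
}

def overall(pcts):
    """Mean of per-theta rates (equal episode counts per theta)."""
    return sum(pcts) / len(pcts)

def best_models(theta_index):
    """Names of all models tied for the highest rate at one theta."""
    column = {name: pcts[theta_index] for name, pcts in SUCCESS_PCT.items()}
    top = max(column.values())
    return [name for name, value in column.items() if value == top]

for i, theta in enumerate(THETAS_DEG):
    print(f"theta={theta:>4}: best = {best_models(i)}")
for name, pcts in SUCCESS_PCT.items():
    print(f"{name}: mean of per-theta rates = {overall(pcts):.1f}%")
```

At theta = 25.0 all three models sit at 0%, so the function reports a three-way tie rather than a winner.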

Multi-VP + Crop on rotation grid (19%).
Multi-VP + All Aug on rotation grid (11%). Best at theta = 7.1, 10.7, and 21.4.

Reference: Models Trained on All-Viewpoint Rotation Data

For comparison, training directly on 640 rotation-viewpoint demos with DINOv2 backbone:

All VP + No Aug (DINOv2): 72%
All VP + Persp Aug (DINOv2): 26%
Default VP Baseline: 20%

Real multi-viewpoint rotation data is by far the strongest lever (72%, versus 20% for the default-viewpoint baseline). Adding perspective augmentation on top of diverse rotation data hurt, lowering success to 26%.

5. Multi-View Training Details

Rotation Viewpoint Dataset

Viewpoint polar plot

Green = training viewpoints (theta=0), Blue = test viewpoints. The rotation grid spans 0-25 degrees elevation with 8 azimuth angles.

Translation Viewpoint Dataset

Translation augmentation preview

Preview of pure translation augmentation at ±80px (horizontal and vertical).
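
A pure pixel translation of this kind can be sketched as below, assuming frames are lists of pixel rows, uncovered pixels are filled with a constant, and shifts are drawn uniformly within ±80px. The fill value, the sign convention, and the uniform sampling are assumptions, not details taken from the project.

```python
import random

MAX_SHIFT_PX = 80

def translate(frame, dx, dy, fill=0):
    """Shift a frame right by dx and down by dy pixels, padding uncovered
    pixels with `fill`. Negative values shift left or up."""
    height, width = len(frame), len(frame[0])
    out = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            src_r, src_c = r - dy, c - dx
            if 0 <= src_r < height and 0 <= src_c < width:
                out[r][c] = frame[src_r][src_c]
    return out

def sample_shift(rng):
    """Draw a (dx, dy) shift uniformly within the ±80px range."""
    return (rng.randint(-MAX_SHIFT_PX, MAX_SHIFT_PX),
            rng.randint(-MAX_SHIFT_PX, MAX_SHIFT_PX))

tiny = [[1, 2], [3, 4]]
print(translate(tiny, dx=1, dy=0))   # content moves one pixel to the right
print(sample_shift(random.Random(0)))
```

Like a crop, this moves every pixel by the same amount, so it shares the same lack of parallax relative to a real camera translation.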

6. Video Grid Rollouts

Rotation Grid (5x5 subsampled from 8x8)

Rows = theta (elevation), Columns = phi (azimuth). Border color = success rate.

Baseline (default VP, no aug) — 20%
All VP + Persp Aug (DINOv2, 10min) — 26%

Translation Grid (5x5, multi-stage)

Rows = dx (camera right/left), Columns = dy (camera up/down). Green=PLACE, Yellow=GRASP, Red=MISS.

Single-VP No Aug — 7 grasp+
Crop → NoAug Curriculum — 11 grasp+
Multi-VP + Crop — 10 grasp+
Multi-VP + All Aug — 5 grasp+

7. Key Findings & Analysis

8. Next Steps