# PARA Pitch

## 30-Second Version

Modern image features are multiview-consistent and can find the same object across viewpoints, so why do robot policies still break when you move the camera or shift the object? The problem isn't the features, it's the action head. Standard policies aggregate spatial features through a global token and regress absolute 3D coordinates. That gives sparse supervision, is not equivariant, and forces the policy to learn the joint distribution over camera pose, scene layout, and robot state from limited data.

PARA treats action prediction as what it actually is: keypoint detection. Instead of regressing coordinates from a pooled vector, we predict a per-pixel heatmap — "where should the gripper be in the image?" — and lift to 3D with height bins and camera geometry. Dense supervision at every pixel, local and equivariant by construction, and no 3D reconstruction needed.

Same backbone, same data: 97% vs 9% on a real robot. 94% vs 0% in a completely new environment.

## Full Pitch (2 Minutes)

### Setup: Features are great, policies aren't

Modern image features are remarkably good. DINOv2, SigLIP — these representations are multiview-consistent, they find the same object across viewpoints, they transfer across scenes. And yet robot policies built on top of them still break when you move the camera 10 degrees or shift the object a few centimeters. They're also not nearly as data-efficient as you'd expect given how good the underlying features are. Why?

### The bottleneck: the action head

The standard recipe takes these rich spatial features, aggregates them through a global token (e.g. a CLS token that attends to all spatial features), and regresses absolute 3D coordinates via an MLP. That introduces two problems:

1. **Sparse supervision**: you get one coordinate label per training image instead of leveraging the full spatial feature map. With 20 demonstrations, that's 20 regression targets.
2. **Hard mapping**: the MLP must learn the joint distribution over camera pose × scene layout × robot state. The same 3D target produces different CLS activations from different viewpoints, and the MLP has to map all of them to the same output. That's a huge function to fit from limited data.
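
To make the recipe concrete, here is a minimal NumPy sketch of a global-token regression head. It is an illustration, not the ACT implementation: the single-query attention pooling, the one-layer MLP, the random weights, and all sizes (`num_patches`, `dim`, `hidden_dim`) are assumptions chosen for brevity.

```python
import numpy as np

def global_regression_head(feats, query, w1, w2):
    """Pool (N, D) patch features into one vector, then regress xyz with an MLP."""
    scores = feats @ query / np.sqrt(feats.shape[1])  # (N,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over patches
    pooled = weights @ feats                          # (D,) global token
    hidden = np.maximum(pooled @ w1, 0.0)             # one ReLU layer
    return hidden @ w2                                # (3,) absolute xyz

rng = np.random.default_rng(0)
num_patches, dim, hidden_dim = 32 * 32, 64, 128       # hypothetical sizes
feats = rng.normal(size=(num_patches, dim))
query = rng.normal(size=dim)
w1 = rng.normal(size=(dim, hidden_dim)) * 0.1
w2 = rng.normal(size=(hidden_dim, 3)) * 0.1

xyz = global_regression_head(feats, query, w1, w2)

# Shuffling the patch order leaves the prediction unchanged: the pooling step
# itself has no notion of *where* a feature came from.
shuffled = feats[rng.permutation(num_patches)]
xyz_shuffled = global_regression_head(shuffled, query, w1, w2)
print(xyz.shape, np.allclose(xyz, xyz_shuffled))
```

In a real backbone, positional embeddings put location information into the features themselves, so the head is not literally blind to position. The sketch only shows that, after pooling, all spatial structure has to be recovered by the MLP from a single vector, with one 3-number label per image as supervision.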

### Why not 3D policies?

The natural reaction is to go 3D — build explicit 3D representations so the policy is viewpoint-invariant by construction. But 3D policies have their own problems:

- They need depth sensors or multi-view reconstruction
- They don't actually help with OOD object positions (several works have shown this)
- They throw away pretrained 2D features and train from scratch

The key insight is that 2D features are *already* mostly multiview-consistent. You don't need to reconstruct 3D to get robustness. You need to use the features you already have in the right way.

### The intuition: it's just keypoint detection

If someone asked you to build a keypoint detector for a new object from a few examples, you wouldn't pool features into a vector and regress (x, y) coordinates. You'd train a spatial heatmap — a per-pixel classifier that asks "is the target here?" at every location. That's obviously the right approach for localization.

Robot action prediction *is* localization — "where should the gripper go in the image?" — but the field has been treating it as coordinate regression.

PARA treats it as what it is: a heatmap prediction problem. The only question is how to lift it to 3D, and the answer is simple — predict height bins along each pixel's ray and use camera geometry. No depth sensor, no 3D reconstruction, no multi-view setup.
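
The lifting step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PARA's actual code: it assumes a calibrated pinhole camera (intrinsics `K`, camera-to-world rotation `R_wc`, camera center `cam_center`), that "height" means the world z-coordinate, and that height logits are predicted per pixel over uniformly spaced bins. The function names and all numeric values are hypothetical.

```python
import numpy as np

def lift_to_3d(u, v, height, K, R_wc, cam_center):
    """Intersect the camera ray through pixel (u, v) with the plane z = height."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray in camera frame
    ray_world = R_wc @ ray_cam                          # ray in world frame
    if abs(ray_world[2]) < 1e-9:
        raise ValueError("ray is parallel to the height plane")
    scale = (height - cam_center[2]) / ray_world[2]
    if scale <= 0:
        raise ValueError("height plane is behind the camera")
    return cam_center + scale * ray_world

def decode_target(heatmap, height_logits, bin_heights, K, R_wc, cam_center):
    """Pick the heatmap peak, read its most likely height bin, and lift to 3D."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    height = bin_heights[np.argmax(height_logits[v, u])]
    return lift_to_3d(float(u), float(v), height, K, R_wc, cam_center)

# Hypothetical setup: camera 1 m above the table, looking straight down.
K = np.array([[500.0, 0.0, 224.0],
              [0.0, 500.0, 224.0],
              [0.0, 0.0, 1.0]])
R_wc = np.diag([1.0, -1.0, -1.0])          # camera z-axis points along world -z
cam_center = np.array([0.0, 0.0, 1.0])
bin_heights = np.linspace(0.0, 0.3, 16)    # 16 bins, 2 cm apart (assumed)

heatmap = np.zeros((448, 448))
heatmap[224, 324] = 1.0                    # peak at row v=224, column u=324
height_logits = np.zeros((448, 448, 16))
height_logits[224, 324, 5] = 1.0           # bin 5 corresponds to z = 0.10 m

point = decode_target(heatmap, height_logits, bin_heights, K, R_wc, cam_center)
print(point)
```

With this setup the pixel 100 columns right of the image center, at a height of 0.10 m, lands at roughly `(0.18, 0.0, 0.10)` in world coordinates. The only geometry involved is one ray-plane intersection, which is why no depth sensor is required once the camera is calibrated.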

### Why this works

- **Dense supervision**: every pixel in every training image gets a gradient signal ("target here / not here"). With 20 demos at 448×448 resolution, that's millions of pixel-level training signals instead of 20 coordinate labels.
- **Locality**: each prediction is a local function of nearby spatial features — exactly what pretrained vision models are good at.
- **Equivariance**: when the object moves in the image, the heatmap shifts with it. Built into the architecture, not learned from data.
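
The first and third points can be checked numerically. The sketch below is illustrative only: it assumes a softmax cross-entropy loss over all pixels and a per-pixel linear head (a 1x1 convolution), with random features standing in for the backbone output. The equivariance check also assumes the backbone features themselves shift with the object, which holds only approximately for real models.

```python
import numpy as np

def heatmap_loss_and_grad(logits, target_vu):
    """Softmax cross-entropy over all pixels; returns (loss, dloss/dlogits)."""
    flat = logits.ravel()
    flat = flat - flat.max()                        # numerical stability
    log_probs = flat - np.log(np.exp(flat).sum())
    target_index = np.ravel_multi_index(target_vu, logits.shape)
    grad = np.exp(log_probs)                        # softmax probabilities
    grad[target_index] -= 1.0                       # gradient is (p - one_hot)
    return -log_probs[target_index], grad.reshape(logits.shape)

rng = np.random.default_rng(0)
height, width, dim = 448, 448, 8                    # dim is a hypothetical size
feats = rng.normal(size=(height, width, dim))       # stand-in backbone features
w = rng.normal(size=dim)                            # per-pixel linear head
logits = feats @ w                                  # (448, 448) heatmap logits

# Dense supervision: every pixel receives a nonzero gradient.
loss, grad = heatmap_loss_and_grad(logits, (100, 200))
pixels_with_gradient = int(np.count_nonzero(grad))
signals_for_20_images = 20 * height * width         # 20 * 448 * 448

# Equivariance: shifting the features shifts the heatmap by the same amount.
shift = (30, -50)
shifted_logits = np.roll(feats, shift, axis=(0, 1)) @ w
is_equivariant = np.allclose(shifted_logits, np.roll(logits, shift, axis=(0, 1)))

print(pixels_with_gradient, signals_for_20_images, is_equivariant)
```

Counting one signal per pixel, 20 images at 448×448 give 4,014,080 training signals, which is where the "millions" figure comes from. The equivariance here follows from the head applying the same weights at every location, so nothing about it has to be learned from data.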

### Results

Same backbone, same data:
- Real robot (20 demos): **97% vs 9%** (PARA vs ACT)
- Previously unseen environment: **94% vs 0%**
- Zero-shot new viewpoint: **52% vs 0%**
- Video backbone adaptation: **90% vs 0%** (PARA head vs global regression on same video features)

## Common Questions

### "How is this different from affordances?"

Affordance maps predict a static "where to interact" heatmap per image. PARA predicts full multi-step trajectories in pixel space: 12 timesteps of pixel locations plus height, gripper state, and rotation. It's a complete action representation, not a grasp point selector.
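
As a rough picture of that action representation, the sketch below stores one trajectory as arrays over 12 timesteps and decodes pixel locations from a stack of per-timestep heatmaps. The class name, field names, gripper encoding, and rotation parametrization are all assumptions for illustration; the source only states which quantities are predicted.

```python
from dataclasses import dataclass

import numpy as np

HORIZON = 12  # timesteps per predicted trajectory, as stated above

@dataclass
class PixelTrajectory:
    """Hypothetical container for one predicted trajectory."""
    pixels: np.ndarray    # (HORIZON, 2) image coordinates (u, v) per timestep
    heights: np.ndarray   # (HORIZON,) height per timestep
    gripper: np.ndarray   # (HORIZON,) gripper state (encoding assumed)
    rotation: np.ndarray  # (HORIZON, R) rotation (parametrization assumed)

    def __post_init__(self):
        if self.pixels.shape != (HORIZON, 2):
            raise ValueError("pixels must have shape (HORIZON, 2)")
        for name in ("heights", "gripper", "rotation"):
            if getattr(self, name).shape[0] != HORIZON:
                raise ValueError(f"{name} must have {HORIZON} timesteps")

def pixels_from_heatmaps(heatmaps):
    """Take the peak of each (H, W) heatmap in a (T, H, W) stack as (u, v)."""
    steps, rows, cols = heatmaps.shape
    flat_peaks = heatmaps.reshape(steps, -1).argmax(axis=1)
    v, u = np.unravel_index(flat_peaks, (rows, cols))
    return np.stack([u, v], axis=1)

# Toy example: the peak drifts down and to the right over the 12 timesteps.
heatmaps = np.zeros((HORIZON, 64, 64))
for t in range(HORIZON):
    heatmaps[t, 10 + t, 20 + 2 * t] = 1.0

trajectory = PixelTrajectory(
    pixels=pixels_from_heatmaps(heatmaps),
    heights=np.full(HORIZON, 0.1),
    gripper=np.ones(HORIZON),
    rotation=np.zeros((HORIZON, 3)),
)
print(trajectory.pixels[0], trajectory.pixels[-1])
```

The contrast with an affordance map is the leading time axis: a single "where to interact" heatmap would correspond to one slice of `heatmaps`, without the per-step height, gripper, and rotation channels.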

### "Isn't this Lift-Splat-Shoot?"

LSS lifts image features *into* 3D BEV space for perception. PARA does the opposite — stays in 2D for action prediction and only lifts at the very end with one geometric step. LSS says "we need 3D representations to perceive well." PARA says "we don't need 3D to *act* well."

### "Isn't this just RVT?"

RVT renders virtual viewpoints and predicts heatmaps in those views, requiring multi-view depth. PARA operates from a single RGB image with no depth or 3D reconstruction. The contribution is showing you don't need the multi-view pipeline — single-view heatmap + height bins gives you the robustness benefits on a $300 arm with a webcam.

### "Is this novel enough?"

The results speak for themselves: 97% vs 9% is not a marginal improvement, it's a qualitative capability difference. The simplicity is the point: one change to the action head, no new backbone, no new data pipeline, and the policy goes from brittle to robust. The best papers in robot learning (ACT, Diffusion Policy) likewise rest on simple ideas with strong empirical impact.
