# Tweet Thread Draft — PARA

## Core Hook

Why do image features generalize beautifully across viewpoints and scenes — but robot policies built on those same features break when you move the camera 10°?

We built an action head that predicts in pixel space, so it stays aligned with the backbone's spatial features instead of collapsing them. The result: policies that actually inherit the robustness of the backbone.

## Thread Structure (draft)

**Tweet 1 (hook + video):**
Modern image features (DINO, SigLIP) are multiview-consistent — they find the same object from any angle. But robot policies built on them still break when you move the camera or shift the object. Why?

[attach: side-by-side ACT failing in new environment vs PARA succeeding]

**Tweet 2 (the gap):**
The problem: standard policies take these rich spatial features and immediately collapse them into a single vector to regress XYZ coordinates. One step undoes everything the backbone learned.

**Tweet 3 (the fix):**
PARA predicts actions in pixel space — a heatmap over the image asking "where should the gripper be?" This hugs the spatial features closely instead of throwing them away. Dense supervision at every pixel, equivariant by construction.

[attach: method diagram or heatmap visualization]

**Tweet 4 (real robot):**
Same backbone, same data. 20 demos.
PARA: 97%. ACT: 9%.
New environment it's never seen: 94% vs 0%.

[attach: real robot video wall — 3 tasks]

**Tweet 5 (viewpoint):**
Trained at one camera angle. PARA holds 62% through 18° of viewpoint shift. ACT collapses to 0%.

[attach: per-theta chart]

**Tweet 6 (video backbone):**
Video models predict future pixels. PARA reads actions off in that same pixel space. PARA: 90%. Global regression on the same video features: 0%.

[attach: rollout grid comparison]

**Tweet 7 (pretraining):**
Because PARA supervises in pixel space, it can pretrain from any video — even without a robot. Pretrain on circle-tracking data, fine-tune with 5 robot demos: 42% vs 0%.

[attach: circle overlay training frames + bar chart]

**Tweet 8 (link):**
Paper: [link]
Project page: [link]
One change to the action head. No new backbone. No depth sensor. Just the right inductive bias.

## Catchy One-Liners (for website/social)

- "Image features generalize. Your action head doesn't. PARA fixes that."
- "Stop throwing away your backbone's spatial structure."
- "Pixel-aligned actions: hug the features, inherit the robustness."
- "One change to the action head: 97% vs 9%."
