\documentclass[10pt]{article}

\usepackage[margin=0.75in]{geometry}
\usepackage{times}
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{amsmath,amssymb}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{caption}
\usepackage{tabularx}
\usepackage{subcaption}
\usepackage{wrapfig}

\title{\vspace{-0.2in}\textbf{PARA: Pixel-Aligned Robot Actions for\\Spatially Robust Manipulation}}
\author{Anonymous Authors}
\date{}

\begin{document}
\maketitle
\vspace{-0.2in}

%% ============================================================
%% ABSTRACT
%% ============================================================
\begin{abstract}
Visuomotor policies trained via behavioral cloning are brittle: modest changes in object placement or camera viewpoint cause dramatic failures, even when the task is unchanged.
We identify \emph{action parameterization} as a key contributor---regressing end-effector coordinates from a pooled image embedding discards spatial structure and couples the policy to viewpoint-specific cues.
We propose \textbf{PARA} (Pixel-Aligned Robot Actions), which predicts actions as dense image-space classifications: a 2D heatmap identifies \emph{where} the end-effector should project in the image, and per-pixel height-bin logits determine \emph{how high} above the support surface.
The 3D target is recovered by intersecting the camera ray with the predicted height plane.
On a real SO-100 robot arm with only 20 demonstrations per task, PARA achieves 97\% on pick-and-place (vs.\ 9\% for coordinate regression), 97\% on towel folding (vs.\ 11\%), and 95\% on table wiping (vs.\ 0\%), while transferring zero-shot to new viewpoints (52\% vs.\ 0\%) and new environments (94\% vs.\ 0\%).
In controlled LIBERO simulation, PARA achieves 54\% on spatial extrapolation where coordinate regression scores 1\%, and 61\% on zero-shot viewpoint transfer (vs.\ 24\%).
Pixel alignment also makes video diffusion models effective action backbones (92\% vs.\ 0\% for global regression on identical features) and enables cross-embodiment transfer via point-track pretraining (66\% vs.\ 10\% from scratch at 10 demos).
\end{abstract}

\vspace{-0.05in}

%% ============================================================
%% FIGURE 1 — OVERVIEW
%% ============================================================
\begin{figure}[t!]
\centering
\includegraphics[width=\linewidth]{figs/svg/para_overview_manual.pdf}
\vspace{-0.15in}
\caption{\textbf{PARA overview.} PARA reformulates end-effector action prediction as a per-pixel heatmap volume over the image. The same formulation transfers across object position, camera viewpoint, and environment changes.}
\label{fig:overview}
\vspace{-0.12in}
\end{figure}

%% ============================================================
%% FIGURE 2 — REAL ROBOT RESULTS (placed early so it floats to top of page 2)
%% ============================================================
\begin{figure}[t!]
\centering
\includegraphics[width=\linewidth]{figs/generated/fig3_realrobot.png}
\vspace{-0.15in}
\caption{\textbf{Real robot results (SO-100, 20 demos per task).} (a)~In-distribution performance: PARA achieves 95--97\% across three tasks; ACT and Motion Tracks fail on precise manipulation. Motion Tracks achieves 61\% on wipe table (a coarse trajectory task). (b)~OOD robustness on pick-and-place: PARA transfers to new viewpoints (52\% zero-shot, 87\% with 5 fine-tuning demos) and new environments (94\%); both baselines collapse.}
\label{fig:real_results}
\vspace{-0.12in}
\end{figure}

%% ============================================================
%% INTRODUCTION
%% ============================================================
\section{Introduction}
\label{sec:intro}

Visuomotor policies trained with behavioral cloning are notoriously brittle: translating an object a few centimeters or repositioning the camera can cause complete failure, even when the underlying task is unchanged.
Foundation models such as DINOv2 now produce spatially rich, shift-equivariant features, yet most policy architectures discard this structure immediately: a global CLS token or spatial average is fed to an MLP that regresses end-effector poses in world coordinates.
This forces the network to implicitly solve correspondence, geometry, and control in a single unstructured output space, encouraging shortcut solutions tied to absolute positions.

Manipulation actions are fundamentally \emph{local} in the image: ``place the teacup'' corresponds to a specific pixel region regardless of viewpoint.
\textbf{PARA} (Pixel-Aligned Robot Actions) restores this locality by predicting a dense heatmap over pixels indicating where the end-effector should project, then classifying height bins at that pixel.
The 3D target is recovered by intersecting the camera ray with the predicted height plane (Figure~\ref{fig:overview}).
This decomposition inherits the spatial equivariance of the encoder: translating the object translates the heatmap, and changing the viewpoint changes the pixel but not the height.

On a real SO-100 robot arm, PARA achieves 95--97\% across three tasks with only 20 demonstrations, while coordinate regression (ACT) scores 0--11\% and motion-track regression scores 5--61\%.
PARA transfers zero-shot to new viewpoints (52\% vs.\ 0\%) and new environments (94\% vs.\ 0\%).
In controlled simulation, PARA outperforms ACT by 20--53 percentage points across six OOD axes.
Pixel alignment also makes video diffusion models effective action backbones (92\% vs.\ 0\% for global regression on the same features) and enables cross-embodiment transfer via point-track pretraining (66\% vs.\ 10\% from scratch).

\paragraph{Contributions.}
\begin{itemize}
    \item A pixel-aligned action formulation that predicts end-effector targets as dense image-space classifications, inheriting encoder equivariance for spatial and viewpoint robustness.
    \item Real-robot experiments on three tasks showing improvements of 86--95 percentage points over coordinate regression, with zero-shot viewpoint and environment transfer.
    \item Controlled simulation experiments isolating the action head across six OOD generalization axes.
    \item A video-to-action recipe where PARA heads on video diffusion features achieve 92\% vs.\ 0\% for global regression.
    \item Cross-embodiment transfer via point-track pretraining, achieving 66\% with 10 demos vs.\ 10\% from scratch.
\end{itemize}

%% ============================================================
%% FIGURE 3 — PER-PIXEL METHOD (placed before Related Work so it floats to top of page 3)
%% ============================================================
\begin{figure}[t!]
\centering
\includegraphics[width=\linewidth]{figs/method_perpixel.pdf}
\vspace{-0.15in}
\caption{\textbf{PARA method.} A DINOv2 backbone produces spatial features; 1$\times$1 conv heads predict a 2D heatmap and per-pixel height-bin logits, which together form a 3D volume over pixels and heights. The argmax over this volume gives a pixel and a height bin, which are lifted to a 3D end-effector target in the robot frame by intersecting the camera ray with the predicted height plane, using the known camera calibration.}
\label{fig:method_perpixel}
\vspace{-0.12in}
\end{figure}

%% ============================================================
%% RELATED WORK
%% ============================================================
\section{Related Work}
\label{sec:related}

\paragraph{Visuomotor policy learning.}
Behavioral cloning from RGB typically predicts actions in robot coordinates.
ACT predicts action chunks from a CLS-token representation; Diffusion Policy models the action distribution with denoising.
These approaches operate in coordinate space and are sensitive to shifts in viewpoint and object position.

\paragraph{Pixel-aligned prediction.}
Dense, spatially aligned outputs are standard in 3D vision (depth, flow, correspondence).
Transporter Networks and CLIPort use dense pick-and-place predictions but are limited to top-down planar tasks.
PARA extends pixel-aligned prediction beyond the plane to 3D end-effector targets via per-pixel height-bin classification.

\paragraph{Point tracking for robot control.}
Recent approaches use predicted 2D point trajectories as intermediate representations but still regress global coordinates from tracked points.
We compare against a motion-track baseline and show dense pixel classification provides stronger robustness.

\paragraph{Video models for robot control.}
UniPi and SuSIE generate goal-conditioned video and extract actions via inverse dynamics.
PARA heads read actions directly from video diffusion features without a separate inverse model.


%% ============================================================
%% METHOD
%% ============================================================
\section{Method}
\label{sec:method}

\subsection{Problem Setup}
We consider behavioral cloning from demonstrations $\mathcal{D} = \{(I_t, a_t, K, T_{\text{cam}})\}$, where $I_t$ is an RGB image, $a_t$ is the end-effector action (3D position + gripper state), $K$ is the camera intrinsic matrix, and $T_{\text{cam}}$ is the extrinsic transform.
We assume a known support surface defining a world-frame height axis.
The policy predicts the next $N_W$ actions from a single image.

\subsection{Pixel-Aligned Heatmap Volume}

\begin{wrapfigure}{r}{0.33\linewidth}
\vspace{-0.6em}
\centering
\includegraphics[width=\linewidth]{figs/method_height.pdf}
\caption{\textbf{Height is view-invariant.} For the same physical target, depth changes with camera position (0.38\,m vs.\ 0.50\,m), but height above the table is constant. PARA predicts height, making the lifting step camera-invariant.}
\label{fig:method_height}
\vspace{-1.0em}
\end{wrapfigure}

PARA decomposes action prediction into three pixel-space classification problems (Figure~\ref{fig:method_perpixel}).

\paragraph{2D localization.}
A vision encoder $f_\theta$ produces spatial features $F \in \mathbb{R}^{H' \times W' \times C}$.
A 1$\times$1 convolutional head, whose output is bilinearly upsampled to $(H, W)$, produces heatmap logits $Z \in \mathbb{R}^{N_W \times H \times W}$, with one heatmap $Z_k$ per predicted timestep $k$.
Supervision is cross-entropy over the flattened $H \times W$ grid:
\begin{equation}
\mathcal{L}_{\text{spatial}} = -\frac{1}{N_W} \sum_{k=1}^{N_W} \log \frac{\exp(Z_k[u_k^*, v_k^*])}{\sum_{u,v} \exp(Z_k[u,v])},
\label{eq:spatial_loss}
\end{equation}
where $p_k^* = (u_k^*, v_k^*)$ is the ground-truth pixel obtained by projecting the demonstrated end-effector position.
At inference, $\hat{p}_k = \arg\max_{u,v}\, Z_k[u,v]$.
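As a sketch of how $p_k^*$ can be computed, assume a pinhole camera with focal lengths $f_x, f_y$ and principal point $(c_x, c_y)$ read from $K$, and assume $T_{\text{cam}} = [R \mid t]$ maps world to camera coordinates (this direction convention is an assumption; the transform must be inverted if the extrinsics are stored camera-to-world).
For a demonstrated end-effector position $x_k^* \in \mathbb{R}^3$ in the world frame,
\begin{equation*}
\begin{bmatrix} x^c \\ y^c \\ z^c \end{bmatrix} = R\, x_k^* + t,
\qquad
u_k^* = \operatorname{round}\!\Big(f_x \frac{x^c}{z^c} + c_x\Big),
\quad
v_k^* = \operatorname{round}\!\Big(f_y \frac{y^c}{z^c} + c_y\Big),
\end{equation*}
which is valid when $z^c > 0$ and the resulting pixel lies inside the $H \times W$ image; timesteps whose projection falls outside the image would need separate handling.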

\paragraph{Height prediction.}
A second head produces per-pixel logits over $N_H$ height bins: $H_{\text{vol}} \in \mathbb{R}^{N_W \times N_H \times H \times W}$.
During training, height loss is evaluated at the ground-truth pixel (teacher forcing):
\begin{equation}
\mathcal{L}_{\text{height}} = -\frac{1}{N_W} \sum_{k=1}^{N_W} \log \frac{\exp(H_{\text{vol},k}[h_k^*, u_k^*, v_k^*])}{\sum_{j=1}^{N_H} \exp(H_{\text{vol},k}[j, u_k^*, v_k^*])}.
\end{equation}
At inference, height logits are read at the predicted pixel $\hat{p}_k$.
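One plausible decoding, assuming the $N_H$ bins uniformly partition a height range $[h_{\min}, h_{\max}]$ above the support surface (both bounds are hypothetical symbols introduced here), maps the most likely bin to its center:
\begin{equation*}
\hat{\jmath}_k = \arg\max_{j \in \{1,\dots,N_H\}} H_{\text{vol},k}[j, \hat{p}_k],
\qquad
\hat{h}_k = h_{\min} + \big(\hat{\jmath}_k - \tfrac{1}{2}\big)\,\Delta,
\qquad
\Delta = \frac{h_{\max} - h_{\min}}{N_H}.
\end{equation*}
Under this scheme the discretization error in height is at most $\Delta/2$; for example, 32 bins over an illustrative 32\,cm range give $\Delta = 1$\,cm and at most 0.5\,cm of error from binning alone.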

\paragraph{Gripper prediction.}
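A third head predicts per-pixel gripper-state logits over $N_G$ bins: $G \in \mathbb{R}^{N_W \times N_G \times H \times W}$. It is trained with the same teacher-forced cross-entropy as the height head, which defines $\mathcal{L}_{\text{gripper}}$, and is read at $\hat{p}_k$ at inference.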

\subsection{3D Recovery via Height-Plane Intersection}
\label{sec:lifting}

Given predicted pixel $\hat{p}_k = (u_k, v_k)$ and height $\hat{h}_k$, we recover the 3D target by intersecting the camera ray through $(u_k, v_k)$ with the plane $z = \hat{h}_k$ in world coordinates.
Predicting height rather than depth is key: height is defined in the world frame (distance above the table) and is invariant to camera position, while depth changes with viewpoint for the same physical point (Figure~\ref{fig:method_height}).
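The intersection has a closed form. As a sketch, assume $T_{\text{cam}} = [R \mid t]$ maps world to camera coordinates and the world $z$-axis is the height axis normal to the support surface (both conventions are assumptions). The camera center $o$ and the direction $d_k$ of the ray through pixel $(u_k, v_k)$, both expressed in the world frame, are
\begin{equation*}
o = -R^\top t, \qquad d_k = R^\top K^{-1} \begin{bmatrix} u_k \\ v_k \\ 1 \end{bmatrix},
\end{equation*}
and the target is the point on the ray whose height equals $\hat{h}_k$:
\begin{equation*}
\lambda_k = \frac{\hat{h}_k - o_z}{d_{k,z}}, \qquad \hat{x}_k = o + \lambda_k\, d_k .
\end{equation*}
The solution is well defined when $d_{k,z} \neq 0$ (the ray is not parallel to the table) and physically meaningful when $\lambda_k > 0$ (the point lies in front of the camera). A height error $\delta h$ moves the target by $\delta h\,\lVert d_k \rVert / \lvert d_{k,z} \rvert$ along the ray, so the lifting is best conditioned for steep viewing angles and degrades as the camera looks along the table.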

\subsection{Start-Keypoint Conditioning}
The current end-effector position is projected into the image and a learnable embedding is added to the corresponding patch token, providing spatial grounding without explicit robot state input.
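A minimal sketch of this conditioning, assuming the embedding is a single learnable vector $e_{\text{start}} \in \mathbb{R}^{C}$ added to a patch-token grid $X \in \mathbb{R}^{H' \times W' \times C}$ with patch stride $s$ (the symbols $e_{\text{start}}$, $X$, and $s$, and the layer at which the addition happens, are assumptions): if the current end-effector projects to pixel $(u_0, v_0)$, then
\begin{equation*}
(i_0, j_0) = \big(\lfloor v_0 / s \rfloor,\ \lfloor u_0 / s \rfloor\big),
\qquad
X[i_0, j_0] \leftarrow X[i_0, j_0] + e_{\text{start}} .
\end{equation*}
For instance, with $s = 16$ a projection at $(u_0, v_0) = (200, 310)$ falls in patch row $\lfloor 310/16 \rfloor = 19$ and patch column $\lfloor 200/16 \rfloor = 12$.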

\subsection{Training Details}
\label{sec:training}

For backbone experiments, we use DINOv2 ViT-S/16, producing $28 \times 28$ patch features for $448 \times 448$ inputs.
Heads are 1$\times$1 convolutions upsampled bilinearly.
Total loss: $\mathcal{L} = \mathcal{L}_{\text{spatial}} + \mathcal{L}_{\text{height}} + \mathcal{L}_{\text{gripper}}$.
Hyperparameters: $N_W = 12$ timesteps, $N_H = 32$ height bins, $N_G = 32$ gripper bins.
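For concreteness, these settings imply the following sizes, assuming the heads are upsampled to the input resolution $(H, W) = (448, 448)$ and each head is a single 1$\times$1 convolution that emits all timesteps at once (both are assumptions about the layout). The $28 \times 28$ feature grid contains $28^2 = 784$ patch tokens, and each spatial softmax in Eq.~\eqref{eq:spatial_loss} ranges over $448^2 = 200{,}704$ pixel classes. The output channel counts are then
\begin{equation*}
\underbrace{N_W}_{\text{heatmap}} = 12, \qquad
\underbrace{N_W N_H}_{\text{height}} = 12 \cdot 32 = 384, \qquad
\underbrace{N_W N_G}_{\text{gripper}} = 12 \cdot 32 = 384,
\end{equation*}
for $780$ channels in total.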


%% ============================================================
%% EXPERIMENTS
%% ============================================================
\section{Experiments}
\label{sec:experiments}

We evaluate PARA on four fronts: real-robot manipulation (Section~\ref{sec:real_robot}), controlled simulation OOD analysis (Section~\ref{sec:sim_ood}), video diffusion as policy backbone (Section~\ref{sec:video_results}), and cross-embodiment transfer via point-track pretraining (Section~\ref{sec:pretrain}).
In all experiments, PARA and baselines share the same vision backbone and training data, isolating action parameterization.

\paragraph{Baselines.}
\textbf{ACT} (Action Chunking with Transformers): CLS token from the shared DINOv2 backbone fed to an MLP regressing $(x,y,z)$ + gripper directly---standard coordinate regression.
\textbf{Motion Tracks}: predicts 2D point tracks across frames and regresses end-effector coordinates from tracked positions.

%% ----- 4.1 REAL ROBOT -----
\subsection{Real Robot Experiments}
\label{sec:real_robot}

\paragraph{Setup.}
We evaluate on an SO-100 robot arm with a single wrist-mounted RGB camera.
All methods use DINOv2 ViT-S/16, trained on 20 kinesthetic demonstrations per task from a single viewpoint, with no data augmentation.
Three tasks: \emph{pick and place} (teacup on saucer), \emph{wipe table}, and \emph{fold towel}.

\paragraph{In-distribution performance.}
Figure~\ref{fig:real_results}a reports task completion rates.
PARA achieves 95--97\% across all three tasks.
ACT achieves at most 11\%, failing to reach correct locations despite identical visual features.
Motion Tracks achieves 61\% on wipe table (a coarse sweeping task) but only 5\% on precise pick-and-place, indicating that sparse point tracking helps with gross motion but lacks precision for fine manipulation.

\paragraph{Out-of-distribution transfer.}
Figure~\ref{fig:real_results}b tests generalization on pick-and-place.
For \emph{zero-shot viewpoint transfer} (camera repositioned, no additional data): PARA 52\%, both baselines 0\%.
With 5 fine-tuning demonstrations at the new viewpoint: PARA 87\%, ACT 4\%, Motion Tracks 0\%.
For \emph{new environment} (different table, background, lighting): PARA 94\%, ACT 0\%, Motion Tracks 6\%.
We attribute this transfer to PARA's predictions depending on local object appearance rather than global scene features.


%% ----- 4.2 CONTROLLED SIMULATION -----
\subsection{OOD Analysis in Simulation}
\label{sec:sim_ood}

The real-robot results show practical impact, but confounds make it hard to isolate \emph{why} PARA helps.
We use LIBERO to run controlled experiments where PARA and ACT share identical backbones, data, and evaluation, differing only in the action head.

\paragraph{Setup.}
Task: pick-and-place (bowl on plate), LIBERO spatial task~0.
Teleport servo execution isolates action prediction from controller dynamics.
Object-position dataset: $16 \times 16$ grid, $39 \times 60$\,cm, 256 demos.
Viewpoint dataset: $8 \times 8$ grid ($\theta \in [0^\circ, 25^\circ]$, $\phi \in [0^\circ, 315^\circ]$), 10 demos per viewpoint, 640 total.
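Assuming both grids are uniformly spaced with endpoints included (an assumption, though it reproduces the $14.3^\circ$ and $17.9^\circ$ values reported in Figure~\ref{fig:ood}b), the sampled viewpoint angles are
\begin{equation*}
\theta_i = \frac{25^\circ}{7}\, i \in \{0^\circ, 3.6^\circ, 7.1^\circ, 10.7^\circ, 14.3^\circ, 17.9^\circ, 21.4^\circ, 25^\circ\},
\qquad
\phi_j = 45^\circ j, \qquad i, j \in \{0, \dots, 7\},
\end{equation*}
giving $8 \times 8 = 64$ grid viewpoints and $64 \times 10 = 640$ demonstrations. Under the same assumption, the object-position grid has spacings of $39/15 = 2.6$\,cm and $60/15 = 4$\,cm along its two axes.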

%% ============================================================
%% FIGURE 4 — OOD ANALYSIS
%% ============================================================
\begin{figure}[t!]
\centering
\includegraphics[width=\linewidth]{figs/generated/fig4_ood.png}
\vspace{-0.15in}
\caption{\textbf{Controlled OOD analysis (LIBERO simulation).} (a)~Spatial generalization: (i)~train/test position distribution, (ii)~success vs.\ distance from training boundary---PARA degrades gracefully while ACT collapses, (iii)~qualitative comparison at the same OOD position. (b)~Viewpoint generalization: (i)~polar plot of train (green) vs.\ test (blue) viewpoints, (ii)~per-$\theta$ success---PARA holds ${\sim}62\%$ through $17.9^\circ$ while ACT drops to 0\% beyond $14.3^\circ$, (iii)~qualitative comparison at an OOD viewpoint.}
\label{fig:ood}
\vspace{-0.12in}
\end{figure}

\paragraph{OOD object position.}
Train on one half of the position grid, test on the other (Figure~\ref{fig:ood}a).
Left-to-right extrapolation: PARA 54\%, ACT 1\%.
Near-to-far: PARA 46\%, ACT 7\%.
ACT reaches toward memorized training positions; PARA's heatmap tracks the object.
Figure~\ref{fig:ood}a(ii) shows success as a function of distance from the training boundary: PARA degrades gradually while ACT collapses immediately.

\paragraph{OOD camera viewpoint.}
Both models trained at default viewpoint ($\theta = 0^\circ$), tested across the full grid (Figure~\ref{fig:ood}b).
PARA 61\% across all viewpoints; ACT 24\%.
Figure~\ref{fig:ood}b(ii) shows the per-$\theta$ breakdown: PARA maintains ${\sim}62\%$ through $\theta = 17.9^\circ$; ACT degrades monotonically and collapses to 0\% beyond $14.3^\circ$.
Hemisphere transfer (train left, test right): PARA 40\%, ACT 10\%.

\paragraph{Data efficiency and distractors.}
With $N{=}32$ corner demonstrations: PARA 54\%, ACT 33\%.
With dense coverage ($N{=}64$), ACT catches up (ACT 71\% vs.\ PARA 68\%), indicating that PARA's advantage is specific to the OOD and low-data regimes.
Distractor robustness (clean train, cluttered test): PARA 60\%, ACT 40\%.

\paragraph{Failure modes.}
ACT fails by reaching to memorized locations (wrong position).
PARA fails on gripper timing (correct reach, drops during transport).


%% ----- 4.3 VIDEO BACKBONE -----
\subsection{Video Diffusion as Policy Backbone}
\label{sec:video_results}

Video diffusion models produce spatially aligned features that PARA can exploit directly.
We attach PARA heads to the UNet of Stable Video Diffusion (SVD, 7 frames at $576 \times 320$), using concatenated features from decoder up-blocks at $64 \times 64$ resolution.

\paragraph{Two-stage training.}
We pretrain the SVD model for 4K steps (diffusion loss only), then jointly fine-tune with PARA heads for 3K steps using separate learning rates (UNet: $10^{-6}$, PARA: $10^{-4}$).
This achieves 92\% task success, outperforming joint-from-scratch (55\% at 10K steps) with less total compute.

\paragraph{Co-adaptation is essential.}
Frozen video backbone + PARA heads: 0\%.
Video features are spatially informative but not action-relevant without fine-tuning.

\paragraph{PARA vs.\ global regression.}
Replacing PARA heads with global average pooling + MLP on the \emph{same} UNet features with the \emph{same} two-stage training: 0\%.
This is the clearest evidence that pixel alignment, not just strong features, enables video-to-action transfer (Figure~\ref{fig:video}).

%% ============================================================
%% FIGURE 5 — VIDEO BACKBONE
%% ============================================================
\begin{figure}[t!]
\centering
\includegraphics[width=\linewidth]{figs/generated/fig5_video.png}
\vspace{-0.15in}
\caption{\textbf{Video diffusion as policy backbone.} (a)~The same SVD UNet features are fed to two action heads: global regression (average pool $\to$ MLP, collapses spatial structure) vs.\ PARA (conv $\to$ argmax, preserves spatial structure). (b)~Rollout results with 20 demos: PARA achieves 92\% task success; global regression scores 0\%. The \emph{only} difference is the action head.}
\label{fig:video}
\vspace{-0.12in}
\end{figure}


%% ----- 4.4 POINT-TRACK PRETRAINING -----
\subsection{Cross-Embodiment Transfer via Point-Track Pretraining}
\label{sec:pretrain}

PARA's pixel-aligned prediction has a natural connection to point tracking: both reason about \emph{where things are in the image}.
We exploit this by pretraining the PARA backbone on videos with circle overlays---the robot is masked out and replaced with an orange circle at the end-effector position (Figure~\ref{fig:pretrain}a).
This teaches the model to track points across frames without requiring robot-specific action labels, enabling pretraining on diverse embodiments.

\paragraph{Few-shot fine-tuning.}
We pretrain on circle-overlay videos from one embodiment, then fine-tune with PARA heads on a target robot with limited demonstrations.
Figure~\ref{fig:pretrain}b shows results on LIBERO with varying numbers of fine-tuning demos.
At 10 demonstrations, pretrained PARA achieves 66\% vs.\ 10\% from scratch, a $6.6\times$ improvement.
ACT benefits less from the same pretraining (23\% pretrained vs.\ 10\% from scratch, a $2.3\times$ improvement), suggesting that pixel-aligned prediction is better positioned to exploit correspondence-based pretraining.

%% ============================================================
%% FIGURE 6 — POINT-TRACK PRETRAINING
%% ============================================================
\begin{figure}[t!]
\centering
\includegraphics[width=\linewidth]{figs/generated/fig6_pretrain.png}
\vspace{-0.15in}
\caption{\textbf{Cross-embodiment transfer via point-track pretraining.} (a)~Pretraining data: circle overlays mark the end-effector on training videos with the robot masked out, enabling embodiment-agnostic point-tracking supervision. (b)~Fine-tuning on LIBERO: PARA pretrained achieves 66\% with 10 demos vs.\ 10\% from scratch; ACT benefits less from the same pretraining. (c)~Qualitative rollouts after 5-demo fine-tuning on a different embodiment.}
\label{fig:pretrain}
\vspace{-0.12in}
\end{figure}


%% ============================================================
%% DISCUSSION
%% ============================================================
\section{Discussion}
\label{sec:discussion}

\paragraph{Why does pixel-aligned prediction help?}
Coordinate regression maps a global image representation to a global 3D target---a mapping that changes with every shift in camera, position, or layout.
PARA decomposes this into \emph{where in the image} (inherits encoder equivariance) and \emph{at what height} (invariant by construction).
With only 20 demonstrations, this decomposition keeps action prediction tractable, whereas global regression can only memorize the training mapping and fails once the camera, object position, or layout shifts.

\paragraph{When does coordinate regression suffice?}
With dense coverage ($N{=}64$ uniform positions), ACT achieves 71\% vs.\ PARA's 68\%.
PARA's advantage is specifically in the OOD and low-data regimes---precisely where real-robot learning operates.

\paragraph{Limitations.}
Simulation experiments use a single LIBERO task; real-robot experiments cover three tasks on one embodiment (SO-100, a low-cost arm).
PARA assumes a known support surface for height-based lifting.
Teleport servo in simulation bypasses controller dynamics.
Per-position evaluation uses only 5 episodes, so individual per-position success rates have high variance; aggregate rates pool many positions and are more reliable.


%% ============================================================
%% CONCLUSION
%% ============================================================
\section{Conclusion}
\label{sec:conclusion}

We presented PARA, a pixel-aligned action formulation that predicts robot actions as dense image-space classifications.
On a real robot, PARA achieves 95--97\% across three tasks with 20 demonstrations and transfers to new viewpoints and environments where baselines collapse.
Controlled simulation experiments confirm the advantage stems from action parameterization.
Pixel alignment makes video diffusion models effective action backbones (92\% vs.\ 0\% for global regression) and enables cross-embodiment transfer via point-track pretraining (66\% vs.\ 10\% from scratch).
These results suggest that \emph{how} actions are parameterized matters as much as \emph{how} images are encoded.



\end{document}
