# TRI Meeting with Sergey — April 14, 2026

## Vision
Build a foundation model for pixel-aligned robot actions that goes beyond action primitives — leveraging geometric backbones, multi-view reasoning, large-scale pretraining, and mobile manipulation to create a system that actually works robustly in the real world.

---

## Research Threads

### 1. Geometric Backbones for Better Lifting

![DUSt3R pointmap example](https://raw.githubusercontent.com/naver/dust3r/main/assets/dust3r.jpg)

Explore replacing PARA's height-bin lifting with learned geometric models:
- **DUSt3R / MASt3R**: dense pointmaps from image pairs, giving per-pixel 3D without camera calibration
- **MoGe**: monocular geometry estimation, single-image pointmaps
- **Depth Anything / Metric3D**: monocular metric depth

Integration: PARA predicts a 2D heatmap (where in the image), then looks up the 3D point at that pixel in the geometric model's output (instead of using height bins). This removes the flat-table assumption and enables shelf, drawer, and other non-planar manipulation.
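
A minimal sketch of the lookup step, assuming the geometric model has already produced an `(H, W, 3)` pointmap that is metric and expressed in the robot frame (scale and frame alignment are not handled here). The function name `lift_heatmap_with_pointmap` and the toy arrays are illustrative, not part of PARA or any geometric model's API.

```python
import numpy as np

def lift_heatmap_with_pointmap(heatmap, pointmap):
    """Take the heatmap peak and read the 3D point stored at that pixel.

    heatmap:  (H, W) scores over image pixels.
    pointmap: (H, W, 3) per-pixel 3D points (assumed metric, robot frame).
    Returns ((u, v), xyz) where u is the column and v is the row.
    """
    assert heatmap.shape == pointmap.shape[:2]
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (int(u), int(v)), pointmap[v, u]

# Toy data: a heatmap with one peak and a synthetic pointmap.
H, W = 4, 6
heatmap = np.zeros((H, W))
heatmap[2, 5] = 1.0
vs, us = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
pointmap = np.stack([us, vs, np.full((H, W), 2.0)], axis=-1).astype(float)

pixel, xyz = lift_heatmap_with_pointmap(heatmap, pointmap)
```

Because the lookup is just an index into the pointmap, swapping in a different geometric model only changes how `pointmap` is produced.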

Key question: does pixel-aligned prediction + geometric lifting beat direct 3D regression from the same geometric features?

---

### 2. Multi-View Cost Volume Integration

![Multi-view stereo cost volume](https://raw.githubusercontent.com/ArthasMil/AACVP-MVSNet/main/imgs/NetwordStructure.jpg)

Aggregate heatmap predictions from multiple cameras into a 3D cost volume:
- Each camera produces a per-pixel heatmap volume along its rays
- Fuse in shared 3D space (known extrinsics) → intersection gives sharp 3D target
- Connects to MVS literature (MVSNet, etc.) — multi-view stereo but for actions

Different from RVT: predictions are made in the real camera viewpoints and then fused, with no virtual rendering. This works with arbitrary camera configurations and resolves the depth ambiguity that single-view PARA cannot.
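
One way the fusion could look, as a sketch under stated assumptions: pinhole intrinsics `K`, known world-to-camera extrinsics `T_cam_world`, a fixed grid of candidate 3D points, nearest-pixel sampling, and a sum of log-scores as the fusion rule. All names and the fusion rule are illustrative choices, not a confirmed design.

```python
import numpy as np

def project(points, K, T_cam_world):
    """Project (N, 3) world points; returns pixels (N, 2) and depths (N,)."""
    homog = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (T_cam_world @ homog.T).T[:, :3]
    depth = cam[:, 2]
    safe = np.where(depth > 1e-9, depth, 1e-9)  # avoid division by zero
    pix = (K @ cam.T).T[:, :2] / safe[:, None]
    return pix, depth

def fuse_heatmaps(grid_points, heatmaps, Ks, Ts, eps=1e-6):
    """Sum log-heatmap values sampled where each candidate point projects."""
    score = np.zeros(len(grid_points))
    for hm, K, T in zip(heatmaps, Ks, Ts):
        pix, depth = project(grid_points, K, T)
        u = np.rint(pix[:, 0]).astype(int)
        v = np.rint(pix[:, 1]).astype(int)
        h, w = hm.shape
        valid = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        p = np.full(len(grid_points), eps)  # floor for out-of-view points
        p[valid] = np.maximum(hm[v[valid], u[valid]], eps)
        score += np.log(p)
    return grid_points[np.argmax(score)], score

def gaussian(h, w, center, sigma=2.0):
    """Synthetic heatmap peaked at center = (u, v), standing in for a prediction."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma**2))

# Toy setup: camera 1 looks along world +z, camera 2 looks along world -x.
K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
T1 = np.eye(4)
T2 = np.eye(4)
T2[:3, :3] = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]
T2[:3, 3] = [-2, 0, 2]
target = np.array([0.2, -0.2, 2.0])
heatmaps = [gaussian(64, 64, project(target[None], K, T)[0][0]) for T in (T1, T2)]

axis_xy = np.linspace(-0.4, 0.4, 9)
axis_z = np.linspace(1.6, 2.4, 9)
grid = np.stack(np.meshgrid(axis_xy, axis_xy, axis_z, indexing="ij"), -1).reshape(-1, 3)
best, score = fuse_heatmaps(grid, heatmaps, [K, K], [T1, T2])
```

In this toy case each single view leaves points along its ray nearly tied, and only the summed score singles out the target, which is the depth-ambiguity argument above in miniature.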

---

### 3. Large-Scale Egocentric & Point Track Pretraining

![Ego4D egocentric manipulation](https://ego4d-data.org/assets/images/h+o.png)

Pretrain PARA's heatmap objective on massive egocentric video data:
- **Data sources:** Ego4D, Epic-Kitchens, Something-Something, DROID
- **Pretraining signal:** point tracks from CoTracker/TAPIR give pixel trajectories → direct heatmap supervision
- **The advantage:** PARA's pixel-aligned objective is embodiment-agnostic — supervision is "where does the hand go in the image," which transfers across human hands, robot grippers, any manipulator
- **Scale:** millions of video clips → pretrain universal "interaction point predictor" → fine-tune on specific robot with minimal data
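
The heatmap supervision from point tracks can be sketched as below. The input layout (a `(T, 2)` array of pixel coordinates plus a per-frame visibility flag) is an assumed generic format, not the output API of CoTracker or TAPIR, and the Gaussian target with `sigma=2.0` is an illustrative choice.

```python
import numpy as np

def track_to_heatmaps(track_xy, visible, height, width, sigma=2.0):
    """Turn one point track into per-frame heatmap targets.

    track_xy: (T, 2) pixel positions as (x, y).
    visible:  (T,) booleans; occluded frames get an all-zero target
              so that a training loss can mask them out.
    Returns (T, height, width) targets, each visible frame summing to 1.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    targets = np.zeros((len(track_xy), height, width))
    for t, ((x, y), vis) in enumerate(zip(track_xy, visible)):
        if not vis:
            continue
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma**2))
        targets[t] = g / g.sum()
    return targets

# Toy track: visible in the first two frames, occluded in the third.
track = np.array([[3.0, 4.0], [10.0, 5.0], [0.0, 0.0]])
visible = np.array([True, True, False])
targets = track_to_heatmaps(track, visible, height=16, width=16)
```

Nothing in the target depends on what produced the track, which is the embodiment-agnostic property the bullets above rely on.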

Current preliminary work: an arm-deleted pretraining experiment on cross-embodiment transfer (results incoming from backbones agent).

---

### 4. Video Model Pretraining for Robot Policy

![Stable Video Diffusion frames](https://raw.githubusercontent.com/Stability-AI/generative-models/main/assets/000.jpg)

Scale up the video-to-policy pipeline:
- Current result: SVD + PARA achieves 90% vs. 0% for global regression on the same features
- **Scale up:** larger video models (Cosmos, Sora-class), more diverse training data, longer horizon generation
- **PARA as the natural bridge:** video models predict pixels, PARA reads actions in pixel space — this alignment improves with scale
- **Foundation model angle:** pretrain one large video model on diverse robot data, attach PARA heads per-task with minimal fine-tuning
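
To make "reading actions in pixel space" concrete, here is a hypothetical per-pixel readout head: a linear map over feature channels followed by a spatial softmax. The shapes, the single-linear-layer head, and the names `heatmap_head` and `expected_pixel` are assumptions for illustration; the actual PARA head and the video model's feature layout may differ.

```python
import numpy as np

def heatmap_head(features, weight, bias=0.0):
    """Per-pixel linear readout followed by a softmax over all pixels.

    features: (H, W, C) feature map (assumed to come from a frozen video model).
    weight:   (C,) readout weights, the only per-task parameters in this sketch.
    Returns an (H, W) heatmap that sums to 1.
    """
    logits = features @ weight + bias
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def expected_pixel(heatmap):
    """Soft-argmax: the heatmap-weighted mean pixel as (u, v)."""
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return float((heatmap * xs).sum()), float((heatmap * ys).sum())

# Toy features with one strongly activated location at row 3, column 7.
features = np.zeros((8, 12, 4))
features[3, 7, 0] = 20.0
weight = np.array([1.0, 0.0, 0.0, 0.0])

heatmap = heatmap_head(features, weight)
u, v = expected_pixel(heatmap)
```

The head keeps the spatial layout of the features, whereas a global regression head would pool it away before predicting coordinates.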

---

### 5. Mobile Manipulation

![Mobile manipulator example](https://docs.hello-robot.com/0.3/getting_started/images/stretch3_full.png)

Extend PARA to mobile manipulation platforms:
- **Navigation:** use posed imagery of the scene + keypoints/pointmaps from DUSt3R for navigation planning
- **Manipulation:** PARA for the manipulation part (pixel-aligned, viewpoint-robust)
- **Unified representation:** PARA could predict targets for both "where to drive" and "where to grasp" in the same pixel space
- **PARA's viewpoint robustness is critical here:** camera pose changes continuously as the robot moves — height-based lifting handles this naturally (height is world-frame, not camera-frame)
- **Platforms:** Stretch, DexMate, or TRI's own mobile platforms
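
The viewpoint-robustness point can be illustrated with a ray-plane intersection: a pixel is lifted by intersecting its camera ray with a horizontal plane at a world-frame height. This is a sketch assuming a pinhole camera with intrinsics `K`, a known camera-to-world pose `T_world_cam`, and a world frame whose z axis points up; the function name is illustrative and a single height stands in for PARA's height bins.

```python
import numpy as np

def lift_pixel_to_height(u, v, K, T_world_cam, height):
    """Intersect the ray through pixel (u, v) with the world plane z = height."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    R, origin = T_world_cam[:3, :3], T_world_cam[:3, 3]
    ray_world = R @ ray_cam
    if abs(ray_world[2]) < 1e-9:
        raise ValueError("ray is parallel to the height plane")
    s = (height - origin[2]) / ray_world[2]
    if s <= 0:
        raise ValueError("height plane is behind the camera")
    return origin + s * ray_world

def downward_camera(position):
    """Camera-to-world pose for a camera looking straight down (toy setup)."""
    T = np.eye(4)
    T[:3, :3] = [[1, 0, 0], [0, -1, 0], [0, 0, -1]]
    T[:3, 3] = position
    return T

K = np.array([[100.0, 0, 64], [0, 100.0, 48], [0, 0, 1]])

# The same world target seen from two base positions lands on different pixels,
# but lifting with the world-frame height recovers the same 3D point.
p_a = lift_pixel_to_height(84, 48, K, downward_camera([0.0, 0.0, 1.5]), height=0.5)
p_b = lift_pixel_to_height(74, 48, K, downward_camera([0.1, 0.0, 1.5]), height=0.5)
```

The height argument never changes as the base moves; only the pose passed in does, so the lifting depends on how accurately the camera pose is known.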

---

## Unifying Story

PARA separates **"what to interact with"** (2D pixel prediction — transferable, equivariant, pretrainable at scale) from **"how to get there in 3D"** (geometric lifting — modular, per-setup). This decomposition means:

1. The 2D prediction benefits from all the above threads (better backbones, more pretraining data, video models)
2. The 3D lifting benefits from Sergey's group's geometric expertise (pointmaps, cost volumes, multi-view)
3. Together: a foundation model that predicts robust pixel-aligned actions from any view, lifts to 3D with the best available geometry, and scales with data

The goal is not another action primitive model — it's a system where adding more data, more views, and better geometry all make the policy better, because the architecture is designed for it.
