# Cosmos-Predict2.5 on DROID

A minimal example that runs Cosmos-Predict2.5 on a sample clip from the DROID dataset (used for later finetuning).

## Model note

- **2B distilled**: Text2World only (text-to-video). No image/video conditioning. Use for fast text-only generation.
- **2B post-trained**: Full base model — Text2World, Image2World, Video2World. Use for image/video conditioning (e.g. DROID clips).

This example uses **2B/post-trained** with **Video2World** so we condition on a DROID video clip.

## Setup (one-time)

1. **Conda env and install** (from repo root):

   ```bash
   conda activate cosmos-predict
   export UV_CACHE_DIR=/data/cameron/vidgen/.cache/uv
   export HF_HOME=/data/cameron/vidgen/.cache/huggingface
   uv sync --extra=cu128 --active --inexact
   ```

   If you hit a disk quota error in your home directory, make sure `UV_CACHE_DIR` and `HF_HOME` point to a directory on a volume with free space (e.g. under `/data/cameron/vidgen`).

2. **Hugging Face** (for checkpoint download):

   - Install CLI: `uv tool install -U "huggingface_hub[cli]"`
   - Login: `hf auth login`
   - Accept [NVIDIA Open Model License](https://huggingface.co/nvidia/Cosmos-Guardrail1) and ensure you have access to the Cosmos-Predict2.5 checkpoints.

## Run minimal example on DROID

From the repo root, with the `cosmos-predict` conda env active:

```bash
./run_droid_example.sh
```

Or manually:

```bash
python examples/inference.py \
  -i assets/droid/droid_sample.json \
  -o outputs/droid_video2world \
  --model=2B/post-trained \
  --inference-type=video2world
```

Outputs are written to `outputs/droid_video2world/`. The input clip is a single DROID MP4 from `/data/weiduoyuan/droid_raw/1.0.1/...`.

To use another DROID clip, edit `assets/droid/droid_sample.json`: set `input_path` to the MP4 path and adjust `prompt` as needed.
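
As a sketch, this overwrites the sample config with a new clip, assuming the two-field schema described above (`input_path` + `prompt`); the clip path and prompt text below are placeholders, not real entries:

```shell
# Point the sample config at a different DROID clip.
# NOTE: placeholder path/prompt -- substitute a real MP4 and description.
mkdir -p assets/droid
cat > assets/droid/droid_sample.json <<'EOF'
{
  "input_path": "/data/weiduoyuan/droid_raw/1.0.1/path/to/another_clip.mp4",
  "prompt": "The robot arm picks up the mug and places it on the tray."
}
EOF
```

Re-run the inference command afterwards; it reads the same config path, so no flags change.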

## Optional: 2B distilled (Text2World only)

To download and run the 2B distilled checkpoint (text-to-video only):

```bash
python examples/inference.py \
  -i assets/base/robot_pouring.jsonl \
  -o outputs/distilled_text2world \
  --model=2B/distilled \
  --inference-type=text2world
```

(Use a JSON/JSONL that has only `prompt` or `prompt_path`, no `input_path`.)
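
A prompt-only input file might look like the sketch below; the filename and prompt lines are made up for illustration, and the accepted fields should be checked against the shipped `robot_pouring.jsonl`:

```shell
# Write a minimal prompt-only JSONL: one JSON object per line, no input_path.
mkdir -p assets/base
cat > assets/base/my_text_prompts.jsonl <<'EOF'
{"prompt": "A robot arm slowly pours water from a glass beaker into a metal bowl."}
{"prompt": "A robot gripper stacks three wooden blocks on a table."}
EOF
```

Pass it to the distilled run with `-i assets/base/my_text_prompts.jsonl` in place of the sample file.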
