Metadata-Version: 2.4
Name: unified_world_model
Version: 0.1.0
Summary: PyTorch implementation of Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Author: Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10.14
Description-Content-Type: text/markdown
Requires-Dist: accelerate==0.33.0
Requires-Dist: antlr4-python3-runtime==4.9.3
Requires-Dist: appdirs==1.4.4
Requires-Dist: asciitree==0.3.3
Requires-Dist: certifi==2024.6.2
Requires-Dist: charset-normalizer==3.3.2
Requires-Dist: click==8.1.7
Requires-Dist: cloudpickle==3.0.0
Requires-Dist: dask==2024.6.0
Requires-Dist: decord==0.6.0
Requires-Dist: diffusers==0.30.0
Requires-Dist: dm-tree==0.1.8
Requires-Dist: docker-pycreds==0.4.0
Requires-Dist: einops==0.7.0
Requires-Dist: fasteners==0.19
Requires-Dist: filelock==3.15.3
Requires-Dist: fsspec==2024.6.0
Requires-Dist: ftfy==6.2.3
Requires-Dist: gitdb==4.0.11
Requires-Dist: GitPython==3.1.43
Requires-Dist: huggingface-hub==0.23.4
Requires-Dist: hydra-core==1.3.2
Requires-Dist: idna==3.7
Requires-Dist: imageio==2.21.2
Requires-Dist: imageio_ffmpeg==0.5.1
Requires-Dist: importlib_metadata==7.2.0
Requires-Dist: Jinja2==3.1.4
Requires-Dist: locket==1.0.0
Requires-Dist: markdown-it-py==3.0.0
Requires-Dist: MarkupSafe==2.1.5
Requires-Dist: mdurl==0.1.2
Requires-Dist: moviepy==2.1.1
Requires-Dist: mpmath==1.3.0
Requires-Dist: networkx==3.3
Requires-Dist: numcodecs==0.12.1
Requires-Dist: numpy==1.26.4
Requires-Dist: nvidia-cublas-cu12==12.1.3.1
Requires-Dist: nvidia-cuda-cupti-cu12==12.1.105
Requires-Dist: nvidia-cuda-nvrtc-cu12==12.1.105
Requires-Dist: nvidia-cuda-runtime-cu12==12.1.105
Requires-Dist: nvidia-cudnn-cu12==8.9.2.26
Requires-Dist: nvidia-cufft-cu12==11.0.2.54
Requires-Dist: nvidia-curand-cu12==10.3.2.106
Requires-Dist: nvidia-cusolver-cu12==11.4.5.107
Requires-Dist: nvidia-cusparse-cu12==12.1.0.106
Requires-Dist: nvidia-nccl-cu12==2.19.3
Requires-Dist: nvidia-nvjitlink-cu12==12.5.40
Requires-Dist: nvidia-nvtx-cu12==12.1.105
Requires-Dist: omegaconf==2.3.0
Requires-Dist: packaging==24.1
Requires-Dist: pandas==2.2.1
Requires-Dist: partd==1.4.2
Requires-Dist: pillow==10.3.0
Requires-Dist: promise==2.3
Requires-Dist: protobuf==3.20.3
Requires-Dist: psutil==6.0.0
Requires-Dist: Pygments==2.18.0
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: pytz==2024.1
Requires-Dist: PyYAML==6.0.1
Requires-Dist: regex==2024.5.15
Requires-Dist: requests==2.32.3
Requires-Dist: rich==13.7.1
Requires-Dist: robosuite==1.4.1
Requires-Dist: safetensors==0.4.4
Requires-Dist: scipy==1.14.0
Requires-Dist: sentry-sdk==2.6.0
Requires-Dist: setproctitle==1.3.3
Requires-Dist: six==1.16.0
Requires-Dist: smmap==5.0.1
Requires-Dist: sympy==1.12.1
Requires-Dist: tensorflow==2.15.0
Requires-Dist: tensorflow-datasets==4.9.7
Requires-Dist: tensorflow-metadata==1.16.1
Requires-Dist: timm==1.0.9
Requires-Dist: tokenizers==0.19.1
Requires-Dist: toml==0.10.2
Requires-Dist: toolz==0.12.1
Requires-Dist: torch==2.2.2
Requires-Dist: torchaudio==2.2.2
Requires-Dist: torchvision==0.17.2
Requires-Dist: tqdm==4.66.2
Requires-Dist: transformers==4.44.0
Requires-Dist: triton==2.2.0
Requires-Dist: typing_extensions==4.12.2
Requires-Dist: tzdata==2024.1
Requires-Dist: urllib3==2.2.2
Requires-Dist: wandb==0.19.1
Requires-Dist: wcwidth==0.2.13
Requires-Dist: zarr==2.18.2
Requires-Dist: zipp==3.19.2

# Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

####  [[Website]](https://weirdlabuw.github.io/uwm/) [[Paper]](https://arxiv.org/abs/2504.02792) [[Talk]](https://www.youtube.com/watch?v=WwPRxBbZ4kw)

[Chuning Zhu<sup>1</sup>](https://homes.cs.washington.edu/~zchuning/), [Raymond Yu<sup>1</sup>](https://raymondyu5.github.io/), [Siyuan Feng<sup>2</sup>](https://www.cs.cmu.edu/~sfeng/), [Benjamin Burchfiel<sup>2</sup>](https://scholar.google.com/citations?user=eGoTK1YAAAAJ&hl=en), [Paarth Shah<sup>2</sup>](https://www.paarthshah.me/about), [Abhishek Gupta<sup>1</sup>](https://homes.cs.washington.edu/~abhgupta/)<br/>

<sup>1</sup>University of Washington <sup>2</sup>Toyota Research Institute

This repository provides a PyTorch implementation of Unified World Model (UWM). UWM combines action diffusion and video diffusion to enable scalable pretraining on large, heterogeneous robotics datasets.


## Code structure
* `configs`: Configuration files for pretraining and finetuning experiments.
* `datasets`: Dataset wrappers for DROID, Robomimic, and LIBERO. We standardize all datasets using compressed [Zarr](https://zarr.readthedocs.io/en/stable/) buffers.
* `environments`: Interface wrappers for Robomimic and LIBERO environments.
* `experiments`: Training and evaluation scripts.
* `models`: Model definitions for UWM and baselines.
* `scripts`: Bash scripts for running DROID experiments.


## Setup
Install the package via
```
pip install -e .
``` 
> Note: if you encounter issues using `tensorflow-datasets` with DROID, consider installing `tensorflow-datasets` from [source](https://github.com/tensorflow/datasets).

## Robomimic Experiments
To run a Robomimic single-task experiment,
1. Install the [Robomimic](https://github.com/ARISE-Initiative/robomimic) dataset.
2. Update `hdf5_path` and `buffer_path` in the config (e.g., `configs/dataset/robomimic_can_ph.yaml`).
3. Run:
```
python experiments/uwm/train_robomimic.py --config-name train_uwm_robomimic.yaml dataset=robomimic_can_ph exp_id=singletask
```
This command generates a compressed Zarr buffer at the `buffer_path` specified in the config file.
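
When running several single-task experiments, it can help to assemble the command programmatically. The following is a minimal sketch, not part of this repository: `build_train_command` is a hypothetical helper that only composes the argument list, assuming the script path, config name, and `dataset=` / `exp_id=` overrides shown above, and Hydra's `--config-name` flag.

```python
import shlex


def build_train_command(
    dataset,
    exp_id,
    config_name="train_uwm_robomimic.yaml",
    script="experiments/uwm/train_robomimic.py",
    extra_overrides=None,
):
    """Compose the training command as an argument list (hypothetical helper)."""
    cmd = [
        "python",
        script,
        "--config-name",
        config_name,
        f"dataset={dataset}",
        f"exp_id={exp_id}",
    ]
    # Any further Hydra overrides are appended as key=value pairs.
    for key, value in (extra_overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


if __name__ == "__main__":
    cmd = build_train_command("robomimic_can_ph", "singletask")
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
    # To launch training instead of printing:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a string avoids shell-quoting problems if it is later passed to `subprocess.run`.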

## LIBERO Experiments
The LIBERO experiments share most infrastructure with the Robomimic experiments. 

### Pretraining
To pretrain a UWM on LIBERO-90,
1. Install the [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO) dataset.
2. Update `hdf5_path` and `buffer_path` in `configs/dataset/libero_90.yaml`.
3. Run:
```
python experiments/uwm/train_robomimic.py --config-name train_uwm_robomimic.yaml dataset=libero_90 exp_id=pretrain
```

### Finetuning
To finetune a pretrained UWM on a downstream LIBERO task (e.g., Book-Caddy),
1. Update `hdf5_path` and `buffer_path` in `configs/dataset/libero_book_caddy.yaml`.
2. Run:
```
python experiments/uwm/train_robomimic.py --config-name finetune_uwm_robomimic.yaml dataset=libero_book_caddy exp_id=finetune pretrain_checkpoint_path="logdir/uwm/libero_90/pretrain/0/models.pt"
```

We release the pretrained LIBERO-90 checkpoint [here](https://drive.google.com/drive/folders/1M4AuVLMRpSwOf_YAp56bV9AqyZI9ul6g?usp=sharing). You can download and directly finetune from this checkpoint.
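
A missing checkpoint is easier to diagnose before launching a finetuning run than after. The sketch below is a hypothetical helper, not part of this repository. It assumes the directory layout `logdir/<algo>/<dataset>/<exp_id>/<seed>/models.pt`, which is inferred only from the example path in the command above and may not hold for other runs.

```python
from pathlib import Path


def checkpoint_path(dataset, exp_id, seed=0, algo="uwm", logdir="logdir"):
    """Build a checkpoint path under the ASSUMED layout
    logdir/<algo>/<dataset>/<exp_id>/<seed>/models.pt."""
    return Path(logdir) / algo / dataset / exp_id / str(seed) / "models.pt"


def require_checkpoint(path):
    """Return the path if the checkpoint file exists, else fail early."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Pretrained checkpoint not found: {path}")
    return path


if __name__ == "__main__":
    path = checkpoint_path("libero_90", "pretrain")
    # Print the override in the form used by the finetuning command.
    print(f'pretrain_checkpoint_path="{path.as_posix()}"')
    print("exists:", path.is_file())
```

If you download the released checkpoint instead, pass its actual location to `require_checkpoint` rather than relying on the assumed layout.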

## DROID Experiments
We provide shell scripts for DROID pretraining, cotraining, and finetuning experiments in the `scripts` directory. Each script runs a dataset conversion pipeline to create a Zarr buffer for the corresponding DROID TFDS dataset and then launches training.

### Pretraining
To launch a DROID pretraining experiment,
1. Install the [DROID](https://droid-dataset.github.io/) dataset.
2. Update `DATA_DIR` and `BUFFER_PATH` in `scripts/launch_droid_pretrain.sh`.
3. Run:
```
source scripts/launch_droid_pretrain.sh
```

### Cotraining
To launch a video cotraining experiment,
1. Install the [DROID](https://droid-dataset.github.io/) dataset.
2. Update `DATA_DIR`, `ROBOT_BUFFER_PATH`, and `VIDEO_BUFFER_PATH` in `scripts/launch_droid_cotrain.sh`.
3. Run:
```
source scripts/launch_droid_cotrain.sh
```

### Finetuning
To finetune a pretrained model on a downstream task,
1. Collect demonstrations using the DROID interface.
2. Convert them into a TFDS dataset (via this [pipeline](https://github.com/kpertsch/droid_dataset_builder)).
3. Modify `scripts/launch_droid_finetune.sh` as needed and run:
```
source scripts/launch_droid_finetune.sh
```

We release the pretrained and cotrained DROID UWM checkpoints [here](https://drive.google.com/drive/folders/1M4AuVLMRpSwOf_YAp56bV9AqyZI9ul6g?usp=sharing). You can download and directly finetune from these checkpoints.

## BibTeX
If you find this code useful, please cite:

```
@inproceedings{zhu2025uwm,
    author    = {Zhu, Chuning and Yu, Raymond and Feng, Siyuan and Burchfiel, Benjamin and Shah, Paarth and Gupta, Abhishek},
    title     = {Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets},
    booktitle = {Proceedings of Robotics: Science and Systems (RSS)},
    year      = {2025},
}
```
