# Inference Guide

This guide provides instructions on running inference with Cosmos-Predict2.5/base models.

[Figure: cosmos-predict-diagram]

## Prerequisites

1. [Setup Guide](setup.md)

## Example

Run inference with an example asset:

```bash
python examples/inference.py -i assets/base/robot_pouring.json -o outputs/base_video2world --inference-type=video2world
```

To enable multi-GPU inference with 8 GPUs, use [torchrun](https://docs.pytorch.org/docs/stable/elastic/run.html):

```bash
torchrun --nproc_per_node=8 examples/inference.py -i assets/base/robot_pouring.json -o outputs/base_video2world --inference-type=video2world
```

To generate longer videos with autoregressive sliding window mode:

```bash
python examples/inference.py -i assets/base/bus_terminal_long.json -o outputs/autoregressive
```

To change the inference type, pass `--inference-type`:

| Variant | Arguments |
| --- | --- |
| Text2World | `--inference-type=text2world` |
| Image2World | `--inference-type=image2world` |
| Video2World | `--inference-type=video2world` |

To run all example assets:

```shell
torchrun --nproc_per_node=8 examples/inference.py -i assets/base/*.json -o outputs/base
```

To change the model, pass `--model`:

| Model | Arguments | Notes |
| --- | --- | --- |
| 2B | `--model=2B/post-trained` | |
| 2B/distilled | `--model=2B/distilled --inference-type=text2world` | Only Text2World supported. |
| 14B | `--model=14B/post-trained` | |

To see all available options:

```bash
python examples/inference.py --help
```

Parameters are specified as JSON:

```jsonc
{
  // Inference type: text2world, image2world, video2world
  "inference_type": "video2world",
  // Sample name
  "name": "robot_pouring",
  // Input prompt
  "prompt": "A robotic arm, primarily white with black joints and cables...",
  // Path to the input image/video file (not needed for text2world)
  "input_path": "robot_pouring.mp4"
}
```

### Outputs

#### text2world/snowy_stop_light

#### image2world/robot_welding

#### video2world/sand_mining

## Tips

### Multi-GPU

Context parallelism distributes inference across multiple GPUs, with each GPU generating a subset of the video frames.

* The number of GPUs should ideally be a divisor of the number of frames in the generated video.
* All GPUs should be the same model with the same memory capacity.
* Context parallelism works best with the 14B model, where memory constraints are significant.
* It requires NCCL support and a proper GPU interconnect for efficient communication.
* It gives a significant speedup for video generation while maintaining the same quality.

### Prompt Engineering

For best results with Cosmos models, create detailed prompts that emphasize physical realism, natural laws, and real-world behaviors. Describe specific objects, materials, lighting conditions, and spatial relationships while maintaining logical consistency throughout the scene. Incorporate photography terminology such as composition, lighting setups, and camera settings. Use concrete terms like "natural lighting" or "wide-angle lens" rather than abstract descriptions, unless intentionally aiming for surrealism. Include negative prompts to explicitly specify undesired elements. The more grounded a prompt is in real-world physics and natural phenomena, the more physically plausible and realistic the generated video will be.
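
The two tips above can be sketched as small Python helpers. This is only an illustration, not part of the Cosmos codebase: the function names `valid_gpu_counts`, `build_prompt`, and `make_params` are hypothetical, and the frame count of 96 is a placeholder rather than the model's actual output length. The dictionary keys are the ones shown in the parameter file under Example. No negative-prompt field is included because this guide does not name its key.

```python
from __future__ import annotations

import json


def valid_gpu_counts(num_frames: int, max_gpus: int) -> list[int]:
    """Return the GPU counts up to max_gpus that evenly divide num_frames."""
    return [n for n in range(1, max_gpus + 1) if num_frames % n == 0]


def build_prompt(subject: str, materials: str, lighting: str, camera: str) -> str:
    """Join concrete scene details into one prompt, one sentence per detail."""
    parts = [subject, materials, lighting, camera]
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p.strip())


def make_params(
    name: str,
    inference_type: str,
    prompt: str,
    input_path: str | None = None,
) -> dict:
    """Build a parameter dict using the keys from the JSON example above."""
    params = {"inference_type": inference_type, "name": name, "prompt": prompt}
    # input_path is not needed for text2world, so it is only added when given.
    if input_path is not None:
        params["input_path"] = input_path
    return params


if __name__ == "__main__":
    # 96 is a placeholder frame count chosen for illustration only.
    print(valid_gpu_counts(96, 8))

    prompt = build_prompt(
        subject="A robotic arm pours water from a glass pitcher into a cup",
        materials="The arm has a white plastic shell with black rubber joints",
        lighting="Natural lighting from a window on the left",
        camera="Shot with a wide-angle lens at table height",
    )
    print(json.dumps(make_params("pouring_demo", "text2world", prompt), indent=2))
```

The printed JSON can be saved to a file and passed to `examples/inference.py` with `-i`, in the same way as the bundled assets.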