# Running training on DGX

## Workflow

1. Check the live dashboard for available compute nodes.
2. Add yourself to the reservation list for the node you want.
3. SSH to the compute node: `ssh dgx0X` (replace `X` with the node number).
4. Launch the job in Docker. This is the lab convention; running Python directly on the host is discouraged.
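
Between steps 3 and 4 it can help to confirm that the GPUs on the node really are idle before launching. The sketch below is one way to do that with `nvidia-smi`; the helper name `idle_gpus` and the thresholds (500 MiB, 5% utilization) are assumptions chosen for illustration, not a lab rule.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the indices of GPUs that look idle.
# Reads CSV rows of the form "index, memory.used, utilization.gpu" on stdin.
# The default thresholds are assumptions; tune them to taste.
idle_gpus() {
    local max_mem_mib="${1:-500}"
    local max_util="${2:-5}"
    awk -F', *' -v m="$max_mem_mib" -v u="$max_util" \
        '$2 + 0 <= m && $3 + 0 <= u { print $1 }'
}

# Only query the driver if nvidia-smi is present on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,memory.used,utilization.gpu \
               --format=csv,noheader,nounits | idle_gpus
fi
```

This is a point-in-time check only; the dashboard and reservation list remain the source of truth for who owns a node.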

## Docker pattern (placeholder — refine after first lab job)

```bash
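# Mounts: code -> /workspace, data -> /data (adjust the host paths to your own home directory).
# --shm-size raises /dev/shm above Docker's 64 MB default, which multi-process data loaders can exhaust.
# <BLESSED_LAB_IMAGE> and <name> are placeholders; fill them in once known.
docker run --gpus all --rm -it \
    -v /home/cameron.smith/code:/workspace \
    -v /home/cameron.smith/data:/data \
    --shm-size=32g \
    <BLESSED_LAB_IMAGE> \
    bash -lc "cd /workspace && python train.py --run_name <name> ..."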
```
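
Until the canonical command is settled, the pattern above can be wrapped in a small function so the user-specific paths and the GPU selection are parameters rather than hard-coded. This is a sketch under assumptions: `launch_training`, `CODE_DIR`, `DATA_DIR`, `DRY_RUN`, the image name `example/lab-image:latest`, and the `<user>-<run_name>` container name are all hypothetical and not lab conventions. The `"device=0,1"` form with embedded double quotes is how Docker expects a multi-GPU device list to be passed to `--gpus`.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the Docker pattern above.
# Usage: launch_training <image> <run_name> [gpu_ids]
#   gpu_ids: "all" (default) or a comma-separated list such as "0,1".
# With DRY_RUN=1 (the default) the command is only printed, not executed.
launch_training() {
    local image="$1"
    local run_name="$2"
    local gpus="${3:-all}"
    local code_dir="${CODE_DIR:-$HOME/code}"   # assumed layout: ~/code
    local data_dir="${DATA_DIR:-$HOME/data}"   # assumed layout: ~/data
    local gpu_arg="all"
    if [ "$gpus" != "all" ]; then
        # Docker needs the device list wrapped in literal double quotes.
        gpu_arg="\"device=${gpus}\""
    fi

    # Build the argument list in the positional parameters.
    # Append further train.py flags to the final string as needed.
    set -- docker run --gpus "$gpu_arg" --rm -it \
        --name "$(id -un)-${run_name}" \
        -v "${code_dir}:/workspace" \
        -v "${data_dir}:/data" \
        --shm-size=32g \
        "$image" \
        bash -lc "cd /workspace && python train.py --run_name ${run_name}"

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Printed for inspection only; shell quoting is not preserved.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Dry run with a placeholder image name (not the real lab image).
DRY_RUN=1 launch_training "example/lab-image:latest" "demo-run" "0,1"
```

Setting `DRY_RUN=0` runs the command for real. Prefixing the container name with the username is only a stopgap against name collisions until the lab's job-naming convention is known.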

## Open questions for Sergey

- Which base image does the lab standardize on?
- What are the standard `-v` mounts (where is scratch, where is shared data)?
- What is the job-naming convention, so that multiple users' jobs don't collide?
- How long can a job run before it gets bumped from the reservation?
- Where should checkpoints be written (per-user home or shared scratch)?

## Things to do once

- After your first successful Docker job, record the exact command here as the canonical template.
- Note any non-obvious flags or mounts.