# TRI machines

## Training surfaces — summary

| Surface | Mode | Best for |
|---|---|---|
| **Lab server** (`/data/cameron`, school dev box) | Interactive | Familiar env; anything that doesn't need TRI data. Sergey calls it "not recommended" while at TRI |
| **Personal box** (`10.110.23.118`, 2×24GB) | Interactive | Solo experiments, sanity checks, overnight runs you control |
| **DGX cluster** = **SageMaker** (alias `ssh dgx`, compute nodes via ProxyJump) | Interactive Docker + batch | Anything multi-GPU, anything that needs TRI data, anything at scale |

DGX and SageMaker refer to the same physical hardware under different framings: DGX names the boxes, and SageMaker names the job-submission system on top of them. See `processes/where_to_train.md` for the decision rubric.

## DGX cluster (training)

- **Head node**
  - hostname: `10.110.170.251`
  - alias: `ssh dgx`
  - user: `cameron.smith`
  - groups: `docker`
  - role: SSH entry point. Don't run heavy jobs here — proxy through to compute nodes.

- **Compute nodes** (jump via head, all under `tri-hq-ml-dgx-*`)
  - `tri-hq-ml-dgx-01` (alias `ssh dgx01`)
  - `tri-hq-ml-dgx-02` (alias `ssh dgx02`)
  - …additional nodes exist; live dashboard shows current status (URL TBD — ask Sergey)
  - **Reservation rule:** put yourself on the reservation list before launching a job.
  - **Job containerization:** Docker, no exceptions (lab convention).
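
The head-node and compute-node aliases above can be expressed as an OpenSSH client config using `ProxyJump`. The sketch below only prints such a config. It assumes the compute nodes accept the same `cameron.smith` user as the head node and that their hostnames resolve from the head node. The function name `dgx_ssh_config` is hypothetical.

```shell
# Print an OpenSSH client config for the DGX head node and two compute nodes.
# Assumption: compute nodes use the same user as the head node.
# With ProxyJump, the compute-node hostname is resolved on the head node.
dgx_ssh_config() {
  cat <<'EOF'
Host dgx
    HostName 10.110.170.251
    User cameron.smith

Host dgx01
    HostName tri-hq-ml-dgx-01
    User cameron.smith
    ProxyJump dgx

Host dgx02
    HostName tri-hq-ml-dgx-02
    User cameron.smith
    ProxyJump dgx
EOF
}

dgx_ssh_config
```

Review the output before appending it with `dgx_ssh_config >> ~/.ssh/config`. After that, `ssh dgx01` should hop through the head node without a manual two-step login.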

## Personal training box (2×24GB) — `PUGET-232243-01`

- **hostnames:**
  - Tailscale: `100.104.232.94` (alias `ssh dev` — primary, works from anywhere on the tailnet incl. the school server)
  - TRI-LAN: `10.110.23.118` (alias `ssh dev-lan` — only from inside TRI network)
- **user:** `cameronsmith`
- **auth from school server:** ed25519 key (yams pubkey installed 2026-05-26)
- **GPUs:** 2× RTX 3090, 24GB each
- **CPU/RAM/OS:** 64 cores, 251GB RAM, Ubuntu 24.04
- **Role:** Solo experiments, no queue, full attention. Good for "I just need to iterate" runs.
- **Network:** on TRI internal subnet (10.110.x.x) for direct reach to DGX + YAM workstation; reachable from the school-server side via Tailscale overlay (no VPN needed).
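
Since `dev-lan` only works inside the TRI network and `dev` works anywhere on the tailnet, one option is a small helper that picks the alias automatically. This is a minimal sketch, not an existing tool: the names `lan_reachable`, `pick_dev_alias`, and `DEV_PROBE` are hypothetical, and the probe assumes SSH listens on port 22 of the LAN address.

```shell
# Succeed if the TRI-LAN address answers on the SSH port within 2 seconds.
# Uses bash's /dev/tcp redirection; `timeout` comes from GNU coreutils.
# Assumption: sshd on the personal box listens on the default port 22.
lan_reachable() {
  timeout 2 bash -c 'exec 3<>/dev/tcp/10.110.23.118/22' 2>/dev/null
}

# Print the alias to use: dev-lan when the LAN address is reachable,
# otherwise dev (Tailscale). DEV_PROBE can override the probe command.
pick_dev_alias() {
  if "${DEV_PROBE:-lan_reachable}"; then
    echo dev-lan
  else
    echo dev
  fi
}
```

Typical use would be `ssh "$(pick_dev_alias)"`. Off the TRI network the probe waits out the 2-second timeout before falling back to `dev`.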

## YAM workstation (robot control)

- **YAM control PC**
  - hostname: `10.110.22.11`
  - alias: `ssh robot-lab`
  - user: `robot-lab`
  - role: YAM robot control + data collection + Raiden visualizer
  - sample data: `/home/robot-lab/data/processed`
  - python env: `source ~/raiden/.venv/bin/activate`
  - visualizer launch: `rd visualize --web` (from repo root inside the venv)
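
The env activation and visualizer launch above can be combined into one remote command. The sketch below assumes the repo root is `~/raiden` (inferred from the venv path, not confirmed) and that the `robot-lab` alias is already in your SSH config. The function names are hypothetical.

```shell
# Build the remote command string.
# Assumption: the repo root is ~/raiden, inferred from the venv path.
# `.` is the portable spelling of `source`, in case the remote shell is not bash.
yam_visualize_cmd() {
  printf '%s' 'cd ~/raiden && . .venv/bin/activate && rd visualize --web'
}

# Run the visualizer on the YAM control PC.
# -t allocates a terminal so Ctrl-C stops the remote process.
yam_visualize() {
  ssh -t robot-lab "$(yam_visualize_cmd)"
}
```

Calling `yam_visualize` from a machine that can reach `10.110.22.11` starts the visualizer in the foreground. Printing `yam_visualize_cmd` first is a cheap way to check the command before running it.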

## Pending — fill in as you go

- VPN config (if needed from off-site)
- File transfer pattern between DGX and YAM workstation (rsync via head node? S3? shared NFS?)
- Live dashboard URL for DGX node availability
- TRI shared data mounts (if any)
