# Where to train — decision rubric

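Three training surfaces, each with a different reason to use it. Picking one without thinking leads to bad outcomes, and using the wrong surface for a job costs days. The DGX cluster has two modes (interactive Docker and SageMaker batch), so the rubric below has four entries.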

## The three surfaces

| Surface | Alias | GPUs | Mode | Network |
|---|---|---|---|---|
| **Lab server** | (your school dev box, `/data/cameron`) | shared | interactive | school internal |
| **Personal box** | `ssh dev` | 2×24GB | interactive, no queue | TRI internal (10.110.23.118) |
| **DGX cluster** (= SageMaker) | `ssh dgx` / `dgx01` / `dgx02` / … | many per node, multi-node available | interactive Docker + batch via SageMaker | TRI internal |

## Decision rubric

### Use the **personal box** when:

- You're iterating on a model, watching it train, killing and restarting frequently
- The job fits in 2×24GB (most PARA-scale runs do)
- You don't want to fight for a reservation
- You want to leave something overnight and just have it work
- You're prototyping a new head / new loss / new data pipeline

This is your *default* for most TRI work. Use this unless you have a specific reason to go bigger.
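
To sanity-check the "fits in 2×24GB" condition, a rough lower bound can be computed from the parameter count. The sketch below assumes plain fp32 training with Adam (4 bytes for weights, 4 for gradients, 8 for optimizer state, so 16 bytes per parameter) and ignores activations, which often dominate. It also assumes DDP-style data parallelism, where each GPU holds a full replica, so the budget that matters is 24 GB per GPU rather than 48 GB in total. The function names and the `headroom` fraction are illustrative, not measured on this box.

```python
def min_train_gib(n_params: int, bytes_per_param: int = 16) -> float:
    """Lower bound on per-replica memory in GiB (weights + grads + Adam state).

    Assumes fp32 training with Adam: 4 (weights) + 4 (grads) + 8 (Adam moments).
    Activations, CUDA context, and framework overhead are NOT included.
    """
    return n_params * bytes_per_param / 2**30


def fits_per_gpu(n_params: int, gpu_gib: float = 24.0, headroom: float = 0.5) -> bool:
    """True if the static training state uses at most `headroom` of one GPU.

    `headroom` is a hypothetical rule of thumb: the remaining fraction is
    left for activations and framework overhead.
    """
    return min_train_gib(n_params) <= gpu_gib * headroom
```

Under these assumptions a 1B-parameter model needs roughly 15 GiB of static state per replica, which already exceeds half of a 24 GB card. Treat the result as a first filter only; a short smoke test on the box is the real check.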

### Use the **DGX cluster (interactive Docker)** when:

- You need multi-GPU on one node (e.g., larger backbones, distributed data parallel within a node)
- You're running an experiment that needs to be on the TRI side of the data fence (sensitive data, faster shared mounts)
- You want to use the same compute that TRI's other internal jobs use, for fair comparison numbers in the paper
- You want to build the muscle / network of using TRI infra (this is part of why you're there)

Etiquette: reserve the node on the dashboard before launching. Use Docker, not bare-host Python.

### Use **SageMaker (batch job submission)** when:

- You're launching a sweep — many experiments in parallel, no babysitting
- A single run will take longer than 24-48 hrs (interactive sessions die from network blips at that scale)
- You're doing a "final" run for the paper that needs a queued, reproducible, log-attached environment
- You need cross-node distributed training (multi-node DDP)

### Use the **school lab server** when:

- You need to test against existing PARA code/data that lives there
- You're doing something quick (< 1 hr) and don't want the TRI VPN round trip
- You're explicitly NOT touching TRI data and the experiment is local-only

But: **don't default here while at TRI.** Sergey flagged it as "not recommended" because:
- Familiarity makes it the path of least resistance, so you slip back into it without deciding to
- It adds reproducibility variance relative to runs on the TRI side
- Using TRI infra is part of why you're at TRI: build the muscle and the network
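
The rubric above can be condensed into a small lookup, which is handy as a checklist before launching a run. This is a minimal sketch, not an official tool: the `Job` fields, the surface names, the order in which the rules are checked, and the use of 24 hours (the low end of the 24-48 hr range) as the cutoff are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a run; field names are illustrative."""
    fits_2x24gb: bool = True            # fits on the personal box
    expected_hours: float = 1.0         # wall-clock estimate for a single run
    is_sweep: bool = False              # many parallel experiments, no babysitting
    multi_node: bool = False            # needs cross-node DDP
    final_paper_run: bool = False       # needs a queued, reproducible, logged run
    needs_tri_data_fence: bool = False  # must stay on the TRI side of the data fence
    needs_school_para: bool = False     # depends on PARA code/data on the lab server


def pick_surface(job: Job) -> str:
    """Return a suggested surface name for the given job."""
    # Batch cases first: runs you will not watch go to SageMaker.
    if job.is_sweep or job.multi_node or job.final_paper_run or job.expected_hours > 24:
        return "sagemaker"
    # Single-node jobs too big for the personal box, or tied to the TRI data fence.
    if not job.fits_2x24gb or job.needs_tri_data_fence:
        return "dgx-interactive"
    # Lab server only for school-side PARA code/data (TRI-fenced jobs returned above).
    if job.needs_school_para:
        return "lab-server"
    # Everything else stays on the default.
    return "personal-box"
```

For example, `pick_surface(Job(expected_hours=30))` suggests `"sagemaker"`, while a default `Job()` falls through to `"personal-box"`. The soft criteria in the rubric (fair comparison numbers, building familiarity with TRI infra) are deliberately left out, since they are judgment calls rather than hard conditions.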

## Defaults by phase of the internship

| Phase | Default surface |
|---|---|
| Weeks 1-2 (setup, reproducing OOD on YAM) | Personal box, plus DGX for the first Docker smoke test |
| Weeks 3-5 (wrist cam, backbone ablations) | Personal box for prototyping, DGX interactive for the backbone sweep |
| Weeks 6-9 (desk-organize headline demo) | Personal box for iteration, DGX interactive for the final-pass training |
| Weeks 10-12 (paper polish + video) | SageMaker for any final / reproducible runs that go in the paper |

## Anti-patterns to avoid

- **Defaulting to the lab server because it feels familiar.** First 2 weeks, every time you reach for it, stop and ask "is this actually the right call?"
- **Holding a DGX reservation overnight when you're not using it.** Other interns are watching. Free the node when you're done.
- **Running anything sensitive on the school server.** Two risks: data egress, and results that reviewers can't reproduce.
- **Going to SageMaker too early.** It's for runs you don't need to watch. Interactive boxes are for runs you do.

## Pending — fill in as you learn

- The actual `sm` / `aws` CLI commands or web UI for submitting SageMaker jobs
- TRI's Docker base image (so personal-box ↔ DGX runs can be reproduced cleanly)
- Whether the personal box has the same `/data` mount as DGX (probably not — note any data-movement steps required)
