Doradus Research
Notes from running an on-prem AI cluster — consumer + workstation-class GPUs, multi-vendor inference stacks, multi-model serving. Operator perspective: what actually breaks, what the docs don't say, and the configurations that ended up working in production.
Code at github.com/DoradusResearch. Hardware: 3 GPU compute nodes carrying 10× RTX PRO 6000 Blackwell (95 GiB each) + 4× RTX 5090, 2× DGX Spark (GB10, 128 GiB UMA), 2× Mac Studio M3 Ultra (256 GiB UMA each). ~1.3 TB system RAM, ~75 TB tiered storage across four tiers — hot NVMe, erasure-coded warm cluster, shared NFS model cache, SMB cold archive. 100GbE fiberoptic backbone, modern scheduler + service mesh with mTLS, perimeter firewall appliance. All on-prem.
recent posts
- 002 Sleep mode on Blackwell, part 2: catching CUDA error 700 live + a generic Triton shmem-budget helper
We validated upstream vLLM PR #43020 on RTX PRO 6000 — sleep/wake cycles that previously hard-reset the engine now drain pending CUDA work before cuMemUnmap, fixing a race we caught in production. GLM-5.1-REAP, retired from our TP=4 rotation pool in April for exactly this symptom, is back. And we are shipping a generic Triton autotune shmem-budget helper upstream that replaces hand-rolled check_shared_mem() bucket switches with per-config precision.
- 001 Frontier MoE sleep/wake at TP=4 on consumer Blackwell — 2s wake, 4.5s swap
DeepSeek-V4-Flash + MiMo-V2.5-Flash rotation on 4× RTX PRO 6000. Cross-peer swap 50s down to 4.5s. The sparse-MLA workspace, the SWA release bug, the cumem race, and the one config knob that made it work.