doradus-research

Three frontier AI models on four GPUs: hot-swapping in 4.5 seconds

Hot-swap three frontier AI models on the same four consumer GPUs in under five seconds. Used to take fifty. Plain-language breakdown of what changed, why it matters, and a short engineering note for the curious.

Update 2026-06-13. The original model lineup described here (DSv4-Flash + MiMo-V2.5-Flash + GLM-5.1-REAP-478B sharing 4× RTX PRO 6000) is no longer what runs on those four GPUs. DeepSeek-V4-Flash was vacated on 2026-06-03 when a K2.6 NVFP4 TP=8 cross-host deployment took over the cards; K2.6 was retired the same evening and Q3.5-397B TP=4 has held the slot since. The rotation pattern and the 4.5-second hot-swap mechanics described below are unchanged: different model cast, same infrastructure. See the follow-up post for the upstream patches that landed since.

If you want to run three large AI models but only have four GPUs, you have a problem. The models do not all fit at once. The usual answer is “buy more GPUs”. Our answer is hot-swapping: keep one model “awake” and serving requests while the others sleep in regular RAM, then swap them in and out in a few seconds when traffic shifts.

This used to take fifty seconds per swap on consumer Blackwell hardware. It now takes 4.5 seconds. Three frontier models — DeepSeek-V4-Flash, MiMo-V2.5-Flash, and GLM-5.1-REAP-478B — share the same four RTX PRO 6000 GPUs. Pick whichever one you need; the other two are asleep in RAM and pay no GPU cost.

What this is useful for

  • Spend less on GPUs. Three models share four cards instead of needing four cards each, twelve in total.
  • Swap models with the traffic mix. If different workloads want different models, keep them all installed and let the live load decide which one is awake.
  • A/B test on the same hardware. Run two models on the same cards and alternate between them per request.
  • Skip the cold-load wait. Loading a frontier model from disk takes minutes. Waking one from RAM takes seconds.

Headline numbers

| Operation | Before | After |
| --- | --- | --- |
| Wake a model from sleep | needed a 35-second safety delay | ~2 seconds |
| Swap one model out, swap another in | ~50 seconds | ~4.5 seconds |
| Steady-state speed (DSv4-Flash, short context) | unchanged | ~70 tokens/sec |

The 4.5 seconds is what a user would experience if their request triggered a model change. It stays in a tight band across ten back-to-back swaps. No drift, no leaks, no slow degradation.

How we use it

We run a rotation pool of frontier models on the same four cards. Whichever model matches the current traffic stays awake; the others sleep. The swap is fast enough that one user request can trigger a model change without a noticeable hang. We rotate based on which workload is dominant in the last few minutes of traffic.
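
One way to picture that rotation rule is a sliding window over recent requests. The sketch below is a minimal illustration, not the production router: the class name, the five-minute window, and the model labels are all assumptions made for the example.

```python
from collections import Counter, deque


class DominantWorkloadPicker:
    """Pick the model that served the most requests in a recent time window.

    Hypothetical helper for illustration; the window length is an assumption.
    """

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, model_name), oldest first

    def record(self, timestamp: float, model_name: str) -> None:
        """Note that a request for `model_name` arrived at `timestamp`."""
        self._events.append((timestamp, model_name))

    def pick(self, now: float, current: str) -> str:
        """Return the model that should be awake at time `now`."""
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] < now - self.window_seconds:
            self._events.popleft()
        counts = Counter(name for _, name in self._events)
        if not counts:
            return current  # no recent traffic: leave the awake model alone
        best, best_count = counts.most_common(1)[0]
        # Swap only when another model strictly beats the current one,
        # so a tie never triggers a needless swap.
        if best_count > counts.get(current, 0):
            return best
        return current
```

Requiring a strict majority before switching is a design choice in this sketch: even a 4.5-second swap is worth avoiding when two workloads are evenly matched.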

How to get it

The runtime images are on GitHub Container Registry. Anonymous pull, no login needed:

docker pull ghcr.io/doradusresearch/vllm-blackwell-sm12x-bundle:v4
docker pull ghcr.io/doradusresearch/vllm-blackwell-sm12x-bundle:v5

The v4 image runs DeepSeek-V4 (the model with the more complex attention pattern). The v5 image runs MiMo and other frontier models with simpler attention. Both share the same GPUs cleanly. The source repo is being prepared for re-release; the images are enough to reproduce the result today.

The recommended config is at the top of the engineering section below. Apache-2.0 license.


The long version (for engineers who want the receipts)

The interesting case here is what AI people call frontier MoE at TP=4. That is shorthand for: a giant mixture-of-experts model whose weights are split across four GPUs at once, with a routing table that picks which experts to use per token. Each GPU holds about 17 GB of model weights plus the working memory for processing the request. And we want a second model sleeping next to the first one, sharing the same cards.

For months on consumer Blackwell hardware (the SM_120 architecture, no Fabric / no fast GPU-to-GPU interconnect), that combination crashed, hung, or produced garbage outputs after a few rotation cycles.

It now works. /wake_up: about 2 seconds. Cross-peer swap (put one model to sleep, wake another on the same four GPUs): about 4.5 seconds. Down from 50.

Why this case is hard

Sleep/wake for small models on a single GPU is easy — copy the weights to pinned CPU memory, copy them back to VRAM when needed. Done in a fraction of a second. The hard cases are:

  • The model is too big for one GPU. We’re running it on four with tensor parallelism — every layer is split across the four cards and they have to talk during every forward pass.
  • The model has a mixture-of-experts routing table that selects 1–8 expert subnetworks per token. The routing state and the KV cache (attention memory) both have to be saved and restored cleanly.
  • Another model is sharing the same four cards. When one is asleep, it leaves behind some “residue” — pinned memory that can’t be reclaimed until both are awake again. The residue isn’t documented anywhere obvious, so you find out about it by overflowing VRAM.

The setup

| Setting | DSv4-Flash | MiMo-V2.5-Flash |
| --- | --- | --- |
| Tensor parallel size | 4 | 4 |
| Max context length | 131,072 (128K) | 65,536 (64K) |
| Max concurrent sequences | 12 | 6 |
| Max batched tokens | 8,192 | 8,192 |
| GPU memory utilization | 0.60 | 0.85 |
| Sleep mode enabled | yes | yes |

The two models get different gpu_memory_utilization budgets, and that asymmetry was the unlock. DSv4 needs more headroom because of a workspace footprint we explain below; MiMo can run at the usual 0.85. They share the same four physical cards, but each one tells the allocator a different budget.
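
As a rough sketch of how those settings could map onto a launch command, the snippet below renders each column of the table as arguments for `vllm serve`. The flag names are the standard vLLM CLI options; the model paths are placeholders, and the `ServeConfig` class is a hypothetical helper, not something shipped in the images.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServeConfig:
    model: str  # placeholder, not a real model path
    tensor_parallel_size: int
    max_model_len: int
    max_num_seqs: int
    max_num_batched_tokens: int
    gpu_memory_utilization: float

    def to_argv(self) -> list[str]:
        """Render the settings as arguments for `vllm serve`."""
        return [
            "vllm", "serve", self.model,
            "--tensor-parallel-size", str(self.tensor_parallel_size),
            "--max-model-len", str(self.max_model_len),
            "--max-num-seqs", str(self.max_num_seqs),
            "--max-num-batched-tokens", str(self.max_num_batched_tokens),
            "--gpu-memory-utilization", f"{self.gpu_memory_utilization:.2f}",
            "--enable-sleep-mode",
        ]


# Values copied from the setup table; only the model paths are invented.
DSV4 = ServeConfig("<dsv4-flash-model-path>", 4, 131_072, 12, 8_192, 0.60)
MIMO = ServeConfig("<mimo-v2.5-flash-model-path>", 4, 65_536, 6, 8_192, 0.85)

if __name__ == "__main__":
    for cfg in (DSV4, MIMO):
        print(" ".join(cfg.to_argv()))
```

The only line that differs in a way that matters for co-residency is `--gpu-memory-utilization`; everything else is ordinary per-model sizing.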

First-time cold load: budget about 12 minutes per model from docker run to first serving request. DSv4 has 46 weight shards that load at ~14 seconds per shard from local NVMe, then CUDA graph capture runs. Pulling the 29 GB image on a fresh host adds more. Restarts on the same host are faster (page cache helps) but still dominated by the shard load.
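
A quick back-of-the-envelope check of that budget, using only the figures in the paragraph above (the variable names are illustrative):

```python
SHARDS = 46               # DSv4 weight shards
SECONDS_PER_SHARD = 14.0  # approximate load time per shard from local NVMe
BUDGET_MINUTES = 12.0     # cold-load budget quoted above

shard_load_minutes = SHARDS * SECONDS_PER_SHARD / 60.0
remaining_minutes = BUDGET_MINUTES - shard_load_minutes

print(f"shard load: {shard_load_minutes:.1f} min")
print(f"left for graph capture and startup: {remaining_minutes:.1f} min")
```

Shard loading alone accounts for roughly 10.7 of the 12 minutes, which is why a warm page cache helps but does not change the overall picture.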

What broke, in order

Five distinct things had to be fixed before this worked. We hit them serially, each one masking the next.

1. A community image had a cumem allocator bug. Our starting point was the voipmonitor b12x fork, a vLLM build patched for consumer Blackwell. Solid baseline, but every cross-peer wake produced CUDA Error: invalid argument at cumem_allocator.cpp:145. The cause: vLLM’s memory allocator asks the GPU driver whether Fabric (a high-speed interconnect) is supported. Consumer Blackwell does not have Fabric. The driver returns an error. vLLM caches the error globally. The next memory operation sees the cached error and fails. The fix is vllm#35489, a one-line reset of the cached error state. It has been open since March and is not in any released image.

2. Repeated sleep/wake cycles leaked GPU memory. Even with the allocator bug fixed, every cycle leaked up to 5 GB. After four or five cycles the leak overlapped with live model weights and the model started producing silent garbage. About seven open vLLM issues track this class of problem. We cherry-pick vllm#34600, which at least cleans up properly when a wake fails partway through.

3. The PR that lets DeepSeek-V4 work also breaks the VRAM budget. vllm#41834 is the upstream effort to support DeepSeek-V4 on consumer Blackwell. It works. It also adds about 22 GB of GPU state per GPU that lives outside the allocator’s normal accounting — sparse-attention workspaces, custom kernel scratch space, CUDA-graph private pools. On data-center cards with 140 GB per GPU, that’s invisible. On 95 GB consumer Blackwell, it overflows.

The math: at gpu_memory_utilization=0.85, the accounted budget is 0.85 × 95 GB ≈ 81 GB. Plus 22 GB of hidden, unaccounted state. Plus a few GB for the sleeping peer model’s residue. Total: about 105 GB on a 95 GB card. Out of memory.

4. Every config knob you’d reach for to shrink that 22 GB does nothing. We tried capping CUDA graph capture size, disabling graphs entirely, lowering max batched tokens, lowering max sequences, dropping max context length. None of them changed the 22 GB. PyTorch’s OOM message labels this memory as “private pools (e.g., CUDA Graphs)” — the parenthetical led us to spend days chasing graph configuration. The label is technically correct but misleading; the 22 GB is workspace tensors, not graphs.

5. The real fix is one config knob. Once we understood that the 22 GB is a hardcoded workspace footprint, the fix became obvious: give DSv4 a smaller gpu_memory_utilization. Not 0.85. Not 0.70. 0.60.

The first cut tried 0.70. The math looked right: 70% × 95 GB = 66.5 GB allocator + 22 GB workspace + ~4 GB residue ≈ 92 GB on a 95 GB card. It crashed at startup. Live measurement after the crash showed MiMo’s sleeping-peer residue is closer to 9 GB per GPU, not the 4 GB the vLLM design assumes. So the real math is 66.5 + 22 + 9 = 97.5 GB. Still overflows. Dropping to 0.60 gives 57 + 22 + 9 = 88 GB. About 7 GB of margin. Fits cleanly. Holds across the full 128K context.
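
The same arithmetic as a small calculator, for anyone who wants to plug in their own card size or residue. This is a sketch built only from the numbers above; the function names are illustrative, and the 95 GB, 22 GB, and residue figures are the measurements quoted in this post, not general constants.

```python
CARD_GB = 95.0        # VRAM per GPU, as used in the text
WORKSPACE_GB = 22.0   # hidden footprint outside the allocator's accounting


def peak_gb(utilization: float, residue_gb: float) -> float:
    """Accounted budget + hidden workspace + sleeping-peer residue."""
    return utilization * CARD_GB + WORKSPACE_GB + residue_gb


def margin_gb(utilization: float, residue_gb: float) -> float:
    """Positive means it fits; negative means out of memory."""
    return CARD_GB - peak_gb(utilization, residue_gb)


def max_utilization(residue_gb: float) -> float:
    """Largest gpu_memory_utilization that leaves exactly zero margin."""
    return (CARD_GB - WORKSPACE_GB - residue_gb) / CARD_GB


if __name__ == "__main__":
    for util, residue in [(0.70, 4.0), (0.70, 9.0), (0.60, 9.0)]:
        print(f"util={util:.2f} residue={residue:.0f} GB "
              f"peak={peak_gb(util, residue):.1f} GB "
              f"margin={margin_gb(util, residue):+.1f} GB")
    print(f"zero-margin ceiling at 9 GB residue: {max_utilization(9.0):.2f}")
```

With the measured 9 GB residue the zero-margin ceiling works out to about 0.67, so 0.70 was never going to fit, and 0.60 is the first round number below the ceiling that leaves real slack.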

[Image: Homer Simpson facepalming with a giant "D'OH!"]
Four weeks. Three cherry-picked upstream PRs. Sparse-attention workspace archaeology. Misleading OOM labels. And the unlock was one config knob.

The cost of running at 0.60 is a smaller working budget: less room for weights plus KV cache. DSv4 weights are about 17 GB per GPU. KV cache for 128K context at 12 concurrent sequences is about 7 GB. Total 24 GB, comfortably inside the 57 GB budget (0.60 × 95 GB). Headroom is fine.

Measured outcome

| Operation | Before (community image) | After (this stack) |
| --- | --- | --- |
| /sleep | Asynchronous, returns in 5 s, actual teardown takes 35 s more | Synchronous, ~2.5 s, range 1.8–3.0 s |
| /wake_up (cross-peer) | Needed a 35-second safety delay to avoid races | ~2 s, range 1.5–2.6 s, no delay needed |
| Cross-peer swap total | ~50 s | ~4.5 s, range 4.4–4.5 s across three live cycles |

4.5 seconds is the headline. For a rotation pool where a single user request can trigger a model change, 50 → 4.5 is the difference between “annoying” and “invisible”.
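
For readers who want to time a swap themselves, here is a minimal sketch that puts one server to sleep, wakes its peer, and reports both durations. It assumes each model runs as its own vLLM server exposing /sleep and /wake_up; the base URLs, the `level=1` query parameter, and the function names are assumptions for illustration, not details taken from the production setup.

```python
import time
import urllib.request
from typing import Callable


def http_post(url: str, timeout: float = 60.0) -> int:
    """POST with an empty body and return the HTTP status code."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def swap(sleeping_base: str, waking_base: str,
         post: Callable[[str], int] = http_post,
         clock: Callable[[], float] = time.perf_counter) -> dict:
    """Put one server to sleep, wake its peer, and time both steps."""
    t0 = clock()
    # Assumes /sleep is synchronous, as in the "After" column above.
    if post(f"{sleeping_base}/sleep?level=1") != 200:
        raise RuntimeError("sleep request failed")
    t1 = clock()
    if post(f"{waking_base}/wake_up") != 200:
        raise RuntimeError("wake_up request failed")
    t2 = clock()
    return {"sleep_s": t1 - t0, "wake_s": t2 - t1, "total_s": t2 - t0}
```

A call might look like `swap("http://localhost:8000", "http://localhost:8001")`, with both ports being placeholders. In upstream vLLM these endpoints are served only when the server is started with --enable-sleep-mode and VLLM_SERVER_DEV_MODE=1; whether the bundle images preset that is not stated here.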

What’s still slow: per-token speed is communication-bound

Rotation works. Steady-state speed is the next bottleneck. When we profile DeepSeek-V4 decoding tokens at tensor-parallel size 4:

| Operation | Time spent |
| --- | --- |
| sparse_accumulate_indexed_attention | 117.9 seconds |
| Cross-GPU all-reduce | 55.1 seconds |

Every token requires the four GPUs to talk to each other and reduce partial results. With sparse attention on this hardware (consumer Blackwell, no Fabric, slower interconnect than data-center cards), the cross-GPU communication is what’s eating wall-clock time. Compute isn’t the limit; the round-trip between cards is.

Practical consequence: for sparse-attention models like DSv4 on this stack, single-user decode is slow. Multi-user concurrent serving is far more cost-efficient because the communication overhead is amortized across many requests. If your workload is latency-sensitive single-user decode of a sparse-attention model specifically, pick a topology with a faster GPU-to-GPU interconnect.

Speed across context length

Live measurement against the production pool. Seven input sizes, from short (about 500 tokens) to long (about 100K tokens). Each request asked for 256 output tokens, so the short-context rows are decode-dominated and the long-context rows are dominated by first-token latency.

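| Input tokens | First-token latency | Decode speed | End-to-end |
| --- | --- | --- | --- |
| 477 | 0.54s | 69.7 tok/s | 4.2s |
| 1,724 | 0.68s | 68.9 tok/s | 4.4s |
| 6,669 | 2.09s | 66.0 tok/s | 6.0s |
| 26,535 | 9.37s | 58.6 tok/s | 13.7s |
| 52,980 | 17.34s | 51.8 tok/s | 22.3s |
| 79,468 | 23.06s | 45.8 tok/s | 28.6s |
| 99,291 | 21.12s | 41.9 tok/s | 27.2s |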

Decode drops from about 70 to about 42 tok/s as the context grows. That’s the communication overhead in numbers — more state to reduce per token, longer per-token round-trip.
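
To read the table as per-token cost, the sketch below converts decode speed into milliseconds per token and checks that end-to-end time is roughly first-token latency plus 256 tokens of decode. The rows are copied from the table; the helper names are illustrative.

```python
OUTPUT_TOKENS = 256  # each request asked for 256 output tokens

# (input tokens, first-token latency s, decode tok/s, measured end-to-end s)
ROWS = [
    (477, 0.54, 69.7, 4.2),
    (1_724, 0.68, 68.9, 4.4),
    (6_669, 2.09, 66.0, 6.0),
    (26_535, 9.37, 58.6, 13.7),
    (52_980, 17.34, 51.8, 22.3),
    (79_468, 23.06, 45.8, 28.6),
    (99_291, 21.12, 41.9, 27.2),
]


def per_token_ms(decode_tok_s: float) -> float:
    """Time spent on each generated token, in milliseconds."""
    return 1000.0 / decode_tok_s


def predicted_end_to_end(ttft_s: float, decode_tok_s: float) -> float:
    """First-token latency plus the time to decode the output tokens."""
    return ttft_s + OUTPUT_TOKENS / decode_tok_s


if __name__ == "__main__":
    for tokens, ttft, speed, measured in ROWS:
        print(f"{tokens:>6} in: {per_token_ms(speed):5.1f} ms/token, "
              f"predicted {predicted_end_to_end(ttft, speed):5.2f}s, "
              f"measured {measured:.1f}s")
```

Per-token time grows from about 14 ms at the shortest input to about 24 ms at the longest, and the simple two-term model reproduces every measured end-to-end figure to within a tenth of a second.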

What this means in practice:

  • Chat-length (under 5K tokens): ~70 tok/s decode, ~1s first-token latency. Comfortable for interactive use.
  • Document review (8K–30K tokens): ~60 tok/s decode, 2–10s first-token latency. Workable for batch analysis.
  • Long-context retrieval / agent work (50K–100K tokens): ~45 tok/s decode, 17–23s first-token latency. Useful for offline workloads; user-perceived latency is dominated by reading the input, not generating the output.

Lessons we wish we’d known earlier

  1. PyTorch’s “private pools” label on out-of-memory errors covers more than CUDA graphs. Chasing graph configuration based on this label can waste days.
  2. --enforce-eager is more useful as a diagnostic than as a deployment posture. It cleanly answers “is the OOM coming from graphs?” even if you don’t want to ship with it.
  3. PR #41834’s performance wins come with memory costs that aren’t called out in the PR description. The 22 GB hidden footprint is invisible on data-center cards.
  4. The cycle-leak problem is real but secondary. We chased it for weeks thinking it was the primary blocker. With the upstream fixes plus the 0.60 config, our normal rotation load doesn’t trigger it anymore.
  5. Sleep-mode behavior on consumer Blackwell is sensitive to the model’s attention pattern. What works for one architecture’s KV cache layout doesn’t necessarily work for another. We’ve validated dense MLA and sparse MLA at tensor-parallel size 4. Hybrid attention (linear + sliding-window) on Q3-Coder-Next-80B does not release cleanly yet on the cu129 image — that one stays always-awake. Tracking vllm#41602.

The bundle:v5 / v6 saga (honest status, 2026-05-17)

We tried to unify both models on a single image (bundle:v5). MiMo worked cleanly. DSv4 did not, because of a separate sparse-attention dependency we didn’t know about until it failed.

MiMo on bundle:v5 is stable across three back-to-back cycles:

| Cycle | Wake | Sleep |
| --- | --- | --- |
| 1 | 2.66s | 2.99s |
| 2 | 2.57s | 2.92s |
| 3 | 2.56s | 2.94s |

Steady-state is about 2.6s wake and 2.95s sleep. The very first sleep of a fresh process is 49s — that’s one-time CUDA graph teardown plus allocator initialization, paid once per process lifetime, not per cycle. We initially reported that 49s as a steady-state number; a reviewer caught the error. The corrected reading is in the table above.

DSv4 on bundle:v5 failed at engine init: Sparse Attention Indexer CUDA op requires DeepGEMM to be installed. There’s a second sparse-attention layer in vLLM that hard-imports deep_gemm and aborts if it’s missing. DeepGEMM doesn’t ship kernels for consumer Blackwell (SM_120). Installing it would fail at runtime instead of init time; the real fix is to patch the layer to use the existing Triton fallback path on SM_120.
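
The shape of that fix is a capability check in front of the hard import. The sketch below shows the decision only; the function name and the backend labels are hypothetical and this is not vLLM's actual dispatch code.

```python
def pick_indexer_backend(capability: tuple[int, int],
                         deep_gemm_installed: bool) -> str:
    """Choose a kernel backend for the sparse-attention indexer.

    Hypothetical helper: names and return values are illustrative only.
    """
    major, _minor = capability
    if major == 12:
        # SM_120 (consumer Blackwell): DeepGEMM ships no kernels for it,
        # so take the Triton fallback whether or not DeepGEMM is installed.
        return "triton"
    if deep_gemm_installed:
        return "deep_gemm"
    # Mirrors the init-time abort described above for other architectures.
    raise RuntimeError("Sparse Attention Indexer requires DeepGEMM on this GPU")
```

In a live process the capability tuple would typically come from `torch.cuda.get_device_capability()`, which returns `(major, minor)`. Checking the architecture before checking for the package is the point: on SM_120 an installed DeepGEMM would only move the failure from init time to runtime.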

We tried to ship that patch as bundle:v6. We did the work: ported the SM_120 Triton kernels from v4 into v5’s source structure, wired SM_120 dispatch into v5’s deep_gemm shim, relaxed the abort condition, routed the affected projection through v4’s dispatcher. The result was a different engine-init failure — “DeepSeek V4 fp8 einsum weight rows must be divisible by out_rank=1024, got 256”. The weight tensor coming out of v5’s loader is laid out per-rank (each GPU holds its slice). v4’s kernels expect the weights laid out group-concatenated (with all the per-group offsets computed inside the kernel). That’s a fundamental layout fork between the two vLLM versions. Recipe tweaks don’t bridge it; the v4 kernels expect a memory layout v5 doesn’t produce. Bridging would require writing new Triton kernels against v5’s layout, which is multi-day porting work, not the overlay patches we’ve been shipping.

So: the production rotation pool is hybrid. DSv4 stays on bundle:v4. MiMo runs on bundle:v5. Cross-peer rotation still works because each model owns its own sleep/wake state — the allocator fix matters per process, not per pool. Both images are on GHCR with anonymous pulls.

One useful architectural observation: bundle:v5 is architecture-agnostic for everything except DeepSeek-V4-class sparse-attention models. The allocator fixes, wake rollback, and other patches are model-independent. Any TP=4 dense, dense-MoE, or non-sparse-MLA model rotates cleanly on bundle:v5 today. The DeepGEMM dependency only shows up for sparse-MLA models.

Multi-cycle rotation proof (live, hybrid v4/v5)

Ten back-to-back swap operations on the live pool. Five full cycles each direction. No restarts, no warmup between swaps:

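| Cycle | Direction | Sleep | Wake | Total |
| --- | --- | --- | --- | --- |
| 1 | DSv4 → MiMo | 1.85s | 2.97s | 4.81s |
| 1 | MiMo → DSv4 | 2.97s | 1.58s | 4.55s |
| 2 | DSv4 → MiMo | 1.83s | 2.74s | 4.57s |
| 2 | MiMo → DSv4 | 3.01s | 1.59s | 4.60s |
| 3 | DSv4 → MiMo | 1.84s | 2.86s | 4.71s |
| 3 | MiMo → DSv4 | 3.04s | 1.53s | 4.57s |
| 4 | DSv4 → MiMo | 1.93s | 2.73s | 4.66s |
| 4 | MiMo → DSv4 | 3.14s | 1.55s | 4.69s |
| 5 | DSv4 → MiMo | 1.90s | 2.68s | 4.57s |
| 5 | MiMo → DSv4 | 3.04s | 1.56s | 4.60s |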

All ten succeeded. Cross-peer swap stays in a 4.55–4.81s band across all ten operations. No drift, no leak, no degradation. Each total pairs one model's sleep with the other's wake. MiMo takes about the same time in each direction (~3.0s sleep, ~2.8s wake). DSv4 is quicker and less even (~1.9s sleep, ~1.55s wake), because each model has its own KV cache size and workspace footprint to release.
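
The band and the no-drift claim can be checked directly from the ten rows. The snippet below is a small sketch over the table data; the variable names are illustrative.

```python
from statistics import mean

# (cycle, direction, sleep s, wake s, reported total s), copied from the table
SWAPS = [
    (1, "DSv4->MiMo", 1.85, 2.97, 4.81),
    (1, "MiMo->DSv4", 2.97, 1.58, 4.55),
    (2, "DSv4->MiMo", 1.83, 2.74, 4.57),
    (2, "MiMo->DSv4", 3.01, 1.59, 4.60),
    (3, "DSv4->MiMo", 1.84, 2.86, 4.71),
    (3, "MiMo->DSv4", 3.04, 1.53, 4.57),
    (4, "DSv4->MiMo", 1.93, 2.73, 4.66),
    (4, "MiMo->DSv4", 3.14, 1.55, 4.69),
    (5, "DSv4->MiMo", 1.90, 2.68, 4.57),
    (5, "MiMo->DSv4", 3.04, 1.56, 4.60),
]

totals = [row[4] for row in SWAPS]
band = (min(totals), max(totals))
spread = band[1] - band[0]
mean_total = mean(totals)

# Largest gap between sleep + wake and the reported total (rounding only).
max_rounding_gap = max(abs(s + w - t) for _, _, s, w, t in SWAPS)

# Drift check: a leak would show up as later cycles getting slower.
first_cycle_mean = mean(t for c, _, _, _, t in SWAPS if c == 1)
last_cycle_mean = mean(t for c, _, _, _, t in SWAPS if c == 5)

print(f"band {band[0]:.2f}-{band[1]:.2f}s, spread {spread:.2f}s, "
      f"mean {mean_total:.2f}s")
print(f"cycle 1 mean {first_cycle_mean:.2f}s, cycle 5 mean {last_cycle_mean:.2f}s")
```

The slowest swap in the run is the very first one, and the last cycle is faster than the first, which is the opposite of what a growing leak would produce.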

MiMo decode speed after rotation

After the ten cycles, with DSv4 asleep and MiMo awake, single-user decode at varying input context:

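| Input tokens | First-token latency | Decode speed | End-to-end |
| --- | --- | --- | --- |
| 458 | 22.96s | 103.4 tok/s | 25.44s |
| 1,634 | 0.16s | 142.7 tok/s | 1.95s |
| 6,296 | 0.55s | 139.8 tok/s | 2.39s |
| 20,296 | 2.12s | 138.9 tok/s | 3.96s |
| 38,958 | 2.37s | 137.5 tok/s | 4.23s |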

The 22.96s first-token latency in the first row includes a /wake_up the test script issued before the call — MiMo had just been sleeping. The other four rows are warm. MiMo holds 137–143 tok/s decode essentially flat from 1.6K to 39K tokens. That’s “the cycle leak isn’t leaking” in numbers: decode throughput doesn’t degrade after ten cross-peer rotations.

Acknowledgements

@jasl and @aabbccddwasd for PR #41834, the entire SM_120 DeepSeek-V4 enablement. @haosdent for PR #35489, the one-line allocator fix that took us a week to find. The vLLM cumem allocator authors for the sleep/wake design that makes sub-5-second model swap possible at all. Our small workspace-shrink contribution back is PR #42856.

Bottom line

You can run three frontier mixture-of-experts models on four consumer GPUs and swap between them in under five seconds. The piece that took us four weeks to find was a single config knob (gpu_memory_utilization=0.60 for the model with the sparse-attention workspace) hiding behind a misleading PyTorch error message and three open upstream PRs.

If you’re running similar hardware and trying to do similar rotation, the images are on GHCR. Apache-2.0, anonymous pulls. Feedback welcome — especially benchmarks on hardware we haven’t tested.