Sleep mode on Blackwell, part 2: catching CUDA error 700 live + a generic Triton shmem-budget helper

Update 2026-06-13. The GLM-5.1-REAP-478B-A42B “back in production” claim below applied to the 2-peer DSv4 + MiMo rotation only. Later the same day (2026-05-19) a separate regression on the glm51 image showed up in 3-peer rotation: /sleep returns HTTP 200 but does not actually release GPU memory, so the next peer wake hits cumem_allocator.cpp:139 and 500s. Root cause is path divergence — the glm51 image installs vLLM under /opt/vllm/ instead of dist-packages, so the bind-mount overlay for vllm#43020 wasn’t being loaded at runtime. Workaround in effect today: GLM runs in 2-peer rotation with DSv4 or MiMo, not 3-peer with both. A glm51 rebuild against latest vLLM main with #43020 cherry-picked is the durable fix; not yet done. The rest of the post (allocator stream-aware free, Triton shmem helper, validation results on DSv4/MiMo) is unchanged.

Upstream PR status (2026-06-13 sweep): #43020 still merged. #34600, #42856, #43047, #41602, #35489 all still open with no state change since publication. PR #43020 has not yet been baked into a tagged vLLM release, so the bind-mount overlay pattern at the bottom is still the operational path.

In April we retired one of our frontier models (GLM-5.1-REAP-478B) from production because it kept crashing the GPU every few hot-swap cycles. We thought we had lost it on consumer Blackwell hardware. A one-line fix was proposed upstream this week. We validated it on our cluster, and the model is back in the live rotation.

What this means in plain terms

One more frontier model is usable on consumer hardware. GLM-5.1-REAP-478B is a 478B-parameter mixture-of-experts model. It runs on 4× RTX PRO 6000 alongside DeepSeek-V4-Flash and MiMo-V2.5-Flash. None of them have to be evicted to run the others.
The crash was a 1-in-N timing bug. Every few hot-swap cycles the engine would die with a cryptic CUDA error and need a hard reset. The fix is one line of Python: wait for the GPU to finish what it is doing before tearing the memory down. Upstream PR is vllm-project/vllm#43020. We tested it; it works.
We are sending a Triton helper upstream. A small utility that lets autotune pick the optimal GPU kernel config without crashing on consumer cards with less shared memory than data-center cards. Useful for anyone writing custom kernels.

Headline numbers

What	Before (April)	After (this week)
GLM-5.1-REAP-478B in rotation pool	retired (would crash)	live, no observed crashes
Cross-peer swap cycles before crash	~4–5 cycles	unbounded across normal operation

The long version (for engineers who want the receipts)

Follow-up to “Three frontier AI models on four GPUs”. That post described the production state. This one walks through the two upstream changes that made it durable.

The earlier post left four upstream pull requests open: fixes we were carrying as local patches while waiting on upstream review. One more was opened as a draft after that post went up: vllm#43020, “Make the memory allocator stream-aware”.

The fix is a single line, but the bug it closes is the exact reason GLM-5.1-REAP-478B-A42B-NVFP4 got pulled from production in April. The symptom from operations was simple: needs a hard reset every N uses. We re-tested it this week with the patch applied. It survives.

What the fix actually does (in plain terms)

vLLM’s sleep-mode allocator returns GPU memory to the OS the moment a tensor’s reference count drops to zero. PyTorch’s regular allocator doesn’t — it waits for any in-flight GPU work touching that memory to finish first. The sleep-mode allocator skipped that wait. When a tensor was freed while a kernel was still computing against it, the GPU and the host raced. Sometimes the GPU won; sometimes you got a corrupt allocator state that crashed the next wake-up.

The patch is one line: tell PyTorch to wait for pending GPU work before releasing the memory. It runs only when the sleep-mode allocator frees something (model load, KV cache init, sleep/wake), so steady-state inference loses nothing.

 def _python_free_callback(self, ptr: int) -> HandleType:
     """Drain pending CUDA work before unmapping pool-backed storage."""
+    torch.cuda.synchronize()
     data = self.pointer_to_data.pop(ptr)
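
To make the ordering concrete, here is a minimal, GPU-free sketch of the pattern. ToyPoolAllocator and its fields are hypothetical stand-ins, not vLLM's classes. The injected synchronize callable stands in for torch.cuda.synchronize, which is what the real patch calls.

```python
from typing import Callable, Dict, List


class ToyPoolAllocator:
    """Hypothetical stand-in for a pool allocator with a free callback."""

    def __init__(self, synchronize: Callable[[], None]) -> None:
        # In a real deployment this would be torch.cuda.synchronize.
        self._synchronize = synchronize
        self.pointer_to_data: Dict[int, str] = {}
        self.events: List[str] = []

    def malloc(self, ptr: int, handle: str) -> None:
        # Record which handle backs this pointer.
        self.pointer_to_data[ptr] = handle

    def free(self, ptr: int) -> str:
        # Drain pending device work first, then drop the mapping.
        self._synchronize()
        handle = self.pointer_to_data.pop(ptr)
        self.events.append(f"unmap:{handle}")
        return handle


def demo() -> List[str]:
    """Show that the sync is observed before the unmap."""
    log: List[str] = []
    alloc = ToyPoolAllocator(synchronize=lambda: log.append("sync"))
    alloc.malloc(0x1000, "kv-cache")
    alloc.free(0x1000)
    return log + alloc.events
```

The only point of the sketch is the order inside free: the wait happens before the bookkeeping entry is removed, so nothing is unmapped while work may still be in flight.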

We caught the bug during validation

We ran a clean teardown sequence: sleep DSv4 → boot GLM (running its own unpatched allocator) on the same four GPUs → stop GLM → wake DSv4. The wake failed:

{"error":{"message":"Call to wake_up method failed: Worker failed with error
'CUDA Error: invalid argument at cumem_allocator.cpp:146'", ...}}

That’s the exact bug class #43020 fixes. Running an unpatched second model on the same physical GPUs left DSv4’s allocator in a state it couldn’t recover from. Hard restart cleared it. With #43020 in place, the same sequence doesn’t corrupt the allocator in the first place.

Validation results

We tested without rebuilding any container images. The patched allocator file is small enough to overlay as a single read-only bind-mount on the Nomad job. Less risk than a full image rebuild, faster turnaround.

Test 1 — Q3.6-27B sleep/wake cycle

/sleep level=1     → returned 200 in 39.4s
VRAM:              58101 → 32307 MiB (25.2 GiB released, engine still healthy)
/wake_up           → returned 200 in 1.9s
VRAM:              58149 MiB
post-wake chat:    coherent, no garbage tokens
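
A cycle like this can be timed with a few lines of standard-library Python. This is a sketch, not our harness: the host and port in the usage note are assumptions, and released_gib simply converts two VRAM readings (for example from nvidia-smi) from MiB to GiB.

```python
import time
import urllib.request
from typing import Tuple

MIB_PER_GIB = 1024


def released_gib(before_mib: int, after_mib: int) -> float:
    """Memory released between two VRAM readings, in GiB."""
    return (before_mib - after_mib) / MIB_PER_GIB


def timed_post(url: str, timeout_s: float = 120.0) -> Tuple[int, float]:
    """POST with an empty body; return (HTTP status, elapsed seconds)."""
    req = urllib.request.Request(url, data=b"", method="POST")
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        status = resp.status
    return status, time.monotonic() - start
```

Against a server with sleep mode enabled, usage would look like timed_post("http://localhost:8000/sleep?level=1") followed by timed_post("http://localhost:8000/wake_up"), with a VRAM reading taken before and after each call.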

Test 2 — GLM-5.1-REAP-478B stress test

The retirement bug needed multiple uses to show up, so we ran 30 back-to-back chat completions against a freshly resurrected GLM on the same four cards DSv4 lives on:

=== 30-request stress test ===
[01] ok  2.2s | 17 + 25 = 42
[02] ok 11.2s | Blue
[03] ok 17.9s | "Buenos días"
...
[28] ok  6.2s | Jupiter
[29] ok 12.6s | (long structured response)
[30] ok 12.8s | (Python framework list)

PASS=30 FAIL=0 GARBAGE=0
CUDA_ERROR_ILLEGAL_ADDRESS in engine logs: 0

Zero CUDA errors across 30 sequential generations. The original retirement symptom didn’t come back.
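
The tally logic behind a run like this is small. The sketch below is an assumption about how such a loop could be structured, not the script we ran: send is any callable that takes a prompt and returns the reply text (for example a thin wrapper around the chat completions endpoint), and looks_like_garbage is a deliberately crude, hypothetical heuristic.

```python
from typing import Callable, Dict, List


def looks_like_garbage(text: str) -> bool:
    """Crude heuristic (an assumption): empty, or one character repeated."""
    stripped = text.strip()
    return not stripped or (len(stripped) > 8 and len(set(stripped)) == 1)


def stress(prompts: List[str], send: Callable[[str], str]) -> Dict[str, int]:
    """Send prompts one after another and count outcomes."""
    tally = {"PASS": 0, "FAIL": 0, "GARBAGE": 0}
    for prompt in prompts:
        try:
            reply = send(prompt)
        except Exception:
            # Any transport or engine error counts as a failed request.
            tally["FAIL"] += 1
            continue
        tally["GARBAGE" if looks_like_garbage(reply) else "PASS"] += 1
    return tally
```

Keeping send injectable means the same loop can be dry-run against a fake before it is pointed at a live engine.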

Rolling it out across the fleet

GLM is back in the rotation. The same patch applies to every model we run that uses sleep/wake — MiMo-V2.5-Flash, DeepSeek-V4-Flash, and the AB1 GPU 5 cohort (Q3.6-27B, Gemma-4-26B, Nemotron-Omni, Qwen3-VL-30B, Q3.6-35B-A3B). Same pattern across the board: extract the allocator file from each image’s vLLM install path, apply the one-line patch, stage as a Nomad bind-mount overlay, walk the rotation one peer at a time. Rolling restart, no downtime. This is the third upstream-overlay we’ve rolled out this way on the consumer-Blackwell timeline.
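
Because install paths differ between images, it helps to resolve the overlay target from inside each image rather than hardcoding it. The sketch below shows one way to do that with the standard library. The function names are hypothetical, and it assumes the package is installed as a regular directory on disk.

```python
import importlib.util
from pathlib import Path


def installed_file(package: str, relative: str) -> Path:
    """Locate a file inside an installed top-level package without importing it."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        raise ModuleNotFoundError(f"{package} is not an installed package")
    root = Path(next(iter(spec.submodule_search_locations)))
    target = root / relative
    if not target.is_file():
        raise FileNotFoundError(target)
    return target


def bind_mount(host_patch: str, package: str, relative: str) -> str:
    """Build a read-only Docker-style bind-mount string for the overlay."""
    return f"{host_patch}:{installed_file(package, relative)}:ro"
```

Run inside the target image, installed_file("vllm", "device_allocator/cumem.py") reports where that image actually keeps the allocator file, so the bind-mount lands on the path the runtime loads.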

While validating #43020 we also finished a separate contribution that fixes a different class of failure — Triton kernels crashing at startup on consumer Blackwell because they assume more shared memory than the GPU has.

The problem. vLLM ships custom GPU kernels written in Triton. Many of them have configuration parameters tuned for H100 / H200 (228 KB of shared memory per thread block). On smaller GPUs, such as older Turing cards (64 KB), Ampere A100 (164 KB), and consumer Blackwell (99 KB), the larger configurations raise OutOfResources errors at startup and kill the worker mid-load.

The current workaround. A handful of vLLM kernels have hardcoded bucket switches:

# Current upstream
BKV_LIST = [64, 128] if check_shared_mem() else [32, 64]

That works for the worst-case kernel but it’s binary, runs once at import, and isn’t always enough. On consumer Blackwell at least one specific kernel configuration still needs ~131 KB, which is over the ~101 KB budget (the same 99 KiB limit counted in bytes: 101,376) even after the workaround picks the smaller bucket.
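
As a rough illustration of how a configuration ends up over budget, the sketch below estimates shared memory for a generic pipeline that stages two operand tiles. The formula, the tile sizes, and the 99 KiB budget constant are assumptions chosen for illustration, not the kernel or the estimator from our PR.

```python
def tile_shmem_bytes(block_m: int, block_n: int, block_k: int,
                     num_stages: int, elem_bytes: int = 2) -> int:
    """Bytes of shared memory for staged A (M x K) and B (K x N) tiles."""
    per_stage = (block_m * block_k + block_k * block_n) * elem_bytes
    return per_stage * num_stages


# Assumed per-block budget on consumer Blackwell: 99 KiB = 101,376 bytes.
BUDGET_BYTES = 99 * 1024

# Hypothetical 16-bit configs: one sized for a data-center part, one smaller.
big = tile_shmem_bytes(128, 128, 64, num_stages=4)
small = tile_shmem_bytes(64, 64, 64, num_stages=4)

print(f"big:   {big} bytes, fits={big <= BUDGET_BYTES}")
print(f"small: {small} bytes, fits={small <= BUDGET_BYTES}")
```

The takeaway is that the requirement scales with tile area times pipeline depth, so a single import-time switch on one parameter cannot guarantee that every remaining combination fits.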

Our helper. We wrote a small utility at vllm/triton_utils/shmem_budget.py that does the per-kernel math correctly:

infer_shmem_budget(device) — reads the actual per-block shared-memory budget from torch.cuda.get_device_properties. Cached per device.
make_shmem_pruner(estimate_shmem_bytes, ...) — returns a function Triton’s autotuner can use to drop any configuration that wouldn’t fit, with a safe fallback to the smallest config (and a warning) if nothing fits.

Both wire into Triton’s existing extension hooks. Zero changes to Triton itself. Zero changes to anyone else’s kernels. Zero change to H100 / H200: every configuration that fits the bigger GPU stays in the autotune search space.
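
The pruning idea can be sketched without Triton or a GPU. Everything named here is hypothetical: make_shmem_pruner_sketch, FakeConfig, and the budget_bytes parameter are illustrations, and the real helper's signature may differ. The callback shape (configs, named_args, **kwargs) follows what Triton's autotuner passes to an early_config_prune hook, as far as we understand it.

```python
import warnings
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FakeConfig:
    """Stand-in for triton.Config: tile kwargs plus num_stages."""
    kwargs: Dict[str, int] = field(default_factory=dict)
    num_stages: int = 2


def make_shmem_pruner_sketch(estimate_shmem_bytes: Callable[[Any], int],
                             budget_bytes: int) -> Callable[..., List[Any]]:
    """Return a prune function that keeps only configs within the budget."""

    def prune(configs: List[Any], named_args: Dict[str, Any],
              **kwargs: Any) -> List[Any]:
        fitting = [c for c in configs if estimate_shmem_bytes(c) <= budget_bytes]
        if fitting:
            return fitting
        # Nothing fits: fall back to the smallest config and say so.
        smallest = min(configs, key=estimate_shmem_bytes)
        warnings.warn("no config fits the shared-memory budget; "
                      "falling back to the smallest", RuntimeWarning)
        return [smallest]

    return prune


def estimate(cfg: FakeConfig) -> int:
    """Toy estimator: one square 16-bit tile per pipeline stage."""
    block = cfg.kwargs["BLOCK"]
    return block * block * 2 * cfg.num_stages


CONFIGS = [FakeConfig({"BLOCK": 64}, 2), FakeConfig({"BLOCK": 128}, 4)]
```

With Triton installed, a function of this shape would be handed to the autotuner via prune_configs_by={"early_config_prune": prune}. On a large budget both toy configs survive, which is the property that keeps data-center behavior unchanged.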

We wired the helper into two reference kernels: one Mamba-family chunk kernel where we originally hit the OOM, and one chunked-output kernel that already had a hardcoded workaround but still wasn’t sufficient. Twenty-one unit tests cover the helper across H100, A100, and consumer-Blackwell memory budgets.

The PR is submitted to vLLM as we publish this post; link at the bottom.

What’s still open upstream

Five upstream PRs each carry a piece of the consumer-Blackwell sleep/wake story. All five are open as of publication; we carry the patches as bind-mount overlays in the meantime.

PR	What it does	Our state
vllm#34600	Wake-up partial-map rollback	Already in our bundles
vllm#43020	Allocator stream-aware free (this post)	Validated, overlay live
vllm#41602	Hybrid Mamba/DeltaNet wake-up	Already in cu129-nightly
vllm#42856	Consumer-Blackwell sparse-attention workspace bounds	DSv4-only; not in the hot path
vllm#43047	Our Triton shared-memory pruner	Submitted

One bug isn’t fixed by any of these: Qwen3-Next-80B-A3B-Thinking’s deeper sleep mode returns HTTP 200 without actually releasing weights. That’s an architectural problem in how the hybrid DeltaNet plus sliding-window-attention release path fires, not an allocator race. Separate work, separate post.

Reproducing the overlay

The bind-mount pattern is small enough to drop into any vLLM Nomad spec:

volumes = [
  # Bake every vLLM patch we carry into the live container
  "/path/to/cumem-43020-patched.py:/usr/local/lib/python3.12/dist-packages/vllm/device_allocator/cumem.py:ro",
  # /root/.cache must be writable for FlashInfer JIT on readonly-rootfs containers
  "/tmp/<job-name>-rootcache:/root/.cache:rw",
]

The patched file is each image’s own cumem.py with a minimal port of vllm#43020: just the torch.cuda.synchronize() line. We don’t drop in the full upstream file from the PR head because there’s enough vLLM-version skew that a full swap breaks FlashInfer imports on our cu129-nightly base. The one-line surgical patch is the safer overlay.
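
Producing the patched file can be scripted. The sketch below is one possible approach, not our tooling: apply_sync_patch is a hypothetical helper that assumes the target file contains the anchor line exactly as shown in the diff earlier in this post and already imports torch. It refuses to guess if the anchor is missing and does nothing if the line is already there.

```python
def apply_sync_patch(source: str) -> str:
    """Insert torch.cuda.synchronize() ahead of the pointer_to_data pop."""
    anchor = "data = self.pointer_to_data.pop(ptr)"
    sync = "torch.cuda.synchronize()"
    out = []
    in_callback = False
    patched = False
    for line in source.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("def "):
            # Track whether we are inside the free callback.
            in_callback = stripped.startswith("def _python_free_callback")
        if in_callback and not patched and stripped == anchor:
            # Skip the insert if the previous line is already the sync call.
            if not (out and out[-1].strip() == sync):
                indent = line[: len(line) - len(line.lstrip())]
                out.append(f"{indent}{sync}\n")
            patched = True
        out.append(line)
    if not patched:
        raise ValueError("anchor line not found; refusing to guess")
    return "".join(out)
```

Reading the image's cumem.py, passing its text through this function, and writing the result to the host path used in the bind-mount gives the overlay file. It is worth diffing the output against the original before staging it.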

Bottom line

A model we retired in April is back in production. The fix was a single line of Python upstream: wait for pending GPU work to finish before tearing down the memory it was using. While we were testing, we also wrote a small Triton helper that prevents a related class of consumer-Blackwell kernel crashes, and we’re sending it upstream.

If you’re running vLLM on consumer Blackwell hardware and seeing intermittent CUDA errors after a few sleep/wake cycles, vllm#43020 is probably the fix you need. The bind-mount overlay pattern above works without rebuilding your container.