1-nvidia, DGX OS 7 (Ubuntu 24.04), running the default GNOME desktop session (Xorg + gnome-shell + Firefox open, ~350-650MB GPU memory in use by the desktop throughout).
The issue: vLLM consistently hangs right after attention/MoE backend selection, before model weight loading meaningfully progresses. No errors, no crash — just silence. Reproduced across three independently-built images and multiple models/quantization formats — including NVIDIA’s own official “agent-ready Qwen3.6-35B” recipe from build.nvidia.com, followed verbatim.
The hang signature (identical pattern every time):
Last log line is always some variant of “Using [FLASHINFER/TRITON_ATTN/FLASH_ATTN] attention backend out of potential backends: […]”
EngineCore process shows in nvidia-smi with memory allocated (varies 18.9GB–66.7GB depending on attempt) but then goes static
GPU-Util sits at 0% (one attempt briefly showed 1%)
Process has 40-45 sleeping threads, near-zero cumulative CPU time across all of them
/proc/[pid]/io shows zero read_bytes, static rchar/wchar — no IO activity
Port never binds — curl to /health and /v1/models both fail to connect
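For anyone who wants to check for the same signature, here is a minimal Python sketch of the /proc/[pid]/io comparison I did by hand. The function names and the 30-second sampling interval are my own choices for illustration, not part of any tool; it only assumes the standard "key: value" layout of /proc/[pid]/io on Linux.

```python
import time
from pathlib import Path


def parse_proc_io(text: str) -> dict[str, int]:
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def io_is_static(before: dict[str, int], after: dict[str, int],
                 keys=("rchar", "wchar", "read_bytes", "write_bytes")) -> bool:
    """True if none of the selected counters changed between two snapshots."""
    return all(before.get(k) == after.get(k) for k in keys)


def sample_io(pid: int, interval_s: float = 30.0) -> bool:
    """Take two snapshots interval_s apart and report whether IO stood still.

    The interval is an arbitrary illustrative default; reading another
    user's /proc/<pid>/io may require root.
    """
    path = Path(f"/proc/{pid}/io")
    before = parse_proc_io(path.read_text())
    time.sleep(interval_s)
    after = parse_proc_io(path.read_text())
    return io_is_static(before, after)
```

Calling sample_io with the EngineCore PID from nvidia-smi and getting True back, while the port is still unbound, matches the hang described above. A single static sample is not proof of a deadlock on its own, so it is worth repeating it a few times.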
Attempts (all hit the same pattern):
eugr/spark-vllm-docker + Nemotron-3-Nano-30B-A3B-NVFP4, default flags — hung 40+ min
Same model + enforce-eager — same hang point
vllm/vllm-openai:nightly-aarch64: NVIDIA’s own official agent-ready Qwen3.6-35B-A3B-NVFP4 recipe from the “vLLM for Inference | DGX Spark” page, followed exactly as documented. Also tried with moe-backend=marlin, VLLM_USE_FLASHINFER_MOE_FP4=1, VLLM_FLASHINFER_MOE_BACKEND=latency, --ipc=host, enforce-eager, and both the FLASHINFER and TRITON_ATTN attention backends. All hung at the same point (full 10-min test on TRITON_ATTN). If NVIDIA’s own published recipe for this exact hardware doesn’t run out of the box, that seems worth flagging on its own.
eugr/spark-vllm-docker + QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ (different quant format entirely) — got further: past “Using FlashAttention version 2” and “Using MARLIN WNA16 MoE backend”, weights loading to 18.7GB, then hung on CUDA graph compile (no enforce-eager)
Same model + enforce-eager — hung at the same point but BEFORE weight loading started this time
bjk110 image (“Package vllm-spark · GitHub”) + Qwen3.6-35B-A3B-NVFP4 + enforce-eager: got furthest yet, with 66.7GB loaded into UMA and GPU-Util briefly showing 1% (vs flat 0% on all other attempts), then also went static with no further log output
What I’ve ruled out:
Not a single-image build issue — 3 independent image lineages, all affected, including NVIDIA’s own
Not NVFP4-specific — AWQ quantization also hangs (different point, but still hangs)
Not purely a CUDA graph compilation issue: enforce-eager never cleared the hang (same hang point on attempt 2 as on attempt 1, an earlier hang on attempt 5 than on attempt 4, and the bjk110 attempt hung with it enabled)
Not a shared-memory/IPC issue — --ipc=host made no difference
Not the GB10 hardware/driver itself — Ollama works perfectly, GPU-accelerated, confirmed via nvidia-smi during generation (qwen3:8b, real tok/s, GPU memory + utilization both behaving normally)
Open question / next test: all of the above was run with the default GNOME desktop session active (Xorg + gnome-shell + Firefox, ~350-650MB GPU memory consumed by the desktop throughout every attempt). I haven’t yet tried headless (multi-user.target, no GUI) — planning to test that tomorrow and will report back either way. If anyone has tried vLLM on GB10 headless vs. with the desktop active and seen a difference, that’d be a huge help to know before I burn more time on it.
Happy to provide full logs, nvidia-smi output, /proc/[pid]/status, strace, or anything else that would help diagnose. Fresh-out-of-box unit, so if there’s a known fix I’m glad to be the test case for others hitting this.
Update: Qwen3.6-35B-A3B-NVFP4 serving real inference. Full resolution below.
Summary of Issues Found
Issue 1 (FlashInfer JIT hang): Any Docker image where VLLM_HAS_FLASHINFER_CUBIN = False will deadlock silently when FlashInfer is selected as a backend. The images I originally tested (eugr/spark-vllm-docker at the time, vllm/vllm-openai:nightly-aarch64, and the bjk110 image from “Package vllm-spark · GitHub”) ship no pre-compiled FlashInfer cubins for SM12.1 (GB10), so FlashInfer falls back to JIT compilation, which deadlocks on SM12.1. Fix in old images: force Triton backends via --attention-config '{"backend": "TRITON_ATTN"}' + VLLM_USE_FLASHINFER_SAMPLER=0.
Issue 2 (NVFP4 weight loader KeyError): On the bjk110 image, serving nvidia/Qwen3.6-35B-A3B-NVFP4 crashed with KeyError: 'layers.0.mlp.experts.w2_input_scale'. The checkpoint uses model.language_model.layers.X.mlp.experts.Y.{gate_proj,up_proj,down_proj}.{weight,input_scale,weight_scale,weight_scale_2} (VLM-wrapped, per-expert per-projection), whereas the bjk110 image’s vLLM weight loader expected a different, fused key format.
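To make the Issue 2 key mismatch concrete, here is a small Python sketch that sorts tensor names into the two layouts. The per-expert pattern is taken from the checkpoint layout above. The "fused" pattern is only my guess, generalized from the single name in the KeyError (w2_input_scale), and is not vLLM's actual loader logic.

```python
import re
from collections import Counter

# Per-expert, per-projection layout as seen in the checkpoint
# (optionally wrapped in the VLM prefix model.language_model.).
PER_EXPERT_KEY = re.compile(
    r"^(?:model\.language_model\.)?layers\.(\d+)\.mlp\.experts\.(\d+)\."
    r"(gate_proj|up_proj|down_proj)\."
    r"(weight|input_scale|weight_scale|weight_scale_2)$"
)

# ASSUMPTION: fused names look like layers.N.mlp.experts.w<k>_<suffix>,
# extrapolated from the one key named in the KeyError.
FUSED_KEY = re.compile(r"^(?:.*\.)?layers\.(\d+)\.mlp\.experts\.(w\d+_\w+)$")


def classify_key(name: str) -> str:
    """Label a tensor name as 'per_expert', 'fused', or 'other'."""
    if PER_EXPERT_KEY.match(name):
        return "per_expert"
    if FUSED_KEY.match(name):
        return "fused"
    return "other"


def summarize(names: list[str]) -> Counter:
    """Count how many tensor names fall into each layout."""
    return Counter(classify_key(n) for n in names)
```

Running summarize over the tensor names of a checkpoint would show at a glance which layout it uses. If everything lands in per_expert while the loader asks for a fused name, you are looking at the same mismatch.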
Final Working Solution
Build the eugr image with today’s (2026-06-14) prebuilt wheels, which include:
vLLM 0.22.1rc1.dev511+gc621af169.d20260614 — fixes the NVFP4 weight loader
FlashInfer 0.6.13 with pre-compiled SM12.1 cubins — no more JIT hang, can use FlashInfer natively
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh --tf5
--tf5 is required: the model’s config declares transformers_version: 5.7.0.dev0.
Build takes ~4 minutes (downloads prebuilt wheels, no compilation).
Real tokens, no hang, system_fingerprint confirms vLLM version.
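If you want to tell a slow startup from a hang without watching logs, a readiness poll along these lines may help. This is a sketch using only the Python standard library. The 600-second deadline and 5-second poll interval are illustrative defaults, not values taken from vLLM.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def wait_for_port(host: str, port: int,
                  deadline_s: float = 600.0, poll_s: float = 5.0) -> bool:
    """Poll until the port accepts connections or the deadline passes."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if port_is_open(host, port):
            return True
        time.sleep(poll_s)
    return False
```

For example, wait_for_port("localhost", 8000) returning False after the full deadline, with EngineCore still holding GPU memory, is the hang pattern from the original post. A True result only means the port is bound, so follow it with a real request to /v1/models or /v1/chat/completions.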
Key Diagnostics for Others
python3 -c "import vllm.envs as e; print(e.VLLM_HAS_FLASHINFER_CUBIN)" — if False, your image needs either precompiled cubins or Triton fallback.
If you hit KeyError: 'layers.X.mlp.experts.w2_input_scale' while loading NVFP4, use the eugr prebuilt-vllm-current wheel from 2026-06-14 (0.22.1rc1.dev511+). The bjk110 image’s weight loader didn’t handle the VLM-wrapped per-expert per-projection scale format.
Corrupt HF cache: blobs with .incomplete suffix = partial downloads. Delete and re-download with HF_TOKEN (nvidia/* models are gated).
The VLLM_ATTENTION_BACKEND env var is not recognized by every vLLM version (it is absent from vllm.envs in v0.21.0 and silently ignored there). Override via --attention-config or --attention-backend, depending on version.
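For the corrupt-cache check, here is a short Python sketch that lists .incomplete blobs before you delete anything. It assumes the default Hugging Face hub cache location (~/.cache/huggingface/hub). If you set HF_HOME or mount the cache into a container at another path, pass that directory instead. The helper names are mine.

```python
from pathlib import Path

# Default Hugging Face hub cache; ASSUMPTION: adjust this if HF_HOME or a
# container volume mount puts the cache somewhere else on your system.
DEFAULT_CACHE = Path.home() / ".cache" / "huggingface" / "hub"


def find_incomplete_blobs(cache_dir: Path = DEFAULT_CACHE) -> list[Path]:
    """Return every *.incomplete file under cache_dir (partial downloads)."""
    if not cache_dir.is_dir():
        return []
    return sorted(p for p in cache_dir.rglob("*.incomplete") if p.is_file())


def total_size_gb(paths: list[Path]) -> float:
    """Sum the on-disk size of the given files, in decimal gigabytes."""
    return sum(p.stat().st_size for p in paths) / 1e9
```

The sketch deliberately only reports. Once the list looks right, delete the files and re-download with HF_TOKEN set, since the nvidia/* models are gated.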
Full Debug Timeline
Root Cause — CONFIRMED
FlashInfer has no pre-compiled cubins for SM12.1 (VLLM_HAS_FLASHINFER_CUBIN = False).
When any FlashInfer component is selected as a backend, it falls back to JIT CUDA kernel compilation.
JIT compilation on SM12.1 deadlocks silently. The process allocates GPU memory, then hangs with 40–160+ sleeping threads, wchan=futex_do_wait, zero GPU utilization, zero disk IO.
The hang appears immediately after Using FLASHINFER attention backend because attention backend init triggers the JIT path first. FlashInfer is also selected for MoE (FlashInfer CUTLASS) and FP8 linear (FlashInferFP8ScaledMMLinearKernel) — those would also hang if reached.
Secondary cause: All HuggingFace model caches were corrupt (incomplete downloads, all .incomplete blobs). This was masked because FlashInfer was hanging before weight loading was ever attempted.
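The workaround for images without cubins can be written down as a small decision helper. This is a sketch, not an official recipe: the flag and env var values are the ones that worked for me, triton_fallback and read_cubin_flag are hypothetical helper names, and whether --attention-config is accepted depends on your vLLM version.

```python
import json


def triton_fallback(has_flashinfer_cubin: bool) -> tuple[list[str], dict[str, str]]:
    """Return extra CLI args and env vars to steer vLLM away from FlashInfer.

    If precompiled cubins are present, nothing extra is needed.
    """
    if has_flashinfer_cubin:
        return [], {}
    args = ["--attention-config", json.dumps({"backend": "TRITON_ATTN"})]
    env = {"VLLM_USE_FLASHINFER_SAMPLER": "0"}
    return args, env


def read_cubin_flag():
    """Read VLLM_HAS_FLASHINFER_CUBIN from vllm.envs, or None if unavailable."""
    try:
        import vllm.envs as envs  # only importable inside a vLLM image
    except ImportError:
        return None
    return getattr(envs, "VLLM_HAS_FLASHINFER_CUBIN", None)
```

Inside the container, triton_fallback(bool(read_cubin_flag())) gives the extra arguments to append to the serve command and the variables to export. Treating a missing flag as False is a conservative choice on my part, since it costs at most a switch to Triton backends.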
Model: nvidia/Qwen3.6-35B-A3B-NVFP4
Result: ❌ HUNG. Headless made no difference.
Last log line: Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
Key debug finding:
DEBUG: {FLASH_ATTN: [kv_cache_dtype not supported], FLEX_ATTENTION: [kv_cache_dtype not supported],
TURBOQUANT: [kv_cache_dtype not supported]}
--kv-cache-dtype fp8 eliminates all other backends → FlashInfer selected → JIT hang.
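The elimination step can be illustrated with a toy model in Python. To be clear, this is not vLLM's selector: the priority order is a hypothetical one I picked so that the output matches the log line, and only the rejection reasons are copied from the debug output.

```python
def select_backend(priority: list[str],
                   rejected: dict[str, str]) -> tuple[str, list[str]]:
    """Pick the first backend in priority order that was not rejected.

    Returns the chosen backend and the list of remaining candidates.
    """
    potential = [name for name in priority if name not in rejected]
    if not potential:
        raise ValueError("no attention backend supports this configuration")
    return potential[0], potential


# Rejections as reported in the debug log with --kv-cache-dtype fp8.
REJECTED_WITH_FP8_KV = {
    "FLASH_ATTN": "kv_cache_dtype not supported",
    "FLEX_ATTENTION": "kv_cache_dtype not supported",
    "TURBOQUANT": "kv_cache_dtype not supported",
}

# ASSUMPTION: hypothetical priority order, chosen only for illustration.
PRIORITY = ["FLASHINFER", "FLASH_ATTN", "TRITON_ATTN",
            "FLEX_ATTENTION", "TURBOQUANT"]
```

With these inputs, select_backend(PRIORITY, REJECTED_WITH_FP8_KV) leaves FLASHINFER and TRITON_ATTN as candidates and picks FLASHINFER, which is the same outcome as the last log line. The point is that the fp8 KV cache setting narrows the field until FlashInfer wins by default, unless the backend is forced explicitly.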
Result: ❌ HUNG. VLLM_ATTENTION_BACKEND does not exist in vLLM v0.21.0, so setting it was silently ignored.
Finding: inspecting the vllm.envs module showed no such variable. The override must go through the --attention-config CLI arg.
Result: ✅ PAST HANG. Both backends switched to Triton and weight loading began.
It then stalled: EngineCore had opened .incomplete HF blob files and was attempting an unauthenticated download from a gated model. Discovery: all 4 cached models consisted entirely of incomplete weight files. Cache cleaned with find ... -name "*.incomplete" -delete.
Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 with --load-format dummy
Result: ✅ Server up, port 8000 responding, curl /v1/models → 200 OK.
Confirmed vLLM stack is fully functional with Triton backends.
Attempt 5 — Real weights: Nemotron FP8 (31GB, authenticated download)
Downloaded nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 via container with HF_TOKEN.
Result: ✅ FULL INFERENCE WORKING
INFO Application startup complete.
curl /v1/chat/completions → real tokens returned
Attempt 6 — Qwen3.6-35B-A3B-NVFP4 on bjk110 image (KeyError)
curl http://localhost:8000/v1/models
→ {"data":[{"id":"nvidia/Qwen3.6-35B-A3B-NVFP4","max_model_len":32768,...}]}
curl http://localhost:8000/v1/chat/completions -d '{"model":"nvidia/Qwen3.6-35B-A3B-NVFP4","messages":[{"role":"user","content":"Say hello in one word"}],"max_tokens":10}'
→ real tokens, system_fingerprint: vllm-0.22.1rc1.dev511+gc621af169.d20260614
Why it worked: New vLLM 0.22.1rc1 fixes the weight loader for Qwen3_5MoeForConditionalGeneration (NVFP4 VLM format). FlashInfer 0.6.13 includes pre-compiled SM12.1 cubins — no more JIT hang, attention_backend: FLASHINFER works natively.