Qwen3.6-35B-A3B-NVFP4 hangs after attention backend selection across 3 vLLM images, including NVIDIA's own official recipe

dan380 · June 14, 2026, 12:39am

Posting this in case it’s useful to others or someone recognizes the pattern — happy to provide more diagnostics if helpful.

Hardware/Software: DGX Spark (Founders Edition), driver 580.159.03, CUDA 13.0.2, kernel 6.17.0-1021-nvidia, DGX OS 7 (Ubuntu 24.04), running the default GNOME desktop session (Xorg + gnome-shell + Firefox open, ~350-650MB GPU memory in use by the desktop throughout).

nvidia-bug-report.log.gz (405.2 KB)

The issue: vLLM consistently hangs right after attention/MoE backend selection, before model weight loading meaningfully progresses. No errors, no crash — just silence. Reproduced across three independently-built images and multiple models/quantization formats — including NVIDIA’s own official “agent-ready Qwen3.6-35B” recipe from build.nvidia.com, followed verbatim.

The hang signature (identical pattern every time):

Last log line is always some variant of “Using [FLASHINFER/TRITON_ATTN/FLASH_ATTN] attention backend out of potential backends: […]”

EngineCore process shows in nvidia-smi with memory allocated (varies 18.9GB–66.7GB depending on attempt) but then goes static

GPU-Util sits at 0% (one attempt briefly showed 1%)

Process has 40-45 sleeping threads, near-zero cumulative CPU time across all of them

/proc/[pid]/io shows zero read_bytes, static rchar/wchar — no IO activity

Port never binds — curl to /health and /v1/models both fail to connect
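For anyone who wants to check for the same signature without eyeballing /proc by hand, here is a minimal sketch in Python. It only reads the standard Linux /proc files mentioned above; the function names are made up for illustration, and the "stalled" rule (no IO counter moved, no runnable thread in either sample) is my own assumption about what counts as a hang, not something vLLM defines.

```python
from pathlib import Path


def parse_proc_io(text: str) -> dict[str, int]:
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value.strip())
    return counters


def thread_state(stat_text: str) -> str:
    """Return the one-letter state from a /proc/<pid>/task/<tid>/stat line.

    The command name sits in parentheses and may contain spaces, so split
    on the last ')' instead of splitting the whole line on whitespace.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]


def snapshot(pid: int) -> dict:
    """Collect IO counters and a count of thread states for one process."""
    proc = Path("/proc") / str(pid)
    states: dict[str, int] = {}
    for task in (proc / "task").iterdir():
        try:
            state = thread_state((task / "stat").read_text())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listing and reading
        states[state] = states.get(state, 0) + 1
    return {"io": parse_proc_io((proc / "io").read_text()), "states": states}


def looks_stalled(before: dict, after: dict) -> bool:
    """Hypothetical rule: IO counters unchanged and no thread in state R."""
    io_static = before["io"] == after["io"]
    nothing_running = "R" not in before["states"] and "R" not in after["states"]
    return io_static and nothing_running
```

Take two `snapshot(pid)` calls on the EngineCore PID 30 seconds or so apart and pass both to `looks_stalled`. Reading another user's /proc/<pid>/io needs root, so run it as the same user as the container process or with sudo.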

Attempts (all hit the same pattern):

1. eugr/spark-vllm-docker + Nemotron-3-Nano-30B-A3B-NVFP4, default flags: hung 40+ min

2. Same model + enforce-eager: same hang point

3. vllm/vllm-openai:nightly-aarch64: this is NVIDIA’s own official agent-ready Qwen3.6-35B-A3B-NVFP4 recipe from build.nvidia.com/spark/vllm/agent-ready-qwen35b, followed exactly as documented. Also tried with moe-backend=marlin, VLLM_USE_FLASHINFER_MOE_FP4=1, VLLM_FLASHINFER_MOE_BACKEND=latency, --ipc=host, enforce-eager, both FLASHINFER and TRITON_ATTN attention backends. All hung at the same point (full 10-min test on TRITON_ATTN). If NVIDIA’s own published recipe for this exact hardware doesn’t run out of the box, that seems worth flagging on its own.

4. eugr/spark-vllm-docker + QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ (different quant format entirely): got further, past “Using FlashAttention version 2” and “Using MARLIN WNA16 MoE backend”, weights loading to 18.7GB, then hung on CUDA graph compile (no enforce-eager)

5. Same model + enforce-eager: hung at the same point but BEFORE weight loading started this time

6. bjk110’s vllm-spark image (Package vllm-spark · GitHub) + Qwen3.6-35B-A3B-NVFP4 + enforce-eager: got furthest yet, with 66.7GB loaded into UMA and GPU-Util briefly showing 1% (vs flat 0% on all other attempts), then also went static with no further log output

What I’ve ruled out:

Not a single-image build issue — 3 independent image lineages, all affected, including NVIDIA’s own

Not NVFP4-specific — AWQ quantization also hangs (different point, but still hangs)

Not purely a CUDA graph compilation issue — enforce-eager doesn’t consistently help (helped on attempt 1, didn’t help on attempts 4→5 or the bjk110 attempt)

Not a shared-memory/IPC issue — --ipc=host made no difference

Not the GB10 hardware/driver itself — Ollama works perfectly, GPU-accelerated, confirmed via nvidia-smi during generation (qwen3:8b, real tok/s, GPU memory + utilization both behaving normally)

Open question / next test: all of the above was run with the default GNOME desktop session active (Xorg + gnome-shell + Firefox, ~350-650MB GPU memory consumed by the desktop throughout every attempt). I haven’t yet tried headless (multi-user.target, no GUI) — planning to test that tomorrow and will report back either way. If anyone has tried vLLM on GB10 headless vs. with the desktop active and seen a difference, that’d be a huge help to know before I burn more time on it.

Happy to provide full logs, nvidia-smi output, /proc/[pid]/status, strace, or anything else that would help diagnose. Fresh-out-of-box unit, so if there’s a known fix I’m glad to be the test case for others hitting this.

ludbzh · June 14, 2026, 6:20pm

Stop fighting it. Look for a thread on this forum where a guy runs 2 Qwen 3.6 MoE models (Hermes version) on one Spark. He uses llama because vLLM is buggy at this time.

dan380 · June 14, 2026, 10:09pm

Update
vLLM DGX Spark Debug Log

System: NVIDIA GB10 (SM_12.1), Driver 580.159.03, CUDA 13.0.2, DGX OS 7 (Ubuntu 24.04), aarch64
RAM: 121GB unified (GPU+CPU shared pool) | Disk: 3.7TB
Session: 2026-06-14
Status: ✅ FULLY RESOLVED — Qwen3.6-35B-A3B-NVFP4 serving real inference on port 8000

Forum Update Post — Final Resolution (2026-06-14)

Thread: "Qwen3.6-35B-A3B-NVFP4 hangs after attention backend selection across 3 vLLM images, including NVIDIA's own official recipe"

Update: Qwen3.6-35B-A3B-NVFP4 serving real inference. Full resolution below.

Summary of Issues Found

Issue 1 (FlashInfer JIT hang): Any Docker image where VLLM_HAS_FLASHINFER_CUBIN = False will deadlock silently when FlashInfer is selected as a backend. SM12.1 (GB10) has no pre-compiled FlashInfer cubins in the images I originally tested (eugr/spark-vllm-docker at the time, vllm/vllm-openai:nightly-aarch64, and bjk110’s vllm-spark package on GitHub). JIT compilation on SM12.1 deadlocks. Fix in old images: force Triton backends via --attention-config '{"backend": "TRITON_ATTN"}' + VLLM_USE_FLASHINFER_SAMPLER=0.

Issue 2 (NVFP4 weight loader KeyError): On the bjk110 vllm-spark image, serving nvidia/Qwen3.6-35B-A3B-NVFP4 crashed with KeyError: 'layers.0.mlp.experts.w2_input_scale'. The checkpoint uses model.language_model.layers.X.mlp.experts.Y.{gate_proj,up_proj,down_proj}.{weight,input_scale,weight_scale,weight_scale_2} (VLM-wrapped, per-expert per-projection). The bjk110 image’s vLLM weight loader expected a different fused key format.

Final Working Solution

Build the eugr image with today’s (2026-06-14) prebuilt wheels, which include:

vLLM 0.22.1rc1.dev511+gc621af169.d20260614 — fixes the NVFP4 weight loader
FlashInfer 0.6.13 with pre-compiled SM12.1 cubins — no more JIT hang, can use FlashInfer natively

git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh --tf5

--tf5 is required: the model’s config declares transformers_version: 5.7.0.dev0.

Build takes ~4 minutes (downloads prebuilt wheels, no compilation).

Then serve:

docker run -d --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --name vllm_qwen36 \
  -e HF_TOKEN=<your_hf_token> \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 \
  -e VLLM_LOGGING_LEVEL=INFO \
  vllm-node-tf5:latest \
  vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --enforce-eager \
    --kv-cache-dtype fp8 \
    --attention-backend FLASHINFER

Verified

$ curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"nvidia/Qwen3.6-35B-A3B-NVFP4",...,"max_model_len":32768}]}

$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/Qwen3.6-35B-A3B-NVFP4",
       "messages":[{"role":"user","content":"Say hello in one word"}],
       "max_tokens":10}'
# → {"choices":[{"message":{"content":"Thinking Process:\n1.  **Analyze"...}}],"system_fingerprint":"vllm-0.22.1rc1.dev511+gc621af169.d20260614-..."}

Real tokens, no hang, system_fingerprint confirms vLLM version.
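Since the original symptom was a port that never binds, it can help to put a bounded wait in front of the curl checks so a slow model load and a true hang look different. A minimal sketch using only the Python standard library; `port_is_open` and `wait_for_port` are illustrative names, and the timeouts are placeholders to tune for your model size.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host: str, port: int, deadline_s: float,
                  interval_s: float = 5.0) -> bool:
    """Poll until the port accepts connections or deadline_s has elapsed."""
    end = time.monotonic() + deadline_s
    while True:
        if port_is_open(host, port):
            return True
        remaining = end - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))
```

For example, `wait_for_port("localhost", 8000, deadline_s=900)` returns False after 15 minutes instead of leaving you guessing. A True result only means the socket is bound; the curl calls above are still what confirm real inference.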

Key Diagnostics for Others

python3 -c "import vllm.envs as e; print(e.VLLM_HAS_FLASHINFER_CUBIN)" — if False, your image needs either precompiled cubins or Triton fallback.
If hitting KeyError: 'layers.X.mlp.experts.w2_input_scale' loading NVFP4: use today’s eugr prebuilt-vllm-current wheel (0.22.1rc1.dev511+). The bjk110 image’s weight loader didn’t handle the VLM-wrapped per-expert per-projection scale format.
Corrupt HF cache: blobs with .incomplete suffix = partial downloads. Delete and re-download with HF_TOKEN (nvidia/* models are gated).
VLLM_ATTENTION_BACKEND env var does not exist in older vLLM versions — override via --attention-config or --attention-backend depending on version.
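To check for the corrupt-cache case before deleting anything, a report-only scan along these lines may be useful. It is a sketch, not a tool: it assumes the usual layout where each cached repo is a top-level directory under ~/.cache/huggingface/hub, and the function names are hypothetical.

```python
from pathlib import Path


def find_incomplete_blobs(cache_root: Path) -> list[Path]:
    """List every file under cache_root whose name ends in .incomplete."""
    return sorted(p for p in cache_root.rglob("*.incomplete") if p.is_file())


def bytes_per_repo(paths: list[Path], cache_root: Path) -> dict[str, int]:
    """Sum partial-download sizes per top-level directory (one per repo)."""
    totals: dict[str, int] = {}
    for path in paths:
        repo = path.relative_to(cache_root).parts[0]
        totals[repo] = totals.get(repo, 0) + path.stat().st_size
    return totals
```

Usage would be something like `root = Path.home() / ".cache" / "huggingface" / "hub"` followed by `bytes_per_repo(find_incomplete_blobs(root), root)`. It deliberately does not delete; once you have confirmed which repos are affected, remove the partial blobs and re-download with HF_TOKEN set.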

Full Debug Timeline

Root Cause — CONFIRMED

FlashInfer has no pre-compiled cubins for SM12.1 (VLLM_HAS_FLASHINFER_CUBIN = False).
When any FlashInfer component is selected as a backend, it falls back to JIT CUDA kernel compilation.
JIT compilation on SM12.1 deadlocks silently. The process allocates GPU memory, then hangs with 40–160+ sleeping threads, wchan=futex_do_wait, zero GPU utilization, zero disk IO.

The hang appears immediately after Using FLASHINFER attention backend because attention backend init triggers the JIT path first. FlashInfer is also selected for MoE (FlashInfer CUTLASS) and FP8 linear (FlashInferFP8ScaledMMLinearKernel) — those would also hang if reached.

Secondary cause: All HuggingFace model caches were corrupt (incomplete downloads, all .incomplete blobs). This was masked because FlashInfer was hanging before weight loading was ever attempted.

Attempt 1 — Headless baseline + NCCL_DEBUG + VLLM_LOGGING_LEVEL=DEBUG

Model: nvidia/Qwen3.6-35B-A3B-NVFP4
Result: ❌ HUNG — headless made no difference.
Last log line: Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
Key debug finding:

DEBUG: {FLASH_ATTN: [kv_cache_dtype not supported], FLEX_ATTENTION: [kv_cache_dtype not supported],
        TURBOQUANT: [kv_cache_dtype not supported]}

--kv-cache-dtype fp8 rules out FLASH_ATTN, FLEX_ATTENTION, and TURBOQUANT, leaving only FLASHINFER and TRITON_ATTN → FlashInfer selected → JIT hang.

Attempt 2 — VLLM_ATTENTION_BACKEND=TRITON_ATTN (wrong approach)

Result: ❌ HUNG — VLLM_ATTENTION_BACKEND does not exist in vLLM v0.21.0. Silently ignored.
Finding: Inspected vllm.envs module — no such variable. Override must be via --attention-config CLI arg.

Attempt 3 — --attention-config + --kernel-config (correct approach)

--attention-config '{"backend": "TRITON_ATTN"}' --kernel-config '{"moe_backend": "triton"}'
-e VLLM_USE_FLASHINFER_SAMPLER=0

Result: ✅ PAST HANG — both backends switched to Triton, weight loading began.
Then stalled: EngineCore had .incomplete HF blob files open and was trying to download a gated model without authentication.
Discovery: All 4 cached models had fully incomplete weight files. Cache cleaned with find ... -name "*.incomplete" -delete.
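The Triton fallback from this attempt can be wrapped in a small helper so launch scripts only apply it when the image actually lacks cubins. This is a sketch under assumptions: the flag and variable names are the ones used in this thread and may differ in other vLLM versions, the helper names are made up, and treating an unknown probe result as "no cubins" is my own conservative choice.

```python
import json


def probe_flashinfer_cubin():
    """Read vllm.envs.VLLM_HAS_FLASHINFER_CUBIN; None if it cannot be read."""
    try:
        import vllm.envs as envs
    except ImportError:
        return None  # vLLM is not installed in this environment
    return getattr(envs, "VLLM_HAS_FLASHINFER_CUBIN", None)


def fallback_serve_args(has_flashinfer_cubin):
    """Return (extra CLI args, extra env vars) for `vllm serve`.

    With cubins present nothing is added. Otherwise (False or None) the
    Triton attention/MoE backends are forced and the FlashInfer sampler
    is disabled, mirroring the flags from Attempt 3.
    """
    if has_flashinfer_cubin:
        return [], {}
    args = [
        "--attention-config", json.dumps({"backend": "TRITON_ATTN"}),
        "--kernel-config", json.dumps({"moe_backend": "triton"}),
    ]
    env = {"VLLM_USE_FLASHINFER_SAMPLER": "0"}
    return args, env
```

Run inside the container, `fallback_serve_args(probe_flashinfer_cubin())` gives the extra arguments to append to the serve command and the environment variables to export before launching.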

Attempt 4 — TRITON_ATTN + --load-format dummy (stack validation)

Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 with --load-format dummy
Result: ✅ Server up, port 8000 responding, curl /v1/models → 200 OK.
Confirmed vLLM stack is fully functional with Triton backends.

Attempt 5 — Real weights: Nemotron FP8 (31GB, authenticated download)

Downloaded nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 via container with HF_TOKEN.
Result: ✅ FULL INFERENCE WORKING

INFO Application startup complete.
curl /v1/chat/completions → real tokens returned

Attempt 6 — Qwen3.6-35B-A3B-NVFP4 on bjk110 image (KeyError)

Image: ghcr.io/bjk110/vllm-spark:v022-d568
Config: --attention-config '{"backend": "TRITON_ATTN"}', kv_cache_dtype: fp8
Result: ❌ KeyError: 'layers.0.mlp.experts.w2_input_scale' at qwen3_5.py:393
Root cause: bjk110 image vLLM weight loader expected fused key format; NVFP4 checkpoint uses VLM-wrapped per-expert per-projection scales.
Checkpoint structure: model.language_model.layers.X.mlp.experts.Y.{gate_proj,up_proj,down_proj}.{weight,input_scale,weight_scale,weight_scale_2} — 3 shards (10GB + 10GB + 3GB), architecture Qwen3_5MoeForConditionalGeneration.
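To see which expert key layout a checkpoint uses before handing it to a loader, the key names can be classified with a regular expression. A minimal sketch: the pattern is built from the checkpoint structure quoted above, the function names are hypothetical, and reading keys from the `weight_map` of a `model.safetensors.index.json` file assumes the checkpoint ships the usual sharded-safetensors index.

```python
import json
import re
from collections import Counter
from pathlib import Path

# Matches ...mlp.experts.<Y>.<projection>.<tensor>, the per-expert layout.
PER_EXPERT = re.compile(
    r"\.mlp\.experts\.(?P<expert>\d+)\."
    r"(?P<proj>gate_proj|up_proj|down_proj)\."
    r"(?P<tensor>weight|input_scale|weight_scale|weight_scale_2)$"
)


def classify_expert_keys(keys) -> Counter:
    """Count expert keys that follow the per-expert layout vs. anything else."""
    counts: Counter = Counter()
    for key in keys:
        if ".mlp.experts." not in key:
            continue  # not an MoE expert tensor
        counts["per_expert" if PER_EXPERT.search(key) else "other"] += 1
    return counts


def keys_from_index(index_path: Path) -> list[str]:
    """Read tensor names from a sharded checkpoint's index file (assumed)."""
    return list(json.loads(index_path.read_text())["weight_map"])
```

If every expert key lands in `per_expert`, a loader that looks up fused names such as `layers.0.mlp.experts.w2_input_scale` will not find them, which matches the KeyError seen here.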

Attempt 7 — eugr vllm-node-tf5 image (FINAL — SUCCESS)

Trigger: eugr_nv (NVIDIA moderator) pushed prebuilt-vllm-current release 2026-06-14 with vLLM 0.22.1rc1.dev511.

Image built:

git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh --tf5   # 4 min, downloads prebuilt wheels
# → vllm-node-tf5:latest

Config (vllm_qwen36_config.yaml):

model: nvidia/Qwen3.6-35B-A3B-NVFP4
host: 0.0.0.0
port: 8000
tensor_parallel_size: 1
trust_remote_code: true
gpu_memory_utilization: 0.85
max_model_len: 32768
enforce_eager: true
kv_cache_dtype: fp8
attention_backend: FLASHINFER

Result: ✅ FULL INFERENCE WORKING

curl http://localhost:8000/v1/models
→ {"data":[{"id":"nvidia/Qwen3.6-35B-A3B-NVFP4","max_model_len":32768,...}]}

curl http://localhost:8000/v1/chat/completions -d '{"model":"nvidia/Qwen3.6-35B-A3B-NVFP4","messages":[{"role":"user","content":"Say hello in one word"}],"max_tokens":10}'
→ real tokens, system_fingerprint: vllm-0.22.1rc1.dev511+gc621af169.d20260614

Why it worked: New vLLM 0.22.1rc1 fixes the weight loader for Qwen3_5MoeForConditionalGeneration (NVFP4 VLM format). FlashInfer 0.6.13 includes pre-compiled SM12.1 cubins — no more JIT hang, attention_backend: FLASHINFER works natively.

Thank you @eugr_nv for the quick turnaround.
