Looks like the fix is in: fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels by renehonig · Pull Request #33417 · vllm-project/vllm · GitHub. I'll test it soon; hopefully it works with sm121 too, not just sm120.
Doesn’t the Spark DGX OS already have the NVIDIA Container Toolkit installed?
If you already have it installed, then set the nvidia runtime as the default in the Docker daemon configuration file.
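For reference, this is the usual shape of that config (a sketch assuming a standard NVIDIA Container Toolkit install with `nvidia-container-runtime` on the PATH; back up any existing `daemon.json` first):

```shell
# Make the nvidia runtime the Docker default, then restart the daemon.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker
```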
Newer versions of Docker can just use `--gpus=all` instead.
Well, it works, but flashinfer spits out a lot of errors and eventually it crashes. There was also something weird in the response that caused llama-benchy to ignore anything below 8192 tokens of context. I'll troubleshoot tomorrow if I have time.
I released a new version of llama-benchy, but the model crashed when benchmarking :)
| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 | 10913.72 ± 57.47 | 244.22 ± 0.99 | 187.66 ± 0.99 | 244.41 ± 0.91 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg128 | 57.34 ± 0.12 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d4096 | 6555.06 ± 4272.06 | 3000.39 ± 3558.27 | 2943.83 ± 3558.27 | 3000.54 ± 3558.30 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d4096 | 57.22 ± 0.17 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d4096 | 2687.50 ± 509.25 | 851.56 ± 173.92 | 795.00 ± 173.92 | 851.71 ± 173.94 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg128 @ d4096 | 57.14 ± 0.30 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d8192 | 8787.21 ± 68.75 | 988.84 ± 7.26 | 932.28 ± 7.26 | 988.93 ± 7.26 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d8192 | 56.87 ± 0.38 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d8192 | 1710.83 ± 23.56 | 1253.87 ± 16.65 | 1197.31 ± 16.65 | 1253.96 ± 16.66 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg128 @ d8192 | 53.24 ± 5.66 | | | |
llama-benchy (0.1.2)
date: 2026-01-31 23:58:33 | latency mode: generation
I did an AWQ quant with llm-compressor for comparison and gave llama-benchy some runs:
| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | pp2048 | 6760.90 ± 14.10 | 345.92 ± 0.63 | 302.92 ± 0.63 | 346.00 ± 0.63 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | tg128 | 82.58 ± 0.40 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_pp @ d4096 | 6409.33 ± 16.35 | 682.02 ± 1.58 | 639.02 ± 1.58 | 682.08 ± 1.56 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_tg @ d4096 | 82.51 ± 0.10 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | pp2048 @ d4096 | 2077.74 ± 0.46 | 1028.68 ± 0.22 | 985.69 ± 0.22 | 1028.76 ± 0.23 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | tg128 @ d4096 | 82.01 ± 0.25 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_pp @ d8192 | 6100.29 ± 5.13 | 1385.88 ± 1.13 | 1342.89 ± 1.13 | 1385.96 ± 1.14 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_tg @ d8192 | 81.96 ± 0.25 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | pp2048 @ d8192 | 1204.08 ± 3.74 | 1743.89 ± 5.29 | 1700.90 ± 5.29 | 1743.97 ± 5.29 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | tg128 @ d8192 | 81.98 ± 0.11 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_pp @ d16384 | 5837.62 ± 8.06 | 2849.62 ± 3.88 | 2806.63 ± 3.88 | 2849.71 ± 3.88 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_tg @ d16384 | 81.50 ± 0.03 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | pp2048 @ d16384 | 640.88 ± 1.43 | 3238.61 ± 7.12 | 3195.61 ± 7.12 | 3238.69 ± 7.12 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | tg128 @ d16384 | 81.11 ± 0.28 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_pp @ d32768 | 5420.10 ± 6.76 | 6088.65 ± 7.53 | 6045.65 ± 7.53 | 6088.74 ± 7.54 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_tg @ d32768 | 79.91 ± 0.11 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | pp2048 @ d32768 | 316.11 ± 0.15 | 6521.68 ± 3.04 | 6478.69 ± 3.04 | 6521.76 ± 3.04 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | tg128 @ d32768 | 79.53 ± 0.38 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_pp @ d65536 | 4793.14 ± 8.89 | 13715.91 ± 25.39 | 13672.91 ± 25.39 | 13715.99 ± 25.38 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_tg @ d65536 | 77.47 ± 0.10 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | pp2048 @ d65536 | 144.01 ± 0.20 | 14263.78 ± 20.02 | 14220.78 ± 20.02 | 14263.89 ± 20.03 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | tg128 @ d65536 | 76.96 ± 0.23 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_pp @ d131072 | 3857.39 ± 4.53 | 34022.50 ± 39.85 | 33979.51 ± 39.85 | 34022.58 ± 39.84 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | ctx_tg @ d131072 | 72.80 ± 0.05 | | | |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | pp2048 @ d131072 | 58.90 ± 0.06 | 34811.10 ± 35.76 | 34768.10 ± 35.76 | 34811.19 ± 35.79 |
| stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ | tg128 @ d131072 | 72.74 ± 0.23 | | | |
llama-benchy (0.1.2)
date: 2026-02-01 11:18:56 | latency mode: generation
Using eugr’s vLLM docker brew scripts (`--use-wheels --pre-tf`) and 0.16.0rc1.dev81+g672023877.cu13.
Quant over here:
When you say model crashed, are you seeing this in dmesg and getting 500 errors in vllm?
[ 3886.179778] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 0): Illegal Instruction Parameter
[ 3886.179787] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 1, SM 0): Multiple Warp Errors
[ 3886.179792] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x516730=0x2000b 0x516734=0x24 0x516728=0x1c81fb60 0x51672c=0x1174
[ 3886.179796] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 1): Illegal Instruction Parameter
[ 3886.179965] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 1, SM 1): Multiple Warp Errors
[ 3886.180352] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x5167b0=0x1000b 0x5167b4=0x24 0x5167a8=0x1c81fb60 0x5167ac=0x1174
[ 3886.180357] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3, SM 0): Illegal Instruction Parameter
[ 3886.180451] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 3, SM 0): Multiple Warp Errors
[ 3886.180531] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x518730=0x1000b 0x518734=0x24 0x518728=0x1c81fb60 0x51872c=0x1174
[ 3886.180790] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3, SM 1): Illegal Instruction Parameter
[ 3886.180889] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 3, SM 1): Multiple Warp Errors
[ 3886.180979] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x5187b0=0x1000b 0x5187b4=0x24 0x5187a8=0x1c81fb60 0x5187ac=0x1174
[ 3886.181289] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 5, SM 0): Illegal Instruction Parameter
[ 3886.181388] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 5, SM 0): Multiple Warp Errors
[ 3886.181473] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x51a730=0x2000b 0x51a734=0x24 0x51a728=0x1c81fb60 0x51a72c=0x1174
[ 3886.181711] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 5, SM 1): Illegal Instruction Parameter
[ 3886.181814] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 5, SM 1): Multiple Warp Errors
[ 3886.181891] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x51a7b0=0x2000b 0x51a7b4=0x24 0x51a7a8=0x1c81fb60 0x51a7ac=0x1174
[ 3886.182224] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 1, SM 0): Illegal Instruction Parameter
[ 3886.182319] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Global Exception on (GPC 2, TPC 1, SM 0): Multiple Warp Errors
[ 3886.182397] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x526730=0x3000b 0x526734=0x24 0x526728=0x1c81fb60 0x52672c=0x1174
[ 3886.182669] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 1, SM 1): Illegal Instruction Parameter
[ 3886.182767] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Global Exception on (GPC 2, TPC 1, SM 1): Multiple Warp Errors
[ 3886.182845] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x5267b0=0x3000b 0x5267b4=0x24 0x5267a8=0x1c81fb60 0x5267ac=0x1174
[ 3886.183149] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 2, SM 1): Illegal Instruction Parameter
[ 3886.183246] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Global Exception on (GPC 2, TPC 2, SM 1): Multiple Warp Errors
[ 3886.183332] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x5277b0=0x3000b 0x5277b4=0x24 0x5277a8=0x1c81fb60 0x5277ac=0x1174
[ 3886.183625] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 3, SM 0): Illegal Instruction Parameter
[ 3886.183719] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x528730=0x2000b 0x528734=0x20 0x528728=0x1c81fb60 0x52872c=0x1174
[ 3886.183938] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 3, SM 1): Illegal Instruction Parameter
[ 3886.184042] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x5287b0=0x2000b 0x5287b4=0x20 0x5287a8=0x1c81fb60 0x5287ac=0x1174
[ 3886.184333] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 5, SM 0): Illegal Instruction Parameter
[ 3886.184427] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x52a730=0xb 0x52a734=0x20 0x52a728=0x1c81fb60 0x52a72c=0x1174
[ 3886.184677] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 5, SM 1): Illegal Instruction Parameter
[ 3886.184768] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x52a7b0=0x3000b 0x52a7b4=0x20 0x52a7a8=0x1c81fb60 0x52a7ac=0x1174
[ 3886.185096] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 1, SM 0): Illegal Instruction Parameter
[ 3886.185189] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x536730=0x1000b 0x536734=0x20 0x536728=0x1c81fb60 0x53672c=0x1174
[ 3886.185401] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 1, SM 1): Illegal Instruction Parameter
[ 3886.185498] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x5367b0=0x3000b 0x5367b4=0x20 0x5367a8=0x1c81fb60 0x5367ac=0x1174
[ 3886.185791] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 2, SM 0): Illegal Instruction Parameter
[ 3886.185884] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x537730=0x2000b 0x537734=0x20 0x537728=0x1c81fb60 0x53772c=0x1174
[ 3886.186156] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 3, SM 0): Illegal Instruction Parameter
[ 3886.186252] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x538730=0x3000b 0x538734=0x20 0x538728=0x1c81fb60 0x53872c=0x1174
[ 3886.186484] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 3, SM 1): Illegal Instruction Parameter
[ 3886.186585] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x5387b0=0x1000b 0x5387b4=0x20 0x5387a8=0x1c81fb60 0x5387ac=0x1174
[ 3886.186890] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 5, SM 0): Illegal Instruction Parameter
[ 3886.186984] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x53a730=0x1000b 0x53a734=0x20 0x53a728=0x1c81fb60 0x53a72c=0x1174
[ 3886.187199] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 5, SM 1): Illegal Instruction Parameter
[ 3886.187292] NVRM: Xid (PCI:000f:01:00): 13, Graphics Exception: ESR 0x53a7b0=0xb 0x53a7b4=0x20 0x53a7a8=0x1c81fb60 0x53a7ac=0x1174
[ 3886.190317] NVRM: Xid (PCI:000f:01:00): 43, pid=20919, name=VLLM::EngineCor, channel 0x00000002
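If you want to pick these events out of a noisy kernel log, a simple `grep` over `dmesg` output does the job. A sketch, demonstrated on two sample lines from the dump above (the live equivalent would be something like `sudo dmesg -wT | grep --line-buffered 'NVRM: Xid'`):

```shell
# Extract NVIDIA Xid events from a kernel log excerpt. Xid 13 is a GPU
# exception (illegal instruction here); Xid 43 marks the faulting channel
# being torn down afterwards.
xids=$(grep -oE 'Xid \(PCI:[0-9a-f:.]+\): [0-9]+' <<'EOF'
[ 3886.179778] NVRM: Xid (PCI:000f:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 0): Illegal Instruction Parameter
[ 3886.190317] NVRM: Xid (PCI:000f:01:00): 43, pid=20919, name=VLLM::EngineCor, channel 0x00000002
EOF
)
echo "$xids"
```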
Yep, and also exceptions in vLLM log itself.
Yes, AWQ is still outperforming NVFP4 for inference. Also notice how much more consistent the results are between runs.
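To put a number on that consistency, compare the coefficient of variation (stddev / mean) for the matching pp2048 @ d4096 rows in the two tables above (values copied from the tables; just a quick sketch):

```python
# Coefficient of variation for the pp2048 @ d4096 prefill throughput rows:
# lower means more run-to-run consistency.
def cv(mean: float, std: float) -> float:
    return std / mean

nvfp4 = cv(2687.50, 509.25)  # NVFP4 quant: roughly 19% spread
awq = cv(2077.74, 0.46)      # AWQ quant: roughly 0.02% spread

print(f"NVFP4 CV: {nvfp4:.1%}")
print(f"AWQ   CV: {awq:.2%}")
```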
🤔 where is this the case?…
See two benchmarks for Nemotron above. 83 t/s for AWQ (this is what I’d expect of 3B active params) vs 60 t/s for NVFP4.
But that’s with a standard vLLM build where FP4 pathways are not fully enabled for sm121.
I’ve been trying to get the Nemotron 3 Nano 30B A3B NVFP4 model running, with varying degrees of success.
I’m also seeing the crashes in flashinfer at startup, and these were also present in the official NVIDIA vLLM container that just came out.
Otherwise it runs pretty well, but then eventually just crashes. It may take a while to crash, but I haven’t been able to keep it stable.
When trying to run in a cluster, I get nowhere: it will start a query and then, very shortly after, go into a loop or crash.
I did try the AWQ quant, and indeed it runs even faster.
My big concern is that we may be losing a bit of accuracy relative to NVFP4. My use case is running the model in a harness like OpenCode to manage software development projects, and the big problem there is accuracy, especially where tool calls get broken. I got a broken tool call within minutes of starting a planning task, so that’s not really ideal. The NVFP4 quants seemed to give me more accurate responses, but then they just crashed after maybe an hour or so.
Let’s hope this whole NVFP4 stack can get sorted out pretty soon!
Yes, this seems to be a W4A16 model, and the NVFP4 quant was produced by NVIDIA itself with quantization-aware training, so the AWQ quant will need a very good calibration dataset to get close in accuracy.
This is the sort of crash I’m seeing after a fair amount of processing -
(EngineCore_DP0 pid=100) ERROR 02-02 07:31:50 [core.py:968] torch.AcceleratorError: CUDA error: an illegal instruction was encountered
(EngineCore_DP0 pid=100) ERROR 02-02 07:31:50 [core.py:968] Search for `cudaErrorIllegalInstruction` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=100) ERROR 02-02 07:31:50 [core.py:968] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=100) ERROR 02-02 07:31:50 [core.py:968] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=100) ERROR 02-02 07:31:50 [core.py:968] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
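As that log suggests, relaunching with synchronous kernel launches usually makes the stack trace point at the actual failing kernel instead of some later CUDA call. A sketch (serve args elided, same as your normal launch):

```shell
# Force synchronous kernel launches so the illegal-instruction error is raised
# at the launch site rather than asynchronously at an unrelated API call.
export CUDA_LAUNCH_BLOCKING=1
# vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 ...  (same args as before)
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```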
Hi,
how do you do this exactly (commands-wise), especially this part: `0.16.0rc1.dev81+g672023877.cu13`, to get it running?
Using the world famous build job out of GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks
./build-and-copy.sh --use-wheels --pre-tf
The job pulls the nightly build of the wheels by default, so I must have fetched yesterday's build.
--use-wheels [mode] : Use prebuilt vLLM wheels. Mode can be 'nightly' (default) or 'release'
--pre-tf, --pre-transformers : Install transformers 5.0.0rc0 or higher
And I used the latest transformers as of yesterday (because 5.0.0 is needed for glm-4.7-flash).
Ah ok, it’s the same! I’ll try building it again.
Also about vllm serve command, what parameters did you use?
Any specific ones for this:
stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ
similar to the glm ones:
launch-cluster.sh -t vllm-node-tf5 --solo \
exec vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--max-model-len 131072 \
--max-num-batched-tokens 4096 \
--max-num-seqs 64 \
--host 0.0.0.0 --port 30000 \
--gpu-memory-utilization 0.7 \
--enable-expert-parallel
This is how I use the container:
cosinus@vroomfondel$ docker run --rm -it --gpus all --ipc=host --name vLLM -v $HOME/models:/models -v $HOME/.cache:/root/.cache -e HF_HUB_CACHE=/models -e HF_TOKEN=hf_replace_me --entrypoint /bin/bash -p 8000:8000 vllm-node:20260201
root@dd35fdc73d20:/workspace/vllm# wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py && \
vllm serve stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ \
--port 8000 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3 \
--kv-cache-dtype fp8
I’ve added a mod that handles parser download - testing it now, and will commit to the repo shortly.