Removed both the cubin and jit-cache packages; now nemotron-3-super crashes on startup instead of during inference, with the same illegal-instruction errors.
Blow away whatever is in ~/.cache/flashinfer:
sudo rm -rf ~/.cache/flashinfer/
That helped with the startup, but it eventually crashed during inference with an illegal instruction. That’s with the FLASHINFER_CUTLASS MoE backend; VLLM_CUTLASS works fine (but it worked fine before as well).
I’ll try Super after this run. What benchmark are you using to trigger it? I’ll see if I can replicate.
Please post the steps to replicate.
Just llama-benchy --base-url http://spark3.home.eugr.net:8888/v1 --depth 0 4096 16384 32078 65535 100000 200000; it usually crashes before even reaching 32768.
To reproduce:
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--gpu-memory-utilization 0.7 \
--max-model-len auto \
--max-num-seqs 10 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8888 \
--enable-auto-tool-choice \
--load-format fastsafetensors \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3 \
--mamba_ssm_cache_dtype float32 \
--attention-backend TRITON_ATTN
Not sure if it’s relevant to the crashing, but mine is using FlashInfer attention, not Triton.
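If you’re not sure which attention backend your own server ended up with, vLLM prints the selection during startup, so a quick grep of the serve log shows it. A minimal sketch, assuming the server output was saved to vllm.log (the exact wording of the log line varies between vLLM versions, so adjust the pattern as needed):
# look for the attention backend selection line in the captured server log
grep -i "attention backend" vllm.log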
Are you using Triton 3.6.0? It’s bugged on DGX Spark and AGX Thor; in my case I’m waiting for the 3.7.0 release before starting to use it.
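For reference, checking which Triton actually ends up in the vLLM environment is quick (a minimal check; pip show triton works just as well):
# print the Triton version that vLLM will import
python -c "import triton; print(triton.__version__)"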
To be more thorough, run sudo rm -rf ~/.cache/; that also removes the vLLM caches and gives you a clean start.
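If you’d rather not wipe everything under ~/.cache, a more targeted version is below, assuming the default cache locations (~/.cache/vllm is where vLLM keeps its compilation artifacts):
# clear only the FlashInfer and vLLM caches instead of the whole ~/.cache
rm -rf ~/.cache/flashinfer ~/.cache/vllm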
Crashed with this in dmesg. It took 5 hours, and I don’t think I’ve even seen one like this before:
NVRM: Xid (PCI:000f:01:00): 31, pid=73226, name=VLLM::EngineCor, channel 0x00000002, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC1 GPCCLIENT_T1_11 faulted @ 0x0_04000000. Fault is of type FAULT_PTE ACCESS_TYPE_VIRT_READ
21 million tokens down the drain; it was looking so promising:
865/990 running_score=0.7734 elapsed=17962.10445919598
I got these illegal-instruction errors on both Nano and Super (there was a thread about it for Nano at nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 · Tool use crash the model). I gave up on the Nemotron models 😞
You have to use my PRs: one was merged, the other one:
Then uninstall flashinfer-cubin and install flashinfer-python from main to get the best performance.
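For anyone following along, a minimal sketch of that last step in a pip-managed environment; the repo URL and branch are assumptions on my part, and building FlashInfer from source may need extra build dependencies on your setup:
# drop the prebuilt cubin package
pip uninstall -y flashinfer-cubin
# install flashinfer-python from the main branch (repo URL/branch assumed)
pip install "flashinfer-python @ git+https://github.com/flashinfer-ai/flashinfer.git@main"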
Honestly, no idea… Most vLLM-related Xid errors people see are Xid 48 (double-bit ECC) or Xid 63 (row remapping failure); those are straightforward VRAM cell failures. Xid 31 is different because it’s a page-table / address-translation fault (so it looks like pressure on the page table walker).
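If you want to catch the next one as it happens, the Xid reports all land in the kernel log, so leaving something like this running in another terminal during the benchmark will pick them up:
# follow the kernel log and print any NVRM Xid reports as they appear
sudo dmesg -w | grep -i xid
# and, to rule out the plain VRAM failures, check the ECC status (may be unsupported on some boards)
nvidia-smi -q -d ECC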
Running this now to reproduce it.
Didn’t crash during this workload, and I’ve never seen my GB10 pull 170 watts from the wall until today:
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 1.82 ms
Running test: pp=2048, tg=32, depth=0, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16384, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32078, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=65535, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=100000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=200000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------------------------------------------|-----------------:|-----------------:|-------------:|------------------:|------------------:|------------------:|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 | 1798.45 ± 491.34 | | 1256.41 ± 424.09 | 1254.58 ± 424.09 | 1256.44 ± 424.09 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 | 14.32 ± 0.05 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d4096 | 1589.24 ± 818.61 | | 6636.70 ± 5375.44 | 6634.87 ± 5375.44 | 6636.72 ± 5375.44 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d4096 | 14.42 ± 0.02 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 | 2134.77 ± 5.50 | | 8636.08 ± 22.27 | 8634.26 ± 22.27 | 8636.11 ± 22.27 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d16384 | 14.37 ± 0.06 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d32078 | 2044.96 ± 29.19 | | 16693.15 ± 240.65 | 16691.33 ± 240.65 | 16693.18 ± 240.65 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d32078 | 14.25 ± 0.03 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d65535 | 1952.16 ± 1.03 | | 34621.45 ± 18.21 | 34619.63 ± 18.21 | 34621.48 ± 18.21 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d65535 | 14.22 ± 0.08 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d100000 | 1829.69 ± 1.46 | | 55775.20 ± 44.50 | 55773.37 ± 44.50 | 55775.23 ± 44.50 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d100000 | 14.11 ± 0.05 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d200000 | 1559.48 ± 0.68 | | 129563.26 ± 56.15 | 129561.44 ± 56.15 | 129563.29 ± 56.15 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d200000 | 14.08 ± 0.10 | 15.33 ± 0.47 | | | |
llama-benchy (0.3.5)
date: 2026-03-28 18:39:10 | latency mode: api
Going to power it off and try again on the GPQA pass.
I’ve built with your PR applied:
2026-03-28T16:22:47.700847Z 01E 2026-03-28 09:22:47,700 - INFO - #15 [vllm-builder 5/7] RUN curl -fsL https://patch-diff.githubusercontent.com/raw/vllm-project/vllm/pull/38423.diff -o pr38423.diff && if git apply --reverse --check pr38423.diff 2>/dev/null; then echo "Patch already applied, skipping."; else echo "Applying patch..."; git apply -v pr38423.diff; fi && rm pr38423.diff
2026-03-28T16:22:48.099330Z 01E 2026-03-28 09:22:48,099 - INFO - #15 0.549 Applying patch...
2026-03-28T16:22:48.312793Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.549 Checking patch CMakeLists.txt...
2026-03-28T16:22:48.312800Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.549 Checking patch csrc/quantization/fp4/nvfp4_quant_entry.cu...
2026-03-28T16:22:48.312804Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.549 Checking patch csrc/quantization/fp4/nvfp4_scaled_mm_entry.cu...
2026-03-28T16:22:48.312810Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.549 Checking patch csrc/quantization/machete/machete_mainloop.cuh...
2026-03-28T16:22:48.312827Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.550 Checking patch docker/Dockerfile...
2026-03-28T16:22:48.312839Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.550 Checking patch docker/Dockerfile.nightly_torch...
2026-03-28T16:22:48.312850Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.550 Checking patch docker/versions.json...
2026-03-28T16:22:48.312862Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.550 Checking patch requirements/cuda.txt...
2026-03-28T16:22:48.312875Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.550 Checking patch vllm/model_executor/layers/quantization/utils/nvfp4_utils.py...
2026-03-28T16:22:48.312890Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch CMakeLists.txt cleanly.
2026-03-28T16:22:48.312899Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch csrc/quantization/fp4/nvfp4_quant_entry.cu cleanly.
2026-03-28T16:22:48.312911Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch csrc/quantization/fp4/nvfp4_scaled_mm_entry.cu cleanly.
2026-03-28T16:22:48.312923Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch csrc/quantization/machete/machete_mainloop.cuh cleanly.
2026-03-28T16:22:48.312936Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch docker/Dockerfile cleanly.
2026-03-28T16:22:48.312948Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch docker/Dockerfile.nightly_torch cleanly.
2026-03-28T16:22:48.312967Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch docker/versions.json cleanly.
2026-03-28T16:22:48.312998Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch requirements/cuda.txt cleanly.
2026-03-28T16:22:48.313010Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch vllm/model_executor/layers/quantization/utils/nvfp4_utils.py cleanly.
2026-03-28T16:22:48.313022Z 01E 2026-03-28 09:22:48,313 - INFO - #15 DONE 0.6s
What performance do you get from this model?
Please use llama-benchy to benchmark; vLLM’s own logs do not represent reality as seen from the client side.
I’m also looking forward to some kind of comparison. This topic has reached the top in terms of message count, but it’s still not clear where the increase in speed and quality is. :)
From what I’ve seen, the biggest increase you should expect from marlin → cutlass is in prefill or at high batch/concurrency. If you’re waiting for a low-batch (fewer than 16 concurrent) or single-user decode bump, that’s going to have to come from KV cache quant under fp8, currently.
There are absolutely software performance increases to be had, but at least with the CUTLASS MMA instructions and the hardware available to the Spark, I don’t see it coming just from this.
Edit: note - I would gladly be wrong though
