Removed both the cubin and jit-cache packages; now nemotron-3-super crashes on startup instead of during inference, with the same illegal-instruction errors.
Blow away whatever is in ~/.cache/flashinfer:
sudo rm -rf ~/.cache/flashinfer/
That helped with the startup, but it eventually crashed during inference with an illegal instruction. That’s with the FLASHINFER_CUTLASS MoE backend; VLLM_CUTLASS works fine (but it worked fine before as well).
I’ll try Super after this run. What benchmark are you using to trigger it? I’ll see if I can replicate.
Please post the steps to replicate.
Just llama-benchy --base-url http://spark3.home.eugr.net:8888/v1 --depth 0 4096 16384 32078 65535 100000 200000; it usually crashes before even reaching 32768.
To reproduce:
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--gpu-memory-utilization 0.7 \
--max-model-len auto \
--max-num-seqs 10 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8888 \
--enable-auto-tool-choice \
--load-format fastsafetensors \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3 \
--mamba_ssm_cache_dtype float32 \
--attention-backend TRITON_ATTN
Not sure if it’s relevant to the crashing, but mine is using FlashInfer attention, not Triton.
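If you’re not sure which attention backend your own server ended up with, vLLM prints the selection during startup, so a quick grep of the serve log shows it. A minimal sketch, assuming the server output was saved to vllm.log (the exact wording of the log line varies between vLLM versions, so adjust the pattern as needed):
# look for the attention backend selection line in the captured server log
grep -i "attention backend" vllm.log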
Are you using Triton 3.6.0? It’s bugged on DGX Spark and AGX Thor; in my case I’m waiting for the 3.7.0 release before starting to use it.
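For reference, checking which Triton actually ends up in the vLLM environment is quick (a minimal check; pip show triton works just as well):
# print the Triton version that vLLM will import
python -c "import triton; print(triton.__version__)"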
To be more thorough, run sudo rm -rf ~/.cache/; that also removes the vLLM caches and gives you a clean start.
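If you’d rather not wipe everything under ~/.cache, a more targeted version is below, assuming the default cache locations (~/.cache/vllm is where vLLM keeps its compilation artifacts):
# clear only the FlashInfer and vLLM caches instead of the whole ~/.cache
rm -rf ~/.cache/flashinfer ~/.cache/vllm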
Crashed with this in dmesg. It took 5 hours, and I don’t think I’ve even seen one like this before:
NVRM: Xid (PCI:000f:01:00): 31, pid=73226, name=VLLM::EngineCor, channel 0x00000002, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC1 GPCCLIENT_T1_11 faulted @ 0x0_04000000. Fault is of type FAULT_PTE ACCESS_TYPE_VIRT_READ
21 million tokens down the drain; it was looking so promising:
865/990 running_score=0.7734 elapsed=17962.10445919598
I got these illegal-instruction errors on both Nano and Super (there was a thread about it for Nano at nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 · Tool use crash the model). I gave up on the Nemotron models 😞
You have to use my PRs: one was merged, the other one:
Then uninstall flashinfer-cubin and install flashinfer-python from main to get the best performance.
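For anyone following along, a minimal sketch of that last step in a pip-managed environment; the repo URL and branch are assumptions on my part, and building FlashInfer from source may need extra build dependencies on your setup:
# drop the prebuilt cubin package
pip uninstall -y flashinfer-cubin
# install flashinfer-python from the main branch (repo URL/branch assumed)
pip install "flashinfer-python @ git+https://github.com/flashinfer-ai/flashinfer.git@main"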
Honestly, no idea… Most vLLM-related Xid errors people see are Xid 48 (double-bit ECC) or Xid 63 (row remapping failure); those are straightforward VRAM cell failures. Xid 31 is different because it’s a page-table / address-translation fault (so it looks like pressure on the page table walker).
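If you want to catch the next one as it happens, the Xid reports all land in the kernel log, so leaving something like this running in another terminal during the benchmark will pick them up:
# follow the kernel log and print any NVRM Xid reports as they appear
sudo dmesg -w | grep -i xid
# and, to rule out the plain VRAM failures, check the ECC status (may be unsupported on some boards)
nvidia-smi -q -d ECC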
Running this now to reproduce it.
Didn’t crash during this workload, and I’ve never seen my GB10 pull 170 watts from the wall until today:
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 1.82 ms
Running test: pp=2048, tg=32, depth=0, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16384, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32078, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=65535, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=100000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=200000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------------------------------------------|-----------------:|-----------------:|-------------:|------------------:|------------------:|------------------:|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 | 1798.45 ± 491.34 | | 1256.41 ± 424.09 | 1254.58 ± 424.09 | 1256.44 ± 424.09 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 | 14.32 ± 0.05 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d4096 | 1589.24 ± 818.61 | | 6636.70 ± 5375.44 | 6634.87 ± 5375.44 | 6636.72 ± 5375.44 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d4096 | 14.42 ± 0.02 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 | 2134.77 ± 5.50 | | 8636.08 ± 22.27 | 8634.26 ± 22.27 | 8636.11 ± 22.27 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d16384 | 14.37 ± 0.06 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d32078 | 2044.96 ± 29.19 | | 16693.15 ± 240.65 | 16691.33 ± 240.65 | 16693.18 ± 240.65 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d32078 | 14.25 ± 0.03 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d65535 | 1952.16 ± 1.03 | | 34621.45 ± 18.21 | 34619.63 ± 18.21 | 34621.48 ± 18.21 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d65535 | 14.22 ± 0.08 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d100000 | 1829.69 ± 1.46 | | 55775.20 ± 44.50 | 55773.37 ± 44.50 | 55775.23 ± 44.50 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d100000 | 14.11 ± 0.05 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d200000 | 1559.48 ± 0.68 | | 129563.26 ± 56.15 | 129561.44 ± 56.15 | 129563.29 ± 56.15 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d200000 | 14.08 ± 0.10 | 15.33 ± 0.47 | | | |
llama-benchy (0.3.5)
date: 2026-03-28 18:39:10 | latency mode: api
Going to power it off and try again on the GPQA pass.
I’ve built with your PR applied:
2026-03-28T16:22:47.700847Z 01E 2026-03-28 09:22:47,700 - INFO - #15 [vllm-builder 5/7] RUN curl -fsL https://patch-diff.githubusercontent.com/raw/vllm-project/vllm/pull/38423.diff -o pr38423.diff && if git apply --reverse --check pr38423.diff 2>/dev/null; then echo "Patch already applied, skipping."; else echo "Applying patch..."; git apply -v pr38423.diff; fi && rm pr38423.diff
2026-03-28T16:22:48.099330Z 01E 2026-03-28 09:22:48,099 - INFO - #15 0.549 Applying patch...
2026-03-28T16:22:48.312793Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.549 Checking patch CMakeLists.txt...
2026-03-28T16:22:48.312800Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.549 Checking patch csrc/quantization/fp4/nvfp4_quant_entry.cu...
2026-03-28T16:22:48.312804Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.549 Checking patch csrc/quantization/fp4/nvfp4_scaled_mm_entry.cu...
2026-03-28T16:22:48.312810Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.549 Checking patch csrc/quantization/machete/machete_mainloop.cuh...
2026-03-28T16:22:48.312827Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.550 Checking patch docker/Dockerfile...
2026-03-28T16:22:48.312839Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.550 Checking patch docker/Dockerfile.nightly_torch...
2026-03-28T16:22:48.312850Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.550 Checking patch docker/versions.json...
2026-03-28T16:22:48.312862Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.550 Checking patch requirements/cuda.txt...
2026-03-28T16:22:48.312875Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.550 Checking patch vllm/model_executor/layers/quantization/utils/nvfp4_utils.py...
2026-03-28T16:22:48.312890Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch CMakeLists.txt cleanly.
2026-03-28T16:22:48.312899Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch csrc/quantization/fp4/nvfp4_quant_entry.cu cleanly.
2026-03-28T16:22:48.312911Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch csrc/quantization/fp4/nvfp4_scaled_mm_entry.cu cleanly.
2026-03-28T16:22:48.312923Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch csrc/quantization/machete/machete_mainloop.cuh cleanly.
2026-03-28T16:22:48.312936Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch docker/Dockerfile cleanly.
2026-03-28T16:22:48.312948Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch docker/Dockerfile.nightly_torch cleanly.
2026-03-28T16:22:48.312967Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch docker/versions.json cleanly.
2026-03-28T16:22:48.312998Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch requirements/cuda.txt cleanly.
2026-03-28T16:22:48.313010Z 01E 2026-03-28 09:22:48,312 - INFO - #15 0.593 Applied patch vllm/model_executor/layers/quantization/utils/nvfp4_utils.py cleanly.
2026-03-28T16:22:48.313022Z 01E 2026-03-28 09:22:48,313 - INFO - #15 DONE 0.6s
What performance do you get from this model?
Please use llama-benchy to benchmark; vLLM’s own logs do not represent reality as seen from the client side.
I’m also looking forward to some kind of comparison. This topic has reached the top in terms of message count, but it’s still not clear where the increase in speed and quality is. :)
From what I’ve seen, the biggest increase you should expect from marlin → cutlass is in prefill or at high batch/concurrency. If you’re waiting for a low-batch (fewer than 16 concurrent) or single-user decode bump, that’s going to have to come from KV cache quant under fp8, currently.
There are absolutely software performance increases to be had, but at least with the CUTLASS MMA instructions and the hardware available to the Spark, I don’t see it coming just from this.
Edit: note - I would gladly be wrong though
