CUDA illegal memory access with MTP speculative decoding on Nemotron-3-Super-120B-NVFP4 (vLLM cu130-nightly, single DGX Spark GB10)

wooogamer · April 15, 2026, 7:15am

Hi all — running the official NVIDIA Nemotron-3-Super Spark Deployment Guide and hitting a hard blocker. MTP speculative decoding crashes at runtime with CUDA error: an
illegal memory access was encountered. Hoping someone has seen this / knows a workaround before I open an upstream issue.

Environment

Hardware: Single DGX Spark (GB10, 121 GB unified, SM121, ARM aarch64)
OS: Ubuntu (native)
Docker image: vllm/vllm-openai:cu130-nightly (digest 473de04c1538…, tag cu130-nightly-b075604da10a9e8ff23d23f63d5113d43f0e4208)
vLLM version reported at startup: 0.19.1rc1.dev257+gb075604da
Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (HF main, snapshot 0d6fa3ecad…)
Reasoning parser: super_v3_reasoning_parser.py from the official HF repo, mounted at /app/

Config (following NVIDIA official Spark Deployment Guide)

vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
–quantization nvfp4
–kv-cache-dtype fp8
–gpu-memory-utilization 0.75
–max-model-len 32768
–max-num-seqs 4
–moe-backend marlin
–attention-backend TRITON_ATTN
–enable-chunked-prefill
–enable-prefix-caching
–enforce-eager
–enable-auto-tool-choice --tool-call-parser hermes
–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:3,“moe_backend”:“triton”}’
–reasoning-parser-plugin /app/super_v3_reasoning_parser.py
–reasoning-parser super_v3
–served-model-name nemotron-120b
–trust-remote-code --host 0.0.0.0 --port 8000

Env:
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

What works

Model loads cleanly (17/17 weight shards, draft layer loads, KV cache init OK)
MarlinNvFp4LinearKernel selected for main model
TRITON Unquantized MoE backend selected for the MTP draft (the inner speculative_config.moe_backend=triton override is respected — good!)
/health returns 200
Startup logs clean, no warnings beyond the usual Mamba prefix-cache “experimental” note

What breaks — first request triggers CUDA illegal memory access

First POST /v1/chat/completions causes the engine to die:

(EngineCore pid=175) ERROR 04-14 12:39:46 [core.py:1112]
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress’ in CUDA docs for more information.
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.

The API server then logs EngineDeadError and every subsequent request returns 500.

What I already tried

Without MTP (same image, same model, just remove --speculative-config and reasoning parser): works fine, ~15 tok/s — this is my current stable baseline.
VLLM_TEST_FORCE_FP8_MARLIN=1 on/off: no effect on the CUDA crash.
--enforce-eager: already set. Removing it surfaces an earlier init-time error:
AttributeError: ‘MergedColumnParallelLinear’ object has no attribute ‘workspace’
at fp8_linear.apply_weights(…) → workspace=layer.workspace (marlin FP8 kernel, modelopt_mixed quantization path). Adding --enforce-eager bypasses this specific path but
then the runtime CUDA error shows up on the first actual request.
--moe-backend triton (main model, unified backend): rejected — moe_backend=‘triton’ is not supported for NvFP4 MoE.
--moe-backend flashinfer_trtllm: rejected — FLASHINFER_TRTLLM does not support the deployment configuration since kernel does not support current device cuda (expected on
GB10 per the PSA threads).
v0.19.0-cu130-ubuntu2404 (without MTP): same MergedColumnParallelLinear has no workspace at init, so this bug exists on the stable 0.19 release too, not just nightly.
v0.17.1-cu130 (stable, no MTP support): this is my working baseline. Adding --speculative-config fails with Unexpected keyword argument ‘moe_backend’ inside
SpeculativeConfig (as expected — the inner moe_backend override landed later).

Questions

Has anyone successfully run Nemotron-3-Super-120B-A12B-NVFP4 with MTP on a single GB10? If yes, exact image tag + flags + env would help a lot.
Is the MergedColumnParallelLinear.workspace missing attribute in modelopt_mixed + marlin FP8 linear kernel a known issue? I can file a vLLM GitHub issue if not — just want
to avoid duplicates.
Is the runtime illegal memory access likely the same root cause as #2 (workspace not allocated leading to OOB indexing), or a separate problem in the triton unquantized MoE
path for the draft layer? TurboQuant thread (link below) mentions _next_pow2() padding for Mamba/GDN layers, wondering if the draft MTP path needs similar handling.
If MTP is currently broken on GB10 for this model, is there a known-good alternative for getting above the ~15 tok/s single-stream baseline on a single Spark? I saw
Qwen3.5-122B-A10B-NVFP4 hitting 38–51 tok/s with MTP elsewhere in this forum — is the situation purely “Qwen MTP works, Nemotron MTP doesn’t” right now?

Related threads I’ve already read

TurboQuant integration (bjk110): has Nemotron-H c=1 around 15 tok/s, no MTP
Nemotron-3-Super NVFP4 via vLLM TP=2 on 2x Spark (24 tok/s): also no MTP
Qwen3.5-122B-A10B on single Spark up to 51 tok/s: uses MTP successfully on Qwen
Official NVIDIA Spark Deployment Guide for Nemotron-3-Super (the one I’m following)

Happy to run any additional diagnostic flags (CUDA_LAUNCH_BLOCKING=1, TORCH_USE_CUDA_DSA, TORCHDYNAMO_VERBOSE=1, TORCH_LOGS=“+dynamo”) and post results. Full docker logs
available on request.

Thanks!
@bjk110 @Albond

chibri · April 15, 2026, 2:35pm

GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks · GitHub will get you where you want to be.

See PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM - #205 by eugr

After spark-vllm-docker is installed and built, this should work unless something broke in a recent build:

./run-recipe.sh nemotron-3-super-nvfp4-flashinfer --solo --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'

Neill · June 4, 2026, 8:34pm

Using Eugr’s spark-vllm-docker will work as a resolution, and additionally, the latest VLLM has specific fixes for this issue.

Topic		Replies	Views
Help running Nemotron 3 Nano 30B-A3B-FP8 on DGX Spark (GB10) DGX Spark / GB10 spark , nim , nemotron	41	3584	January 24, 2026
Nemotron 3 Super Improvements and Fixes NVIDIA Nemotron nim , nemotron	6	706	April 13, 2026
NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 DGX Spark / GB10 nemotron	89	10802	March 31, 2026
Nemotron-3-Super-120B-A12B-NVFP4 + MTP on 4× DGX Spark via SGLang (TP=4, RoCE) - MTP actually pays off: 1.70× single-stream, accept-len ≈ 2.7 DGX Spark / GB10 Projects cudnn , nemotron	5	351	June 18, 2026
New nvcr.io/nvidia/vllm:26.03.post1-py3 loads Nemotron-3-Super-120B-A12B-NVFP4 DGX Spark / GB10 nemotron	0	314	April 17, 2026
Nemotron-3-Nano-30B-A3B-NVFP4 ultra-efficient NVFP4 precision version of Nemotron 3 Nano DGX Spark / GB10 jetson , nemotron	84	3573	March 20, 2026
DGX Spark, Nemotron3, and NVFP4: Getting to 65+ tps DGX Spark / GB10 spark , nemotron , dgx	14	2367	December 22, 2025
Nemotron-3-Super NVFP4 via vLLM TP=2 on 2x DGX Spark — 24 tok/s (ABI fix for cu130/cu132 mismatch) DGX Spark / GB10 Projects spark , nemotron	1	547	March 26, 2026
Nemotron-3-Super-120B-A12B-NVFP4 on single DGX Spark: 23.45 tok/s (spark-arena.com/ benhmarks) DGX Spark / GB10 cuda , benchmarks , spark , llm , nemotron , dgx , nemoclaw	6	1180	May 26, 2026
Nemotron-3-Super-120B-A12B-NVFP4 with vllm v0.22.0 on 1x Acer GB10 with 495.71.05 driver (Container is Ubuntu 24.04, CUDA 13.2.1, gxx11) DGX Spark / GB10 cuda , containers , nemotron	8	785	June 1, 2026

CUDA illegal memory access with MTP speculative decoding on Nemotron-3-Super-120B-NVFP4 (vLLM cu130-nightly, single DGX Spark GB10)

Related topics