CUDA illegal memory access with MTP speculative decoding on Nemotron-3-Super-120B-NVFP4 (vLLM cu130-nightly, single DGX Spark GB10)

Hi all — running the official NVIDIA Nemotron-3-Super Spark Deployment Guide and hitting a hard blocker. MTP speculative decoding crashes at runtime with CUDA error: an
illegal memory access was encountered. Hoping someone has seen this / knows a workaround before I open an upstream issue.

Environment

  • Hardware: Single DGX Spark (GB10, 121 GB unified, SM121, ARM aarch64)
  • OS: Ubuntu (native)
  • Docker image: vllm/vllm-openai:cu130-nightly (digest 473de04c1538…, tag cu130-nightly-b075604da10a9e8ff23d23f63d5113d43f0e4208)
  • vLLM version reported at startup: 0.19.1rc1.dev257+gb075604da
  • Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (HF main, snapshot 0d6fa3ecad…)
  • Reasoning parser: super_v3_reasoning_parser.py from the official HF repo, mounted at /app/

Config (following NVIDIA official Spark Deployment Guide)

vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
–quantization nvfp4
–kv-cache-dtype fp8
–gpu-memory-utilization 0.75
–max-model-len 32768
–max-num-seqs 4
–moe-backend marlin
–attention-backend TRITON_ATTN
–enable-chunked-prefill
–enable-prefix-caching
–enforce-eager
–enable-auto-tool-choice --tool-call-parser hermes
–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:3,“moe_backend”:“triton”}’
–reasoning-parser-plugin /app/super_v3_reasoning_parser.py
–reasoning-parser super_v3
–served-model-name nemotron-120b
–trust-remote-code --host 0.0.0.0 --port 8000

Env:
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

What works

  • Model loads cleanly (17/17 weight shards, draft layer loads, KV cache init OK)
  • MarlinNvFp4LinearKernel selected for main model
  • TRITON Unquantized MoE backend selected for the MTP draft (the inner speculative_config.moe_backend=triton override is respected — good!)
  • /health returns 200
  • Startup logs clean, no warnings beyond the usual Mamba prefix-cache “experimental” note

What breaks — first request triggers CUDA illegal memory access

First POST /v1/chat/completions causes the engine to die:

(EngineCore pid=175) ERROR 04-14 12:39:46 [core.py:1112]
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress’ in CUDA docs for more information.
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.

The API server then logs EngineDeadError and every subsequent request returns 500.

What I already tried

  1. Without MTP (same image, same model, just remove --speculative-config and reasoning parser): works fine, ~15 tok/s — this is my current stable baseline.
  2. VLLM_TEST_FORCE_FP8_MARLIN=1 on/off: no effect on the CUDA crash.
  3. --enforce-eager: already set. Removing it surfaces an earlier init-time error:
    AttributeError: ‘MergedColumnParallelLinear’ object has no attribute ‘workspace’
  4. at fp8_linear.apply_weights(…) → workspace=layer.workspace (marlin FP8 kernel, modelopt_mixed quantization path). Adding --enforce-eager bypasses this specific path but
    then the runtime CUDA error shows up on the first actual request.
  5. --moe-backend triton (main model, unified backend): rejected — moe_backend=‘triton’ is not supported for NvFP4 MoE.
  6. --moe-backend flashinfer_trtllm: rejected — FLASHINFER_TRTLLM does not support the deployment configuration since kernel does not support current device cuda (expected on
    GB10 per the PSA threads).
  7. v0.19.0-cu130-ubuntu2404 (without MTP): same MergedColumnParallelLinear has no workspace at init, so this bug exists on the stable 0.19 release too, not just nightly.
  8. v0.17.1-cu130 (stable, no MTP support): this is my working baseline. Adding --speculative-config fails with Unexpected keyword argument ‘moe_backend’ inside
    SpeculativeConfig (as expected — the inner moe_backend override landed later).

Questions

  1. Has anyone successfully run Nemotron-3-Super-120B-A12B-NVFP4 with MTP on a single GB10? If yes, exact image tag + flags + env would help a lot.
  2. Is the MergedColumnParallelLinear.workspace missing attribute in modelopt_mixed + marlin FP8 linear kernel a known issue? I can file a vLLM GitHub issue if not — just want
    to avoid duplicates.
  3. Is the runtime illegal memory access likely the same root cause as #2 (workspace not allocated leading to OOB indexing), or a separate problem in the triton unquantized MoE
    path for the draft layer? TurboQuant thread (link below) mentions _next_pow2() padding for Mamba/GDN layers, wondering if the draft MTP path needs similar handling.
  4. If MTP is currently broken on GB10 for this model, is there a known-good alternative for getting above the ~15 tok/s single-stream baseline on a single Spark? I saw
    Qwen3.5-122B-A10B-NVFP4 hitting 38–51 tok/s with MTP elsewhere in this forum — is the situation purely “Qwen MTP works, Nemotron MTP doesn’t” right now?

Related threads I’ve already read

  • TurboQuant integration (bjk110): has Nemotron-H c=1 around 15 tok/s, no MTP
  • Nemotron-3-Super NVFP4 via vLLM TP=2 on 2x Spark (24 tok/s): also no MTP
  • Qwen3.5-122B-A10B on single Spark up to 51 tok/s: uses MTP successfully on Qwen
  • Official NVIDIA Spark Deployment Guide for Nemotron-3-Super (the one I’m following)

Happy to run any additional diagnostic flags (CUDA_LAUNCH_BLOCKING=1, TORCH_USE_CUDA_DSA, TORCHDYNAMO_VERBOSE=1, TORCH_LOGS=“+dynamo”) and post results. Full docker logs
available on request.

Thanks!
@bjk110 @Albond

GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks · GitHub will get you where you want to be.

See PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM - #205 by eugr

After spark-vllm-docker is installed and built, this should work unless something broke in a recent build:

./run-recipe.sh nemotron-3-super-nvfp4-flashinfer --solo --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'

Using Eugr’s spark-vllm-docker will work as a resolution, and additionally, the latest VLLM has specific fixes for this issue.