We are pleased to announce several improvements since the initial Nemotron 3 Super launch on Mar 11, 2026.
What you need to know:
- force_nonempty_content now works in streaming mode. Note that the content field is non-empty only in the final response from the server, where it duplicates all the content from the reasoning field (TensorRT-LLM: details).
- Fixed tool-calling support when using the qwen3coder tool parser in vLLM and TRT-LLM. With the fix, anyOf-typed arguments are returned as an object instead of a string (vLLM: details; TensorRT-LLM: details).
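For consumers of the streaming API, the first fix means the content field stays empty until the very last chunk, which then mirrors everything already streamed as reasoning. A minimal sketch of a stream consumer under that assumption (the chunk dicts and field names below are simplified stand-ins, not the exact wire format):

```python
# Sketch: accumulate streamed reasoning; `content` is empty in every chunk
# except the final one, where it duplicates the full reasoning text.
def consume_stream(chunks):
    reasoning_parts = []
    final_content = ""
    for chunk in chunks:
        delta = chunk.get("delta", {})
        if delta.get("reasoning"):
            reasoning_parts.append(delta["reasoning"])
        if delta.get("content"):  # non-empty only in the last chunk
            final_content = delta["content"]
    return "".join(reasoning_parts), final_content

# Mock chunks illustrating the force_nonempty_content behavior:
chunks = [
    {"delta": {"reasoning": "step 1; ", "content": ""}},
    {"delta": {"reasoning": "step 2.", "content": ""}},
    # final chunk: content duplicates everything streamed as reasoning
    {"delta": {"reasoning": "", "content": "step 1; step 2."}},
]
reasoning, content = consume_stream(chunks)
assert content == reasoning  # content mirrors the reasoning text
```

In this model, a client that only wants the answer can read the last chunk's content, while a client rendering live reasoning can ignore it.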
The fixes are merged into releases:
NVIDIA NIM: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/nemotron-3-super-120b-a12b/tags?version=latest
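On the tool-calling fix: one reading is that clients running against a mix of fixed and unfixed servers may see the anyOf-typed value arrive either as a JSON-encoded string (pre-fix) or as an object (post-fix). A defensive decoder covering both cases could look like this (the function name is illustrative, not a vLLM or TRT-LLM API):

```python
import json

def decode_tool_arguments(raw):
    """Return tool-call arguments as a dict regardless of server version."""
    if isinstance(raw, str):
        return json.loads(raw)  # pre-fix servers: stringified JSON
    return raw                  # fixed servers: already an object

# Both forms normalize to the same dict:
assert decode_tool_arguments('{"path": "/tmp/x"}') == {"path": "/tmp/x"}
assert decode_tool_arguments({"path": "/tmp/x"}) == {"path": "/tmp/x"}
```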
Thanks for the update. The tool-calling fix in vLLM 0.18.0 is exactly what we need, but unfortunately 0.18.0+ introduces a regression on DGX Spark (SM12.1) with NVFP4 via Marlin:
'MergedColumnParallelLinear' object has no attribute 'workspace'
We're currently pinned back to 0.17.2rc1 to keep Nemotron Super running. Is there a timeline for when the SM12.1 NVFP4 Marlin path will be stable on 0.18.x? Or is TRT-LLM the recommended path for DGX Spark users who need working tool calls today?
@digiegg Thank you so much for testing it out!
Could you please share more details on your setup: which image you've used, which command, etc.?
I've tried running the Marlin path in vLLM 0.18.0 and did not encounter any errors, so I would like to understand where the difference between our setups lies.
Reply to askliar:
Setup details:
- Hardware: NVIDIA DGX Spark GB10 (SM12.1, aarch64, 128GB unified memory)
- Image: custom build via eugr/spark-vllm-docker, vLLM 0.17.2rc1 (commit 9c7cab5eb)
- Model: Nemotron 3 Super 120B NVFP4
- Key flags: --attention-backend TRITON_ATTN, --moe-backend cutlass, --load-format safetensors, --kv-cache-dtype fp8, --gpu-memory-utilization 0.7, --max-model-len 262144
- Error on 0.18.x: 'MergedColumnParallelLinear' object has no attribute 'workspace'; occurs at model load, before serving begins
We ultimately sidestepped the issue by switching to Qwen3.5-35B-A3B-FP8 as our primary model, which doesn't hit this path. We also found saifgithub/vllm-gb10-sm121 on GitHub (a vLLM FP8 fix for NVIDIA GB10 / SM12.1 on DGX Spark), which identifies the root cause as enable_sm120_only vs enable_sm120_family in the CUTLASS FP8 kernel, a one-line fix in two files. Not sure if that patch has been considered for upstream merge.
@digiegg Could you please share the entire vllm serve command?
Otherwise, it's hard to tell what exactly is failing. One observation already: based on the flags you've provided, you are not using Marlin but rather CUTLASS + Triton.
But please share the entire command and I'll test it!
Thanks for following up! Here are both configs, working and failing, to help isolate the difference.
Hardware: DGX Spark GB10 (SM12.1, aarch64, 128GB unified memory)
Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
WORKING: vLLM 0.17.2rc1 (commit 9c7cab5eb), Marlin backend:
ENV:
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
Command:
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
--served-model-name nemotron-super
--host 0.0.0.0 --port 8000
--attention-backend TRITON_ATTN
--moe-backend marlin
--load-format safetensors
--kv-cache-dtype fp8
--gpu-memory-utilization 0.7
--max-model-len 262144
--mamba_ssm_cache_dtype float32
--reasoning-parser nemotron_v3
--tool-call-parser qwen3_coder
--tensor-parallel-size 1
--distributed-executor-backend ray
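Assuming the working command above is serving on localhost:8000 with the served model name from its flags, a quick stdlib-only smoke test could be built like this (nothing here is Nemotron-specific; the endpoint and model name are taken from the flags above):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "nemotron-super") -> urllib.request.Request:
    """Build a /v1/chat/completions request for the local vllm serve instance."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Say hello.")
# With the server running, uncomment to send the request:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```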
FAILING: vLLM 0.18.1rc1 (commit 290809456, eugr spark-vllm-docker build, CUDA 13.2), CUTLASS backend:
ENV:
VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
Command:
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
--served-model-name nemotron-super
--host 0.0.0.0 --port 8000
--attention-backend TRITON_ATTN
--moe-backend cutlass
--load-format safetensors
--kv-cache-dtype fp8
--gpu-memory-utilization 0.80
--max-model-len 262144
--mamba_ssm_cache_dtype float32
--reasoning-parser nemotron_v3
--tool-call-parser qwen3_coder
--tensor-parallel-size 1
--distributed-executor-backend ray
Error at model load (before serving begins):
'MergedColumnParallelLinear' object has no attribute 'workspace'
Note on the saifgithub patch: we found saifgithub/vllm-gb10-sm121 on GitHub (a vLLM FP8 fix for NVIDIA GB10 / SM12.1 on DGX Spark), which identifies the root cause as enable_sm120_only vs enable_sm120_family in the CUTLASS FP8 kernel, a one-line fix in two files. Not sure if that's been considered for upstream merge, but wanted to flag it in case it's useful context.
Is there anything I have to set to use this model with agentic frameworks like Hermes or Agent-Zero?
I cannot get it to work.
Hermes and Agent-Zero both end up in a loop.
E.g., A0 says something like "you have sent the message again. Have to do something else..." and stays in the loop.
EDIT: I am running it on DGX Spark with the updated vLLM Docker instructions from HF.