Nemotron 3 Super Improvements and Fixes

We are pleased to announce several improvements since the initial Nemotron 3 Super launch on Mar 11, 2026.

What you need to know:

  • force_nonempty_content now works in streaming mode. Note that the content field is populated only in the final response from the server, where it duplicates the full content of the reasoning field (TensorRT-LLM: details); see the client-side sketch after this list.

  • Fixed tool-calling support when using the qwen3coder tool parser in vLLM and TRT-LLM. With the fix, arguments declared with an anyOf schema are returned as an object instead of a string (vLLM: details; TensorRT-LLM: details); see the second sketch after this list.
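
For reference, here is a minimal client-side sketch of the streaming behavior described in the first bullet, written against an OpenAI-compatible endpoint. The base URL, served model name, the extra_body pass-through for force_nonempty_content, and the reasoning_content delta field are assumptions made for illustration, not details confirmed by this announcement.

# Minimal sketch: force_nonempty_content in streaming mode.
# Assumptions: an OpenAI-compatible server at localhost:8000 serving the model
# as "nemotron-super"; the flag is forwarded via extra_body; reasoning tokens
# arrive in delta.reasoning_content. Adjust to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="nemotron-super",
    messages=[{"role": "user", "content": "Briefly explain KV-cache quantization."}],
    stream=True,
    extra_body={"force_nonempty_content": True},  # assumed pass-through name
)

reasoning_parts, content_parts = [], []
for chunk in stream:
    delta = chunk.choices[0].delta
    # Reasoning streams incrementally; content stays empty until the final
    # chunk, which then carries the full reasoning text duplicated into it.
    if getattr(delta, "reasoning_content", None):
        reasoning_parts.append(delta.reasoning_content)
    if delta.content:
        content_parts.append(delta.content)

print("streamed reasoning chars:", len("".join(reasoning_parts)))
print("final content chars:", len("".join(content_parts)))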

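To make the anyOf fix concrete, here is a similar hedged sketch. The set_threshold tool, its schema, and the endpoint details are hypothetical; the point is only that, with the fix, an argument described by an anyOf schema comes back as a parsed object rather than as a string of escaped JSON.

# Minimal sketch: tool calling with an anyOf parameter schema.
# The tool below is made up for illustration; only the anyOf structure matters.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "set_threshold",  # hypothetical tool
        "description": "Set an alert threshold.",
        "parameters": {
            "type": "object",
            "properties": {
                "threshold": {
                    # anyOf: either a bare number or a structured object
                    "anyOf": [
                        {"type": "number"},
                        {"type": "object",
                         "properties": {"value": {"type": "number"},
                                        "unit": {"type": "string"}}},
                    ]
                }
            },
            "required": ["threshold"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nemotron-super",
    messages=[{"role": "user", "content": "Set the alert threshold to 5 milliseconds."}],
    tools=tools,
)

call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
# With the fix, the anyOf argument parses as a real object (dict), not a string.
print(type(args["threshold"]), args["threshold"])
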
The fixes have been merged into the following releases:

NVIDIA NIM: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/nemotron-3-super-120b-a12b/tags?version=latest

Thanks for the update. The tool calling fix in vLLM 0.18.0 is exactly what we need, but unfortunately 0.18.0+ introduces a regression on DGX Spark (SM12.1) with NVFP4 via Marlin:

'MergedColumnParallelLinear' object has no attribute 'workspace'

We’re currently pinned back to 0.17.2rc1 to keep Nemotron Super running. Is there a timeline for when the SM12.1 NVFP4 Marlin path will be stable on 0.18.x? Or is TRT-LLM the recommended path for DGX Spark users who need working tool calls today?

@digiegg Thank you so much for testing it out!
Could you please share more details on your setup: which image you used, which command, etc.?

I've tried running the Marlin path in vLLM 0.18.0 and did not encounter any errors, so I would like to understand where the difference between our setups lies.

Reply to askliar:
Setup details:
∙ Hardware: NVIDIA DGX Spark GB10 (SM12.1, aarch64, 128GB unified memory)
∙ Image: custom build via eugr/spark-vllm-docker, vLLM 0.17.2rc1 (commit 9c7cab5eb)
∙ Model: Nemotron 3 Super 120B NVFP4
∙ Key flags: --attention-backend TRITON_ATTN, --moe-backend cutlass, --load-format safetensors, --kv-cache-dtype fp8, gpu_memory_utilization 0.7, max_model_len 262144
∙ Error on 0.18.x: 'MergedColumnParallelLinear' object has no attribute 'workspace' (occurs at model load, before serving begins)
We ultimately sidestepped the issue by switching to Qwen3.5-35B-A3B-FP8 as our primary model, which doesn't hit this path. We also found saifgithub/vllm-gb10-sm121 on GitHub (a vLLM FP8 fix for NVIDIA GB10 / SM12.1 on DGX Spark), which identifies the root cause as enable_sm120_only vs enable_sm120_family in the CUTLASS FP8 kernel, a one-line fix in two files. Not sure if that patch has been considered for upstream merge.

@digiegg Could you please share the entire vllm serve command?
Otherwise, it's hard to tell what exactly is failing. One observation already: based on the flags you've provided, you are not using Marlin but rather CUTLASS + Triton.

But please share the entire command and I'll test it!

Thanks for following up! Here are both configs — working and failing — to help isolate the difference.

Hardware: DGX Spark GB10 (SM12.1, aarch64, 128GB unified memory)
Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4


✅ WORKING — vLLM 0.17.2rc1 (commit 9c7cab5eb), Marlin backend:

ENV:
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

Command:
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nemotron-super \
  --host 0.0.0.0 --port 8000 \
  --attention-backend TRITON_ATTN \
  --moe-backend marlin \
  --load-format safetensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 262144 \
  --mamba_ssm_cache_dtype float32 \
  --reasoning-parser nemotron_v3 \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 1 \
  --distributed-executor-backend ray


❌ FAILING — vLLM 0.18.1rc1 (commit 290809456, eugr spark-vllm-docker build, CUDA 13.2), CUTLASS backend:

ENV:
VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

Command:
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nemotron-super \
  --host 0.0.0.0 --port 8000 \
  --attention-backend TRITON_ATTN \
  --moe-backend cutlass \
  --load-format safetensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.80 \
  --max-model-len 262144 \
  --mamba_ssm_cache_dtype float32 \
  --reasoning-parser nemotron_v3 \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 1 \
  --distributed-executor-backend ray

Error at model load (before serving begins):
'MergedColumnParallelLinear' object has no attribute 'workspace'


Note on the saifgithub patch: we found saifgithub/vllm-gb10-sm121 on GitHub (a vLLM FP8 fix for NVIDIA GB10 / SM12.1 on DGX Spark), which identifies the root cause as enable_sm120_only vs enable_sm120_family in the CUTLASS FP8 kernel, a one-line fix in two files. Not sure if that has been considered for upstream merge, but wanted to flag it in case it's useful context.

Is there anything I have to set to use this model with agentic frameworks like Hermes or Agent-Zero?
I cannot get it to work.
Hermes and Agent-Zero both end in a loop.
E.g., A0 says something like "you have sent the message again. Have to do something else.." and stays in the loop.

EDIT: I am running it on a DGX Spark with the updated vLLM Docker instructions from HF.