Nemotron 3 Super Improvements and Fixes

We are pleased to announce several improvements since the initial Nemotron 3 Super launch on Mar 11, 2026.

What you need to know:

  • force_nonempty_content now works in streaming mode. Note that the content field is populated only in the final response from the server, where it duplicates the full content of the reasoning field (TensorRT-LLM: details); see the client-side sketch after this list.

  • Fixed tool-calling support when using the qwen3coder tool parser in vLLM and TRT-LLM. With the fix, arguments declared with an anyOf schema are returned as an object instead of a string (vLLM: details; TensorRT-LLM: details); see the second sketch after this list.
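
For reference, here is a minimal client-side sketch of the streaming behavior described in the first bullet, written against an OpenAI-compatible endpoint. The base URL, served model name, the extra_body pass-through for force_nonempty_content, and the reasoning_content delta field are assumptions made for illustration, not details confirmed by this announcement.

# Minimal sketch: force_nonempty_content in streaming mode.
# Assumptions: an OpenAI-compatible server at localhost:8000 serving the model
# as "nemotron-super"; the flag is forwarded via extra_body; reasoning tokens
# arrive in delta.reasoning_content. Adjust to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="nemotron-super",
    messages=[{"role": "user", "content": "Briefly explain KV-cache quantization."}],
    stream=True,
    extra_body={"force_nonempty_content": True},  # assumed pass-through name
)

reasoning_parts, content_parts = [], []
for chunk in stream:
    delta = chunk.choices[0].delta
    # Reasoning streams incrementally; content stays empty until the final
    # chunk, which then carries the full reasoning text duplicated into it.
    if getattr(delta, "reasoning_content", None):
        reasoning_parts.append(delta.reasoning_content)
    if delta.content:
        content_parts.append(delta.content)

print("streamed reasoning chars:", len("".join(reasoning_parts)))
print("final content chars:", len("".join(content_parts)))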

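To make the anyOf fix concrete, here is a similar hedged sketch. The set_threshold tool, its schema, and the endpoint details are hypothetical; the point is only that, with the fix, an argument described by an anyOf schema comes back as a parsed object rather than as a string of escaped JSON.

# Minimal sketch: tool calling with an anyOf parameter schema.
# The tool below is made up for illustration; only the anyOf structure matters.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "set_threshold",  # hypothetical tool
        "description": "Set an alert threshold.",
        "parameters": {
            "type": "object",
            "properties": {
                "threshold": {
                    # anyOf: either a bare number or a structured object
                    "anyOf": [
                        {"type": "number"},
                        {"type": "object",
                         "properties": {"value": {"type": "number"},
                                        "unit": {"type": "string"}}},
                    ]
                }
            },
            "required": ["threshold"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nemotron-super",
    messages=[{"role": "user", "content": "Set the alert threshold to 5 milliseconds."}],
    tools=tools,
)

call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
# With the fix, the anyOf argument parses as a real object (dict), not a string.
print(type(args["threshold"]), args["threshold"])
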
The fixes have been merged into the following releases:

NVIDIA NIM: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/nemotron-3-super-120b-a12b/tags?version=latest

Thanks for the update. The tool calling fix in vLLM 0.18.0 is exactly what we need, but unfortunately 0.18.0+ introduces a regression on DGX Spark (SM12.1) with NVFP4 via Marlin:

'MergedColumnParallelLinear' object has no attribute 'workspace'

We’re currently pinned back to 0.17.2rc1 to keep Nemotron Super running. Is there a timeline for when the SM12.1 NVFP4 Marlin path will be stable on 0.18.x? Or is TRT-LLM the recommended path for DGX Spark users who need working tool calls today?

@digiegg Thank you so much for testing it out!
Could you please share more details on your setup: which image you used, which command, etc.?

I've tried running the Marlin path in vLLM 0.18.0 and did not encounter any errors, so I would like to understand where the difference between our setups lies.

Reply to askliar:
Setup details:
∙ Hardware: NVIDIA DGX Spark GB10 (SM12.1, aarch64, 128GB unified memory)
∙ Image: custom build via eugr/spark-vllm-docker, vLLM 0.17.2rc1 (commit 9c7cab5eb)
∙ Model: Nemotron 3 Super 120B NVFP4
∙ Key flags: --attention-backend TRITON_ATTN, --moe-backend cutlass, --load-format safetensors, --kv-cache-dtype fp8, gpu_memory_utilization 0.7, max_model_len 262144
∙ Error on 0.18.x: 'MergedColumnParallelLinear' object has no attribute 'workspace' (occurs at model load, before serving begins)
We ultimately sidestepped the issue by switching to Qwen3.5-35B-A3B-FP8 as our primary model, which doesn't hit this path. We also found saifgithub/vllm-gb10-sm121 on GitHub (a vLLM FP8 fix for NVIDIA GB10 / SM12.1 on DGX Spark), which identifies the root cause as enable_sm120_only vs enable_sm120_family in the CUTLASS FP8 kernel, a one-line fix in two files. Not sure if that patch has been considered for upstream merge.

@digiegg Could you please share the entire vllm serve command?
Otherwise, it's hard to tell what exactly is failing. One observation already: based on the flags you've provided, you are not using Marlin but rather CUTLASS + Triton.

But please share the entire command and I'll test it!

Thanks for following up! Here are both configs — working and failing — to help isolate the difference.

Hardware: DGX Spark GB10 (SM12.1, aarch64, 128GB unified memory)
Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4


✅ WORKING — vLLM 0.17.2rc1 (commit 9c7cab5eb), Marlin backend:

ENV:
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

Command:
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nemotron-super \
  --host 0.0.0.0 --port 8000 \
  --attention-backend TRITON_ATTN \
  --moe-backend marlin \
  --load-format safetensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 262144 \
  --mamba_ssm_cache_dtype float32 \
  --reasoning-parser nemotron_v3 \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 1 \
  --distributed-executor-backend ray


❌ FAILING — vLLM 0.18.1rc1 (commit 290809456, eugr spark-vllm-docker build, CUDA 13.2), CUTLASS backend:

ENV:
VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

Command:
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nemotron-super \
  --host 0.0.0.0 --port 8000 \
  --attention-backend TRITON_ATTN \
  --moe-backend cutlass \
  --load-format safetensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.80 \
  --max-model-len 262144 \
  --mamba_ssm_cache_dtype float32 \
  --reasoning-parser nemotron_v3 \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 1 \
  --distributed-executor-backend ray

Error at model load (before serving begins):
'MergedColumnParallelLinear' object has no attribute 'workspace'


Note on the saifgithub patch: we found saifgithub/vllm-gb10-sm121 on GitHub (a vLLM FP8 fix for NVIDIA GB10 / SM12.1 on DGX Spark), which identifies the root cause as enable_sm120_only vs enable_sm120_family in the CUTLASS FP8 kernel, a one-line fix in two files. Not sure if that has been considered for upstream merge, but wanted to flag it in case it's useful context.

Is there anything I have to set to use this model with agentic frameworks like Hermes or Agent-Zero?
I cannot get it to work.
Hermes and Agent-Zero both end in a loop.
E.g., A0 says something like "you have sent the message again. Have to do something else.." and stays in the loop.

EDIT: I am running it on a DGX Spark with the updated vLLM Docker instructions from HF.