NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

Where are all those auto-moding AI agents when we need them? :)

2 Likes

I tried to run it with the vllm nightly docker image and the flags from the readme, but it spewed several errors after starting up, like:

(EngineCore_DP0 pid=98) 2026-03-11 19:27:07,630 - WARNING - autotuner.py:496 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_cutlass_fused_moe_module.<locals>.MoERunner object at 0xfcb2cd124470> 15, due to failure while profiling: [TensorRT-LLM][ERROR] Assertion failed: Failed to initialize cutlass TMA WS grouped gemm. Error: Error Internal (/workspace/build/aot/generated/cutlass_instantiations/120/gemm_grouped/120/cutlass_kernel_file_gemm_grouped_sm120_M256_BS_group0.generated.cu:60)
(EngineCore_DP0 pid=98) 1       0xfcb299455ed4 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 84
(EngineCore_DP0 pid=98) 2       0xfcb2999b3074 /usr/local/lib/python3.12/dist-packages/flashinfer_jit_cache/jit_cache/fused_moe_120/fused_moe_120.so(+0x6d3074) [0xfcb2999b3074]
(EngineCore_DP0 pid=98) 3       0xfcb2999b3294 void tensorrt_llm::kernels::cutlass_kernels_oss::tma_warp_specialized_generic_moe_gemm_kernelLauncher<cutlass::arch::Sm120, __nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, void, tensorrt_llm::cutlass_extensions::EpilogueOpDefault, (tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput::EpilogueFusion)3, cute::tuple<cute::C<256>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> >, false, false, false, false>(tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, int, int, CUstream_st*, int*, unsigned long*, cute::tuple<int, int, cute::C<1> >, cute::tuple<int, int, cute::C<1> >) + 84
(EngineCore_DP0 pid=98) 4       0xfcb2995b0820 void tensorrt_llm::kernels::cutlass_kernels_oss::dispatchMoeGemmSelectClusterShapeTmaWarpSpecialized<cutlass::arch::Sm120, __nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, tensorrt_llm::cutlass_extensions::EpilogueOpDefault, (tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput::EpilogueFusion)3, cute::tuple<cute::C<256>, cute::C<128>, cute::C<128> > >(tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, int, tensorrt_llm::cutlass_extensions::CutlassGemmConfig, int, CUstream_st*, int*, unsigned long*) + 192
(EngineCore_DP0 pid=98) 5       0xfcb2995b0fd4 void tensorrt_llm::kernels::cutlass_kernels_oss::dispatchMoeGemmSelectTileShapeTmaWarpSpecialized<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, tensorrt_llm::cutlass_extensions::EpilogueOpDefault, (tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput::EpilogueFusion)3>(tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, int, tensorrt_llm::cutlass_extensions::CutlassGemmConfig, int, CUstream_st*, int*, unsigned long*) + 1108

And then eventually:

torch.AcceleratorError: CUDA error: an illegal instruction was encountered  

Which is the same failure I had with the nano version.

I have done the mod on my local machine and have the recipe ready if you want me to PR it. With:

command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --port {port} --host {host} \
    --trust-remote-code \
    --tensor-parallel-size {tensor_parallel} \
    --kv-cache-dtype fp8 \
    --load-format fastsafetensors \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin super_v3_reasoning_parser.py \
    --reasoning-parser super_v3

EDIT: PR done:

1 Like

NVIDIA is becoming the Apple of software, improving and perfecting what others do: NemoClaw, Qwen3.5. Except for the hardware, where they are king outright.

Tested it via the NIM API; it doesn't seem to be as good at coding as Qwen 3.5 122B or OSS 120B.

1 Like

I will try it with my picoclaw locally on the Spark.

1 Like

Underwhelming.

It looks like they are emphasizing a dramatic improvement in speed while being nearly the same quality as Qwen 3.5 122b. But of course, as we have become well accustomed to, NVFP4 is an inference decelerator on the DGX Spark. So disappointing.

See that light green bar – that’s supposed to be the increase in throughput with NVFP4. Although it looks oddly like a middle finger to me.

5 Likes

Have you or anyone tried it with TensorRT-LLM using the official deployment guidance I linked above?

vLLM is clearly not optimized for this yet. The architecture is interesting in multiple ways but especially that expert-parallel is strongly preferred over tensor-parallel.
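For anyone who wants to experiment with that preference, a hypothetical invocation sketch (the `--enable-expert-parallel` flag exists in recent vLLM builds, but whether it helps on this model/hardware combination is untested here, and the TP size is just an example):

```shell
# Sketch: route the MoE layers through expert parallelism instead of
# sharding them with tensor parallelism. Attention/dense layers still
# follow --tensor-parallel-size; check `vllm serve --help` on your build.
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --trust-remote-code
```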

For a single Spark user, it is difficult to find clear advantages compared to Qwen3.5-122B.

Compared to the Qwen3.5-122B GPTQ version, the speed difference is not very significant. While the generation quality feels much better than GPT-OSS-120B, I don’t really notice a difference compared to Qwen3.5-122B-A10B-GPTQ-Int4.

It also does not support vision, so an additional model such as an OCR model must be served separately. Because of this, it does not seem very appealing for a single Spark user. If it had been released around the same time as Nemotron-3-Nano, it would probably have been very impressive, but right now it feels a bit late.

However, considering that it supports a 1M context window with performance similar to Qwen3.5-122B, this appears to be a very strong advantage for environments where two or more Spark users set a 1M context and work with large amounts of information.

That said, when dealing with very large contexts, it might still be a bit slow on a dual-Spark setup.

Yeah, llama.cpp benchmarks are looking a couple t/s slower than I was getting for Qwen 3.5 122B. I'll probably try it out later today just to experiment with the longer context window, but I was really hoping for more speed. Perhaps it will get better.

Confirmed. Running Super 120B NVFP4 on single Spark via vLLM — 16.7 tok/s, Marlin dequant (FP4→BF16), not native FP4 GEMM. Model fits (69.5 GiB), KV cache fits, tok/s is what Marlin gives you. That’s the real story.
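A rough roofline sanity check makes that number plausible. The figures below are my own assumptions, not from the measurement above: ~12B active parameters per decoded token (the A12B part), 0.5 bytes/param for NVFP4 weights, and ~273 GB/s unified memory bandwidth on the Spark; scales, KV cache, and activations are ignored.

```shell
# Back-of-envelope decode ceiling: decoding is bandwidth-bound on weight
# reads, so tokens/s is capped near bandwidth / bytes-read-per-token.
ACTIVE_PARAMS_B=12        # billions of active params per token (A12B)
BYTES_PER_PARAM_X10=5     # 0.5 bytes/param (FP4), in tenths for integer math
BW_GBS=273                # approx. DGX Spark UMA bandwidth, GB/s

GB_PER_TOKEN=$(( ACTIVE_PARAMS_B * BYTES_PER_PARAM_X10 / 10 ))
CEILING_TPS=$(( BW_GBS / GB_PER_TOKEN ))
echo "weight read per token: ~${GB_PER_TOKEN} GB"
echo "bandwidth-bound ceiling: ~${CEILING_TPS} tok/s"
```

Under those assumptions the ceiling comes out around 45 tok/s, so 16.7 tok/s through the Marlin dequant path is well below the memory roof, consistent with the dequant overhead being the limiter.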

Also tried TRT-LLM rc4 and rc5 today. Both dead on arrival — 120B hits the UMA ceiling during loading regardless of config. The rc4 "Nemotron fix" is for the 49B model, not 120B. Hard ceiling, not a tuning problem.

So: NVFP4 on Spark = memory compression format today, not a compute accelerator. Native FP4 GEMM on SM121 via CUTLASS 4.2+ is the actual unlock. Until then the chart is accurate in the way you described.
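Even as a pure memory format, the compression is what makes the model fit at all. A rough sketch under my own simplifying assumptions (120B total params, flat bytes/param, ignoring scale factors, embeddings, and runtime overhead):

```shell
# Why FP4 matters for capacity on a 128 GiB UMA system even when the
# GEMMs still run in BF16 after dequant: weights stay stored at 4 bits.
TOTAL_PARAMS_B=120
FP4_GB=$(( TOTAL_PARAMS_B / 2 ))    # 0.5 bytes/param
BF16_GB=$(( TOTAL_PARAMS_B * 2 ))   # 2 bytes/param
echo "FP4 weights:  ~${FP4_GB} GB (fits under 128 GiB UMA)"
echo "BF16 weights: ~${BF16_GB} GB (does not fit)"
```

That ~60 GB figure lines up with the 69.5 GiB observed footprint above once scales and non-quantized tensors are included.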

Hope someone can prove me wrong — soon :)

UPDATE: rc4 actually fails differently than rc5 — nemotron_h model type not registered, MPI worker crashes 16 seconds after start, before weight loading even begins. rc5 at least gets to the weight loading stage before OOMing. Two different failures. RC7 pulling now.

1 Like

Thanks for going through the effort. It’s becoming more and more obvious that NVIDIA has no intention of supporting NVFP4 on this hardware. I haven’t even bothered with TRT-LLM because I know they are not putting any support for us into that stack.

Qwen 3.5 122B is a better model in every way. Faster, smarter and has vision support. Ironically Intel gave us the best running quant.

As to the model, you're probably right; my "hope" was that this would be a working example for DB10 — that NVIDIA would give us something….
We will see what future brings :)

I might give it a go and try TP=2 on TRT-LLM and see if I can force a true TP=2 sharded load path instead of repeating the single-node staging failure on each node.

BTW, what did we expect: "Config C — NVFP4, DGX Spark" in an Advanced Deployment section, at its very bottom :)

They do, and they are actually putting an effort into this currently, in part thanks to our activity here on this forum. We’ll see.

9 Likes

I know you are in the weeds so I take your word for it. It’s been a long time coming!

It seems like there is a lot of effort being put into TRT-LLM right now ahead of GTC. Looking at the main branch, there have been 29+ recent NemotronH commits, SM121 MoE fixes, Mamba optimizations, and a new TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL env var that just landed. RC7 is already out.

For those of us on DGX Spark trying to run Nemotron-3-Super-120B today: rc4 and rc5 both fail on single-node due to CPU staging during weight loading (~109 GiB RAM spike on a 128 GiB UMA system). It took me a few hours…
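Since the staging spike happens when multiple weight shards are loaded into CPU RAM at once, the TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL variable mentioned above looks like it's aimed at exactly this. A hypothetical sketch — the env var name comes from the main-branch commit noted earlier, but the serve flags are my assumptions and may differ per rc, so check `trtllm-serve --help`:

```shell
# Sketch: serialize weight loading so shards are staged one at a time,
# hopefully keeping the peak CPU RAM usage under the 128 GiB UMA ceiling.
export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1
trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --backend pytorch \
  --max_batch_size 8
```

Entirely unverified on rc7 — worth a try once the image is pulled.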

vLLM works fine at 16.7 tok/s via Marlin dequant.

Looking forward to what comes out of GTC — feels like rc7 or a post-GTC release might be the one that actually lands cleanly on Spark for 120B.

3 Likes

I’m wondering if the 595 driver and CUDA 13.2 are going to move the needle at all. I’d try it now, but I don’t want to be up all night lol

Just hold tight until it’s released to the official stable update channels.

Thanks for commenting on this matter. With NVFP4 not working, it creates a bottleneck when using the Spark. NVIDIA should have given this more focused attention earlier on. Can you elaborate on what is being done to get NVFP4 running on Spark, or is there a place where I can follow the progress?

I am kind of new to these things. Can you tell me which Docker image you use for vLLM?
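Not the poster, but since the thread started from the vLLM nightly Docker image, a sketch of the usual way to run it follows. The tag and flags here are my assumptions — check Docker Hub and the vLLM docs for the current nightly tag before relying on this:

```shell
# Sketch: run the OpenAI-compatible vLLM server from the nightly image.
# --ipc=host is needed for PyTorch shared memory; the HF cache mount
# avoids re-downloading the ~70 GiB of weights on every container start.
docker run --rm --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:nightly \
  nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --trust-remote-code
```

The image's entrypoint is `vllm serve`, so everything after the image name is passed through as serve arguments, same as the recipe command earlier in the thread.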