NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

Where are all those auto-moding AI agents when we need them? :)

2 Likes

I tried to run it with the vllm nightly docker image and the flags from the readme, but it spewed several errors after starting up, like:

(EngineCore_DP0 pid=98) 2026-03-11 19:27:07,630 - WARNING - autotuner.py:496 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_cutlass_fused_moe_module.<locals>.MoERunner object at 0xfcb2cd124470> 15, due to failure while profiling: [TensorRT-LLM][ERROR] Assertion failed: Failed to initialize cutlass TMA WS grouped gemm. Error: Error Internal (/workspace/build/aot/generated/cutlass_instantiations/120/gemm_grouped/120/cutlass_kernel_file_gemm_grouped_sm120_M256_BS_group0.generated.cu:60)
(EngineCore_DP0 pid=98) 1       0xfcb299455ed4 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 84
(EngineCore_DP0 pid=98) 2       0xfcb2999b3074 /usr/local/lib/python3.12/dist-packages/flashinfer_jit_cache/jit_cache/fused_moe_120/fused_moe_120.so(+0x6d3074) [0xfcb2999b3074]
(EngineCore_DP0 pid=98) 3       0xfcb2999b3294 void tensorrt_llm::kernels::cutlass_kernels_oss::tma_warp_specialized_generic_moe_gemm_kernelLauncher<cutlass::arch::Sm120, __nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, void, tensorrt_llm::cutlass_extensions::EpilogueOpDefault, (tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput::EpilogueFusion)3, cute::tuple<cute::C<256>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> >, false, false, false, false>(tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, int, int, CUstream_st*, int*, unsigned long*, cute::tuple<int, int, cute::C<1> >, cute::tuple<int, int, cute::C<1> >) + 84
(EngineCore_DP0 pid=98) 4       0xfcb2995b0820 void tensorrt_llm::kernels::cutlass_kernels_oss::dispatchMoeGemmSelectClusterShapeTmaWarpSpecialized<cutlass::arch::Sm120, __nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, tensorrt_llm::cutlass_extensions::EpilogueOpDefault, (tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput::EpilogueFusion)3, cute::tuple<cute::C<256>, cute::C<128>, cute::C<128> > >(tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, int, tensorrt_llm::cutlass_extensions::CutlassGemmConfig, int, CUstream_st*, int*, unsigned long*) + 192
(EngineCore_DP0 pid=98) 5       0xfcb2995b0fd4 void tensorrt_llm::kernels::cutlass_kernels_oss::dispatchMoeGemmSelectTileShapeTmaWarpSpecialized<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, tensorrt_llm::cutlass_extensions::EpilogueOpDefault, (tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput::EpilogueFusion)3>(tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, int, tensorrt_llm::cutlass_extensions::CutlassGemmConfig, int, CUstream_st*, int*, unsigned long*) + 1108

And then eventually:

torch.AcceleratorError: CUDA error: an illegal instruction was encountered  

Which is the same failure I had with the nano version.

I have done the mod on my local machine and have the recipe ready if you want me to PR it. With:

command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --port {port} --host {host} \
    --trust-remote-code \
    --tensor-parallel-size {tensor_parallel} \
    --kv-cache-dtype fp8 \
    --load-format fastsafetensors \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin super_v3_reasoning_parser.py \
    --reasoning-parser super_v3

EDIT: PR done:

1 Like

NVIDIA is becoming the Apple of software, improving and perfecting what others do: NemoClaw, Qwen3.5. Except for the hardware, where they are king outright.

Tested it via the NIM API; it doesn't seem to be as good at coding as Qwen 3.5 122B or OSS 120B.

1 Like

I will try it with my picoclaw locally on the Spark.

1 Like

Underwhelming.

It looks like they are emphasizing a dramatic improvement in speed while being nearly the same quality as Qwen 3.5 122b. But of course, as we have become well accustomed to, NVFP4 is an inference decelerator on the DGX Spark. So disappointing.

See that light green bar – that’s supposed to be the increase in throughput with NVFP4. Although it looks oddly like a middle finger to me.

5 Likes

Have you or anyone tried it with TensorRT-LLM using the official deployment guidance I linked above?

vLLM is clearly not optimized for this yet. The architecture is interesting in multiple ways but especially that expert-parallel is strongly preferred over tensor-parallel.
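For anyone who wants to experiment with that preference, a hypothetical invocation sketch (the `--enable-expert-parallel` flag exists in recent vLLM builds, but whether it helps on this model/hardware combination is untested here, and the TP size is just an example):

```shell
# Sketch: route the MoE layers through expert parallelism instead of
# sharding them with tensor parallelism. Attention/dense layers still
# follow --tensor-parallel-size; check `vllm serve --help` on your build.
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --trust-remote-code
```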

For a single Spark user, it is difficult to find clear advantages compared to Qwen3.5-122B.

Compared to the Qwen3.5-122B GPTQ version, the speed difference is not very significant. While the generation quality feels much better than GPT-OSS-120B, I don’t really notice a difference compared to Qwen3.5-122B-A10B-GPTQ-Int4.

It also does not support vision, so an additional model such as an OCR model must be served separately. Because of this, it does not seem very appealing for a single Spark user. If it had been released around the same time as Nemotron-3-Nano, it would probably have been very impressive, but right now it feels a bit late.

However, considering that it supports a 1M context window with performance similar to Qwen3.5-122B, this appears to be a very strong advantage for environments where two or more Spark users set a 1M context and work with large amounts of information.

That said, when dealing with very large contexts, it might still be a bit slow on a dual-Spark setup.

Yeah, llama.cpp benchmarks are looking a couple t/s slower than I was getting for Qwen 3.5 122B. I'll probably try it out later today just to experiment with the longer context window, but I was really hoping for more speed. Perhaps it will get better.

Confirmed. Running Super 120B NVFP4 on single Spark via vLLM — 16.7 tok/s, Marlin dequant (FP4→BF16), not native FP4 GEMM. Model fits (69.5 GiB), KV cache fits, tok/s is what Marlin gives you. That’s the real story.
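A rough roofline sanity check makes that number plausible. The figures below are my own assumptions, not from the measurement above: ~12B active parameters per decoded token (the A12B part), 0.5 bytes/param for NVFP4 weights, and ~273 GB/s unified memory bandwidth on the Spark; scales, KV cache, and activations are ignored.

```shell
# Back-of-envelope decode ceiling: decoding is bandwidth-bound on weight
# reads, so tokens/s is capped near bandwidth / bytes-read-per-token.
ACTIVE_PARAMS_B=12        # billions of active params per token (A12B)
BYTES_PER_PARAM_X10=5     # 0.5 bytes/param (FP4), in tenths for integer math
BW_GBS=273                # approx. DGX Spark UMA bandwidth, GB/s

GB_PER_TOKEN=$(( ACTIVE_PARAMS_B * BYTES_PER_PARAM_X10 / 10 ))
CEILING_TPS=$(( BW_GBS / GB_PER_TOKEN ))
echo "weight read per token: ~${GB_PER_TOKEN} GB"
echo "bandwidth-bound ceiling: ~${CEILING_TPS} tok/s"
```

Under those assumptions the ceiling comes out around 45 tok/s, so 16.7 tok/s through the Marlin dequant path is well below the memory roof, consistent with the dequant overhead being the limiter.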

Also tried TRT-LLM rc4 and rc5 today. Both dead on arrival — 120B hits the UMA ceiling during loading regardless of config. The rc4 "Nemotron fix" is for the 49B model, not 120B. Hard ceiling, not a tuning problem.

So: NVFP4 on Spark = memory compression format today, not a compute accelerator. Native FP4 GEMM on SM121 via CUTLASS 4.2+ is the actual unlock. Until then the chart is accurate in the way you described.
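Even as a pure memory format, the compression is what makes the model fit at all. A rough sketch under my own simplifying assumptions (120B total params, flat bytes/param, ignoring scale factors, embeddings, and runtime overhead):

```shell
# Why FP4 matters for capacity on a 128 GiB UMA system even when the
# GEMMs still run in BF16 after dequant: weights stay stored at 4 bits.
TOTAL_PARAMS_B=120
FP4_GB=$(( TOTAL_PARAMS_B / 2 ))    # 0.5 bytes/param
BF16_GB=$(( TOTAL_PARAMS_B * 2 ))   # 2 bytes/param
echo "FP4 weights:  ~${FP4_GB} GB (fits under 128 GiB UMA)"
echo "BF16 weights: ~${BF16_GB} GB (does not fit)"
```

That ~60 GB figure lines up with the 69.5 GiB observed footprint above once scales and non-quantized tensors are included.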

Hope someone can prove me wrong — soon :)

UPDATE: rc4 actually fails differently than rc5 — nemotron_h model type not registered, MPI worker crashes 16 seconds after start, before weight loading even begins. rc5 at least gets to the weight loading stage before OOMing. Two different failures. RC7 pulling now.

1 Like

Thanks for going through the effort. It’s becoming more and more obvious that NVIDIA has no intention of supporting NVFP4 on this hardware. I haven’t even bothered with TRT-LLM because I know they are not putting any support for us into that stack.

Qwen 3.5 122B is a better model in every way. Faster, smarter and has vision support. Ironically Intel gave us the best running quant.

As to the model, you're probably right; my "hope" was that this would be a working example for DB10 — that NVIDIA would give us something….
We will see what future brings :)

I might give it a go and try TP=2 on TRT-LLM and see if I can force a true TP=2 sharded load path instead of repeating the single-node staging failure on each node.

BTW, what did we expect: "Config C — NVFP4, DGX Spark" in an Advanced Deployment section, at its very bottom :)

They do, and they are actually putting an effort into this currently, in part thanks to our activity here on this forum. We’ll see.

9 Likes

I know you are in the weeds so I take your word for it. It’s been a long time coming!

It seems like there is a lot of effort being put into TRT-LLM right now ahead of GTC. Looking at the main branch, there have been 29+ recent NemotronH commits, SM121 MoE fixes, Mamba optimizations, and a new TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL env var that just landed. RC7 is already out.

For those of us on DGX Spark trying to run Nemotron-3-Super-120B today: rc4 and rc5 both fail on single-node due to CPU staging during weight loading (~109 GiB RAM spike on a 128 GiB UMA system). It took me a few hours…
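Since the staging spike happens when multiple weight shards are loaded into CPU RAM at once, the TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL variable mentioned above looks like it's aimed at exactly this. A hypothetical sketch — the env var name comes from the main-branch commit noted earlier, but the serve flags are my assumptions and may differ per rc, so check `trtllm-serve --help`:

```shell
# Sketch: serialize weight loading so shards are staged one at a time,
# hopefully keeping the peak CPU RAM usage under the 128 GiB UMA ceiling.
export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1
trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --backend pytorch \
  --max_batch_size 8
```

Entirely unverified on rc7 — worth a try once the image is pulled.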

vLLM works fine at 16.7 tok/s via Marlin dequant.

Looking forward to what comes out of GTC — feels like rc7 or a post-GTC release might be the one that actually lands cleanly on Spark for 120B.

3 Likes

I’m wondering if the 595 driver and CUDA 13.2 are going to move the needle at all. I’d try it now, but I don’t want to be up all night lol

Just hold tight until it’s released to the official stable update channels.

Thanks for commenting on this matter. With NVFP4 not working, it creates a bottleneck when using the Spark. NVIDIA should have given this more focused attention earlier on. Can you elaborate on what is being done to get NVFP4 running on Spark, or is there a place where I can follow the progress?

I am kind of new to these things. Can you tell me which Docker image you use for vLLM?
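Not the poster, but since the thread started from the vLLM nightly Docker image, a sketch of the usual way to run it follows. The tag and flags here are my assumptions — check Docker Hub and the vLLM docs for the current nightly tag before relying on this:

```shell
# Sketch: run the OpenAI-compatible vLLM server from the nightly image.
# --ipc=host is needed for PyTorch shared memory; the HF cache mount
# avoids re-downloading the ~70 GiB of weights on every container start.
docker run --rm --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:nightly \
  nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --trust-remote-code
```

The image's entrypoint is `vllm serve`, so everything after the image name is passed through as serve arguments, same as the recipe command earlier in the thread.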