Nemotron-3-Super-120B-A12B-NVFP4 on single DGX Spark: 23.45 tok/s (spark-arena.com/benchmarks)

Sharing benchmark results for Nemotron-3-Super-120B-A12B-NVFP4 on a single DGX Spark.

After a few weeks of tuning, two results are worth sharing:

  1. 23.45 tok/s: clean Spark Arena benchmark run with llama-benchy (tg128, no production services, 104 tests, 5h 49m, zero crashes or OOMs)
  • Stable from d0 to d100000, no performance cliff observed
  2. 23.20 tok/s: same stack with Open WebUI + NemoHermes running alongside, measured with a custom benchmark script
  • Running stably as a personal agent
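
For anyone who wants a rough cross-check of the second number, here is a minimal sketch of how a tok/s measurement against an OpenAI-compatible endpoint could look. This is not the custom script mentioned above. `BASE_URL` and `MODEL` are hypothetical placeholders, and the sketch assumes the server exposes `/v1/completions` and reports `usage.completion_tokens`, as vLLM's OpenAI-compatible server does.

```python
import json
import time
import urllib.request

# Assumed values for illustration only; adjust to your own deployment.
BASE_URL = "http://localhost:8000/v1"  # hypothetical endpoint
MODEL = "nemotron-3-super"             # hypothetical served model name


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure_once(prompt: str, max_tokens: int = 128) -> float:
    """Send one non-streaming completion request and return tok/s.

    The timer covers the whole request, including prompt processing,
    so the result reads slightly lower than a pure decode (tg) figure.
    """
    payload = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

Calling `measure_once("Explain speculative decoding briefly.")` several times and averaging the results smooths out run-to-run noise. Because the timing is end to end, the result is not directly comparable to a llama-benchy tg128 number.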

Stack:

  • vLLM nightly (sha256:3dbe092e) — stable release behavior may vary
  • NVFP4 + Marlin + MTP speculative decoding
  • super_v3 reasoning parser

Links:

If someone reproduces this, finds issues, or beats it, I’d genuinely be interested in comparing notes.

Hi, first of all thanks for the hard work, but is the nim-cache installation/setup/configuration missing from the installation scripts?

Hey @kafej666, good catch… you’re right, the nim-cache setup is missing from the install scripts. The model needs to be downloaded separately via NGC before running start.sh.

You can use ngc registry model download nvidia/nim/nemotron-3-super-120b-a12b:rl-030326-nvfp4 and point NIM_CACHE to the parent directory.

I’ve added a download_model.sh script to the setup/ folder to cover this gap.

For anyone else finding this thread, here is the command you need to pull the NVFP4 profile into your local cache:

docker run -it --rm \
  --runtime=nvidia --gpus all \
  -v "$HOME/nim-cache:/opt/nim/.cache" \
  -e NGC_API_KEY="$NGC_API_KEY" \
  nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:latest \
  download-to-cache --profiles 3b37a659a22c9390abe7b16aeb29c301c2e9c0e12e5b0fa76171681df31930e0

Once that finishes, the start.sh script will automatically pick up the rl-030326-nvfp4 snapshot.

Let me know if you run into any other issues getting it running!

Hey @0rand
Good question. Here are a few key differences I can think of compared to a standard setup:

  1. Raw nightly vllm-openai image (sha256:3dbe092e) instead of the official NIM container. I found the NIM wrapper clamps gpu_memory_utilization to 0.50 on UMA hardware, which immediately causes an OOM for a 75GB model on a 128GB DGX Spark.
  2. --tool-call-parser qwen3_coder instead of the hermes parser I tried earlier. The hermes parser crashed the NVIDIA vLLM fork with a 500 error for me.
  3. VLLM_NVFP4_GEMM_BACKEND=marlin env var explicitly set.
  4. MTP speculative decoding enabled.
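
The arithmetic behind point 1 can be sketched as follows. The 128GB, 75GB and 0.50 figures come from this thread. The function names are made up for illustration, and the model treats gpu_memory_utilization as a simple fraction of total unified memory, which is a simplification of how vLLM budgets memory.

```python
def memory_budget_gb(total_gb: float, gpu_memory_utilization: float) -> float:
    """Memory the server may use, as a plain fraction of total memory."""
    return total_gb * gpu_memory_utilization


def headroom_gb(total_gb: float, gpu_memory_utilization: float,
                model_gb: float) -> float:
    """Budget left after loading the weights; negative means they do not fit."""
    return memory_budget_gb(total_gb, gpu_memory_utilization) - model_gb


TOTAL_GB = 128.0  # DGX Spark unified memory (from the thread)
MODEL_GB = 75.0   # approximate NVFP4 weight size (from the thread)

# With the clamp at 0.50 the budget is 64 GB, which is 11 GB short of the weights.
clamped_headroom = headroom_gb(TOTAL_GB, 0.50, MODEL_GB)

# Smallest fraction at which the weights alone fit, before any KV cache.
min_utilization = MODEL_GB / TOTAL_GB

print(f"headroom at 0.50: {clamped_headroom:.1f} GB")
print(f"minimum utilization for weights only: {min_utilization:.3f}")
```

Any workable setting has to sit well above that minimum, because the KV cache and activations need room on top of the weights.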

Full launch recipe is in the repo: https://github.com/airawatraj/dgx-spark-nemotron-super-agent
The docker/start.sh has the exact command with all flags.

Sharing a quick follow-up to my earlier Nemotron post.

RedHatAI/Qwen3.6-35B-A3B-NVFP4 on a single DGX Spark via Atlas: 218.85 tok/s (tg128, c1).

This was a much lighter setup effort than the Nemotron run. The speed comes from Atlas’s native NVFP4 kernels + MTP K=2 - not weeks of tuning.

A few honest notes:

If someone pushes it further, would love to compare notes.

Very interesting that it behaves the opposite way from other models, where higher concurrency means more tok/sec.

When I read your initial speed, I was ready to test it, but the drop to 1/2 of the speed at concurrency=2 pushed me away. If there is a way to optimize it to at least maintain the speed at 2x and 4x, this would definitely be a huge success.

Can’t wait for more NVFP4 optimizations! :)

You’re right, and it’s a fair observation. The concurrency dropoff is real.

Looking at the data more carefully, this appears to be Atlas/speculative decoding behaviour under concurrency rather than a memory ceiling issue - other submissions with higher gpu_memory_utilization show the same c1 to c2 pattern.

The 0.75 memory ceiling in my setup mainly explains the long-context cliff at 100k+, not the short-context concurrency dropoff.
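
When comparing c1 and c2 numbers, it helps to be explicit about whether a figure is per-request or aggregate throughput, since "half the speed" means very different things in each case. A small sketch of the conversion, using the 218.85 tok/s c1 result from above and a halved c2 figure that is purely illustrative (it is the rough "1/2" mentioned in the thread, not a measured value):

```python
def aggregate_from_per_request(per_request_tok_s: float, concurrency: int) -> float:
    """Total tokens/s across all streams, given the speed of each stream."""
    return per_request_tok_s * concurrency


def per_request_from_aggregate(aggregate_tok_s: float, concurrency: int) -> float:
    """Speed seen by each stream, given the total across all streams."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return aggregate_tok_s / concurrency


C1_TOK_S = 218.85            # c1 result reported in this thread
HALVED_TOK_S = C1_TOK_S / 2  # illustrative "half the speed" figure, not measured

# Reading A: the halved figure is per-request speed at c2.
# Aggregate throughput then matches c1, so concurrency adds nothing.
reading_a_aggregate = aggregate_from_per_request(HALVED_TOK_S, 2)

# Reading B: the halved figure is aggregate throughput at c2.
# Each stream then runs at a quarter of the c1 speed.
reading_b_per_request = per_request_from_aggregate(HALVED_TOK_S, 2)

print(f"reading A aggregate:   {reading_a_aggregate:.2f} tok/s")
print(f"reading B per request: {reading_b_per_request:.2f} tok/s")
```

Under reading A the total work done is flat from c1 to c2. Under reading B it regresses, which is the more severe case.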

Would be genuinely interesting to see if someone finds a configuration that holds better at c2 and beyond.