Sharing benchmark results for Nemotron-3-Super-120B-A12B-NVFP4 on a single DGX Spark.
After a few weeks of tuning, two results are worth sharing:
- 23.45 tok/s: clean Spark Arena benchmark measured with llama-benchy (tg128, no production services, 104 tests, 5h 49m, zero crashes or OOMs)
  - Stable from d0 to d100000, no performance cliff observed
- 23.20 tok/s: same stack with Open WebUI + NemoHermes running alongside, measured with a custom benchmark script
  - Running stably as a personal agent
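For a quick sense of how small the gap between the two figures is, the relative slowdown can be worked out directly. This is only a minimal sketch: the function name is illustrative, and since the two numbers came from different tools (llama-benchy vs. a custom script), the comparison is approximate at best.

```python
def relative_slowdown(baseline_tps: float, loaded_tps: float) -> float:
    """Fractional throughput loss of loaded_tps relative to baseline_tps."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (baseline_tps - loaded_tps) / baseline_tps


# Figures quoted in the post (tok/s).
clean_tps = 23.45   # clean benchmark run
loaded_tps = 23.20  # with Open WebUI + NemoHermes alongside

slowdown = relative_slowdown(clean_tps, loaded_tps)
print(f"slowdown with services running: {slowdown:.2%}")
```

That works out to roughly a 1% difference, which is likely within run-to-run noise.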
Stack:
- vLLM nightly (sha256:3dbe092e) — stable release behavior may vary
- NVFP4 + Marlin + MTP speculative decoding
- super_v3 reasoning parser
Links:
If someone reproduces this, finds issues, or beats it, I’d genuinely be interested in comparing notes.
Hi, first of all, thanks for the hard work, but is the nim-cache installation/setup/configuration missing from the installation scripts?
Hey @kafej666, good catch… you’re right, the nim-cache setup is missing from the install scripts. The model needs to be downloaded separately via NGC before running start.sh.
You can use ngc registry model download nvidia/nim/nemotron-3-super-120b-a12b:rl-030326-nvfp4 and point NIM_CACHE to the parent directory.
I’ve added a download_model.sh script to the setup/ folder to cover this gap.
For anyone else finding this thread, here is the command you need to pull the NVFP4 profile into your local cache:
docker run -it --rm \
  --runtime=nvidia --gpus all \
  -v "$HOME/nim-cache:/opt/nim/.cache" \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:latest \
  download-to-cache --profiles 3b37a659a22c9390abe7b16aeb29c301c2e9c0e12e5b0fa76171681df31930e0
Once that finishes, the start.sh script will automatically pick up the rl-030326-nvfp4 snapshot.
Let me know if you run into any other issues getting it running!
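Multi-line shell commands break easily when copied (a lost line continuation or a mangled dash is enough), so one option is to assemble the same docker invocation as an argument list. A minimal sketch in Python, assuming the image tag and profile hash quoted above; the function name and the cache directory default are illustrative and not part of the repo. It relies on `docker run -e NGC_API_KEY` (with no value) forwarding the variable from the host environment, so the key does not appear in the printed command.

```python
import os
import shlex

# Values copied from the command in this thread.
IMAGE = "nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:latest"
PROFILE = "3b37a659a22c9390abe7b16aeb29c301c2e9c0e12e5b0fa76171681df31930e0"


def build_download_command(cache_dir: str) -> list[str]:
    """Build the docker argv that downloads the NVFP4 profile into cache_dir."""
    return [
        "docker", "run", "-it", "--rm",
        "--runtime=nvidia", "--gpus", "all",
        "-v", f"{cache_dir}:/opt/nim/.cache",
        "-e", "NGC_API_KEY",  # forwarded from the host environment
        IMAGE,
        "download-to-cache", "--profiles", PROFILE,
    ]


# Hypothetical default location, mirroring $HOME/nim-cache from the thread.
cache_dir = os.path.join(os.path.expanduser("~"), "nim-cache")
cmd = build_download_command(cache_dir)

# Print a copy-pasteable, correctly quoted one-liner instead of running it.
print(shlex.join(cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)` once `NGC_API_KEY` is exported in the shell.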
Hey @0rand
Good question. Here are a few key differences I can think of compared to a standard setup:
- Raw nightly vllm-openai image (sha256:3dbe092e) instead of the official NIM container. I found the NIM wrapper clamps gpu_memory_utilization to 0.50 on UMA hardware. On a 128GB DGX Spark that leaves about 64GB (0.50 × 128GB), which immediately causes an OOM for a 75GB model.
- --tool-call-parser qwen3_coder instead of the hermes parser I tried earlier. The hermes parser crashed the NVIDIA vLLM fork with a 500 error for me.
- VLLM_NVFP4_GEMM_BACKEND=marlin env var explicitly set.
- MTP speculative decoding enabled.
Full launch recipe is in the repo: https://github.com/airawatraj/dgx-spark-nemotron-super-agent
The docker/start.sh has the exact command with all flags.
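The gpu_memory_utilization point above comes down to simple arithmetic, sketched here in Python. The 128GB, 75GB and 0.50 figures are from this post; the helper name is illustrative, and the check deliberately ignores KV cache, activations and whatever the OS itself holds on unified memory, so a real setting needs headroom above the minimum it prints.

```python
def fits_in_budget(total_gb: float, utilization: float, model_gb: float) -> bool:
    """True if the model weights alone fit inside total_gb * utilization."""
    return total_gb * utilization >= model_gb


TOTAL_GB = 128.0  # DGX Spark unified memory
MODEL_GB = 75.0   # approximate model size quoted above
CLAMPED = 0.50    # value the NIM wrapper reportedly clamps to on UMA

budget_gb = TOTAL_GB * CLAMPED
min_utilization = MODEL_GB / TOTAL_GB  # weights only, no KV cache

print(f"budget at {CLAMPED:.2f}: {budget_gb:.0f} GB")
print(f"weights fit at {CLAMPED:.2f}: {fits_in_budget(TOTAL_GB, CLAMPED, MODEL_GB)}")
print(f"minimum utilization for weights alone: {min_utilization:.3f}")
```

So anything at or below roughly 0.59 cannot even hold the weights, before any KV cache is allocated.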
Sharing a quick follow-up to my earlier Nemotron post.
RedHatAI/Qwen3.6-35B-A3B-NVFP4 on a single DGX Spark via Atlas: 218.85 tok/s (tg128, c1).
This was a much lighter setup effort than the Nemotron run. The speed comes from Atlas’s native NVFP4 kernels + MTP K=2 - not weeks of tuning.
A few honest notes:
If someone pushes it further, would love to compare notes.
Very interesting that it behaves the opposite of other models, where higher concurrency means more tok/s.
When I read your initial speed, I was ready to test it, but the drop to half the speed at concurrency=2 pushed me away. If there is a way to optimize it to at least maintain the speed at 2x and 4x, this would definitely be a huge success.
Can’t wait for more NVFP4 optimizations! :)
You’re right, and it’s a fair observation. The concurrency dropoff is real.
Looking at the data more carefully, this appears to be Atlas/speculative decoding behaviour under concurrency rather than a memory ceiling issue - other submissions with higher gpu_memory_utilization show the same c1 to c2 pattern.
The 0.75 memory ceiling in my setup mainly explains the long-context cliff at 100k+, not the short-context concurrency dropoff.
Would be genuinely interesting to see if someone finds a configuration that holds better at c2 and beyond.
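For anyone comparing configurations at c2 and beyond, it may help to express the result as a scaling efficiency rather than a raw number. The following is a rough sketch in Python, assuming the reported figures are aggregate tok/s across all concurrent requests; that interpretation, the function name, and the c2 value are all assumptions. The c2 value is a placeholder derived from the "half the speed" remark in this thread, not a measurement.

```python
def scaling_efficiency(c1_tps: float, cn_total_tps: float, concurrency: int) -> float:
    """Aggregate throughput at concurrency N relative to ideal linear scaling.

    1.0 means N requests run N times faster in total than one request;
    1.0 / N means total throughput did not improve at all over c1.
    """
    if c1_tps <= 0 or concurrency < 1:
        raise ValueError("c1_tps must be positive and concurrency >= 1")
    return cn_total_tps / (concurrency * c1_tps)


C1_TPS = 218.85            # c1 figure from the post
C2_PLACEHOLDER = C1_TPS / 2  # hypothetical: "half the speed" at c2, not measured

print(f"c2 efficiency (placeholder): {scaling_efficiency(C1_TPS, C2_PLACEHOLDER, 2):.2f}")
print(f"c2 efficiency if total tok/s merely held: {scaling_efficiency(C1_TPS, C1_TPS, 2):.2f}")
```

Under these assumptions, a configuration that simply maintains the c1 speed at c2 would score 0.50, so that is a reasonable first target to compare against.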