DGX Spark, Nemotron3, and NVFP4: Getting to 65+ tps

Getting an NVFP4 quant of the new nemotron3-nano to work on the DGX Spark was challenging. However, copy and paste this and it should “just work”:

docker run --rm -it --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/vllm-dgx-spark:v11 \
  serve cybermotaz/nemotron3-nano-nvfp4-w4a16 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser deepseek_r1
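Once the container is up, a quick way to sanity-check it is to hit the OpenAI-compatible endpoint. Here is a minimal stdlib-only client sketch; the host/port and model name are assumptions matching the docker command above, so adjust as needed:

```python
# Minimal smoke test for the OpenAI-compatible endpoint exposed by the
# container above. Uses only the standard library.
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=64):
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(base_url, model, prompt):
    """POST a chat completion request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running:
#   print(ask("http://localhost:8000",
#             "cybermotaz/nemotron3-nano-nvfp4-w4a16", "Say hello"))
```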

For the debugging journey, see the corresponding medium article: https://blog.thomaspbraun.com/dgx-spark-nemotron3-and-nvfp4-getting-to-65-tps-8c5569025eb6


The article says Nemotron Nano 3 has a 128k context window, but the max is actually a 1M context window.

Thanks, so our issues with the official FP8 quant are not related to the model itself; it’s either a bug in the corresponding FlashInfer kernels, or the FP8 quant itself is broken.

This one works just fine with my Docker, and even with pre-built nightly wheels.

It does seem to have some config issues: there are scales in the weights that are not specified in the model config, so it spits out a few warnings, and FlashInfer throws a few errors. After that, though, it loads just fine and runs at 67 t/s, which is still slower than it should be for a model this size, but that’s because NVFP4 support is still not working well.

Now, interestingly enough, if you read the model description, they recommend using VLLM_USE_FLASHINFER_MOE_FP4=1 and VLLM_FLASHINFER_MOE_BACKEND=throughput, the latter being the opposite of your findings.

Looks like VLLM_USE_FLASHINFER_MOE_FP4=1 is the key here (and NVIDIA recommends a similar parameter for their FP8 model), otherwise the model doesn’t load.

Also, Nemotron uses its own reasoning parser; you need it for clients to receive proper thinking blocks.

Here is what works on my builds:

Download the parser first:

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/resolve/main/nano_v3_reasoning_parser.py

Then run:

VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve cybermotaz/nemotron3-nano-nvfp4-w4a16 \
	--trust-remote-code  \
	--kv-cache-dtype fp8 \
	--load-format fastsafetensors \
	--gpu-memory-utilization 0.7 \
	--host 0.0.0.0 --port 8888 \
	--enable-auto-tool-choice \
	--tool-call-parser qwen3_coder \
	--reasoning-parser-plugin nano_v3_reasoning_parser.py \
	--reasoning-parser nano_v3

Bench for 1 request:

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  1.83
Total input tokens:                      12
Total generated tokens:                  123
Request throughput (req/s):              0.55
Output token throughput (tok/s):         67.16
Peak output token throughput (tok/s):    66.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          73.71
---------------Time to First Token----------------
Mean TTFT (ms):                          51.97
Median TTFT (ms):                        51.97
P99 TTFT (ms):                           51.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.58
Median TPOT (ms):                        14.58
P99 TPOT (ms):                           14.58
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.58
Median ITL (ms):                         14.51
P99 ITL (ms):                            16.99
==================================================
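As a sanity check, the TPOT and throughput rows above are consistent with each other: 1000 ms divided by the mean time per output token should land near the reported output throughput, with a small gap because TPOT excludes the first token.

```python
# Cross-check the benchmark output: time-per-output-token (ms) and
# output throughput (tok/s) should be roughly reciprocal.
mean_tpot_ms = 14.58   # Mean TPOT from the table above
reported_tps = 67.16   # Output token throughput from the table above

implied_tps = 1000 / mean_tpot_ms
print(f"implied throughput: {implied_tps:.1f} tok/s")  # ~68.6
```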

I tried @tbraun96’s code at the top of this page and it did not “just work.” I got a bunch of errors, including “torch.AcceleratorError: CUDA error: misaligned address” (several times). Not sure what I’m missing, since I’m using his Docker container avarok/vllm-dgx-spark:v11.

@eugr when you say “This one works just fine with my Docker, and even with pre-built nightly wheels.” which Docker are you referring to?

This one: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

But you need to launch with different parameters than in his post:

Inside the container, download the parser first:

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/resolve/main/nano_v3_reasoning_parser.py

Then run:

VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve cybermotaz/nemotron3-nano-nvfp4-w4a16 \
	--trust-remote-code  \
	--kv-cache-dtype fp8 \
	--load-format fastsafetensors \
	--gpu-memory-utilization 0.7 \
	--host 0.0.0.0 --port 8888 \
	--enable-auto-tool-choice \
	--tool-call-parser qwen3_coder \
	--reasoning-parser-plugin nano_v3_reasoning_parser.py \
	--reasoning-parser nano_v3

It will still spit out a bunch of warnings, as the model itself seems to be a bit broken, but it will work. I wouldn’t consider this particular quant for anything important yet, though.


Ok, thanks. That’s what I thought, but I didn’t think he was running on two Sparks. I’ll see if yours works with the changes given.

He was running on a single Spark, and so am I; there is not much gain in running this model on two Sparks anyway. You don’t have to use my container in a cluster, though; it works just as well on a single Spark.

If you haven’t used my builds in a while, make sure you pull the latest changes and rebuild the container. There is also a new way of building that pulls new CUDA 13 nightly wheels, which speeds things up considerably. There are lots of other improvements too, so make sure to read the README.


Ran your container on a single Spark. Worked like a charm. cybermotaz/nemotron3-nano-nvfp4-w4a16 at 67 tokens/sec.

By the way, have you gotten GLM-4.6V working? I tried with llama.cpp but no luck.

Thanks for all your help. Very much appreciated.

Yes, but you need to install transformers 5.0.0rc1 to use it. I haven’t baked it into my build yet as it was causing issues with other models.

You will need to launch the container on both nodes first.

Once inside the container, install transformers 5 first (on both nodes):

pip install "transformers>=5.0.0" --pre -U

Then, for the original FP8 version, run:

vllm serve zai-org/GLM-4.6V-FP8 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --allowed-local-media-path / \
  --mm-encoder-tp-mode data \
  -tp 2 \
  --gpu-memory-utilization 0.7 \
  --distributed-executor-backend ray \
  --host 0.0.0.0 \
  --port 8888

You should get around 23 t/s with this when running on the cluster.

Or you can run AWQ 4-bit quant:

vllm serve cyankiwi/GLM-4.6V-AWQ-4bit \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --allowed-local-media-path / \
  --mm-encoder-tp-mode data \
  -tp 2 \
  --gpu-memory-utilization 0.7 \
  --distributed-executor-backend ray \
  --host 0.0.0.0 \
  --port 8888

This will give you 32 t/s on the cluster.

I think I’ll add a parameter to my build to enable transformers 5.


Great. I will give it a try when I have time. Hope it’s worth it.

I should add I never would have figured out all those arguments myself.

Good thing you don’t have to :) Usually, the proper arguments are provided on the model’s Hugging Face page, at least on the original model’s page, and most of them are applicable to the quantizations too. Also, vLLM has a recipes page for different models.


OK, I’ve added the ability to use transformers v5 in the Docker builds.
I suggest keeping this image separate from the regular one in case of incompatibilities. It seems to work much better with other models now, but there may be cases where it breaks something.

To build the image (using the fast wheels build here) and distribute it to your cluster nodes, you can use this command. You can name the image differently; I just use vllm-node-whl-tf5 here.

./build-and-copy.sh -t vllm-node-whl-tf5 --use-wheels --pre-tf --pre-flashinfer -c

Then, to run the model on all nodes of the cluster, you can use the new convenience script on the head node:

./launch-cluster.sh  \
        -t vllm-node-whl-tf5 \
        exec vllm serve zai-org/GLM-4.6V-FP8 \
        --tool-call-parser glm45 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        --allowed-local-media-path / \
        --mm-encoder-tp-mode data \
        -tp 2 \
        --gpu-memory-utilization 0.7 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000 \
        --load-format fastsafetensors

It will autodiscover your interface configuration, start the cluster, and, if everything is properly configured, launch the model. When you quit the process, it will shut down the cluster automatically.


Looks like –pre-tf is not a valid parameter to build-and-copy.sh. I’m running it now without that flag. When you get back from your boat perhaps you could let me know what you think.

it’s --pre-tf, and you need to pull the latest changes from the repo first.
I test everything prior to posting here ;)

GLM 4.6V won’t work without that flag (unless you go into the container and install transformers 5 manually).

I did use --pre-tf, but the autocorrect in the browser turned the two hyphens into a dash. However, I have just run git pull origin main and see that 5 files got updated, so hopefully it will work.

Now if we could just teach Nvidia employees to test their stuff before posting it for us to use, we would all be better off.

OK, I reran the command after pulling the latest changes. It took about 15 minutes to get going (running on two Sparks), which was a bit disappointing. There were a number of warnings, but I’m used to that and don’t want to chase them down. However, the tokens per second isn’t too bad (about 23), and based on the small set of queries I use to test models, I was impressed with the quality of the answers. Among the best I’ve seen.

Thanks again for your help.
