NVIDIA folks -- where is this promised NVFP4 speedup?

Today, NVIDIA has announced that with NVFP4 support, the DGX Spark delivers up to a 2.5x boost in the Qwen 235B model (two DGX Sparks paired).

A boost compared to what, running on a CPU? Early adopters have been patiently waiting for proper software support, which makes this marketing spin even harder to read; the campaign probably cost more than a few developers dedicated to optimizing the stack.

Others on the forum – am I wrong here?

3 Likes

The graph shows the boost for TRT-LLM for Qwen 235B on two Sparks.

Today, NVIDIA has announced that with NVFP4 support, the DGX Spark delivers up to a 2.5x boost in the Qwen 235B model (two DGX Sparks paired).

So maybe it’s time to test the latest TensorRT-LLM version again.
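
If anyone wants to try without building from source, the release containers on NGC can be pulled directly; for example (tag taken from the rc builds discussed below, and assuming you’re already logged in to nvcr.io):

docker pull nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc7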


1 Like

I’m going to do the same

I wonder why they didn’t just post the performance numbers.
I wouldn’t be surprised if they just achieved the same performance we can already get in vLLM.

If anyone tries the newest TRT-LLM before me, please post the benchmarks, preferably using llama-benchy.
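
For reference, the invocation would look roughly like this, pointed at whatever OpenAI-compatible endpoint you’re serving (the port and model id here are only placeholders):

uv run llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Llama-3.1-8B-Instruct-FP4 \
  --depth 0 4096 8192 16384 32768 \
  --latency-mode generation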

Just ran two models on it. I can bench it, but I can tell right away it’s slower than vLLM. The “2.5x” must be relative to the full-weight model on the same stack?

1 Like

Well, as expected… No wonder they pulled a tactic from Apple’s marketing playbook and didn’t give any specific performance numbers.

Well, I ran llama-benchy and it went like this:

uv run llama-benchy   --base-url http://localhost:8355/v1   --model openai/gpt-oss-120b   --depth 0 4096 8192 16384 32768   --latency-mode generation
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.1.1)
Date: 2026-01-09 14:47:15
Benchmarking model: openai/gpt-oss-120b at http://localhost:8355/v1
Loading text from cache: /home/joseph/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 140865
Warming up...
Warmup (User only) complete (no usage stats found).
Warmup (System+Empty) complete (no usage stats found).
Measuring latency using mode: generation...
Average latency (generation): 121.13 ms
Running test: pp=2048, tg=32, depth=0
Error: 400 - {"object":"error","message":"error downloading or loading vocab file: failed to download or load vocab file","type":"internal_error","param":null,"code":400}
Error: 400 - {"object":"error","message":"error downloading or loading vocab file: failed to download or load vocab file","type":"internal_error","param":null,"code":400}
Error: 400 - {"object":"error","message":"error downloading or loading vocab file: failed to download or load vocab file","type":"internal_error","param":null,"code":400}

And meanwhile TensorRT-LLM (following the playbook’s arguments verbatim, except for increasing the context length to 64K) said this:

[01/09/2026-20:45:49] [TRT-LLM] [I] get signal from executor worker

INFO:     Started server process [150]

INFO:     Waiting for application startup.

INFO:     Application startup complete.

[01/09/2026-20:47:17] [TRT-LLM] [E] Error in harmony chat completion: %s error downloading or loading vocab file: failed to download or load vocab file

INFO:     127.0.0.1:51222 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

[01/09/2026-20:47:18] [TRT-LLM] [E] Error in harmony chat completion: %s error downloading or loading vocab file: failed to download or load vocab file

INFO:     127.0.0.1:51222 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

[01/09/2026-20:47:18] [TRT-LLM] [E] Error in harmony chat completion: %s error downloading or loading vocab file: failed to download or load vocab file

INFO:     127.0.0.1:51222 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

It won’t serve tokens in Open WebUI either. It looks like some kind of template configuration problem, but it should have pulled everything it needed from either HF or the container.
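
For context, the serve command was essentially the playbook’s trtllm-serve invocation with the sequence length bumped to 64K, roughly along these lines (reproduced from memory, so treat the exact flag names as approximate):

trtllm-serve openai/gpt-oss-120b \
  --backend pytorch \
  --host 0.0.0.0 --port 8355 \
  --max_seq_len 65536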

I was able to get Llama 3 8B running on 1.2.0rc6 earlier today. I’ll try to benchmark that next.

1 Like

Do you have a TIKTOKEN_ENCODINGS_BASE path set?
In my Docker image those encodings are baked in at build time, but gpt-oss-120b needs them to work:

# fetch the tiktoken encoding files gpt-oss needs, then point the server at them
mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
export TIKTOKEN_ENCODINGS_BASE=${PWD}/tiktoken_encodings
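
If you’re running the NGC container rather than my image, the directory and the variable both need to be visible inside the container; a minimal sketch (paths are illustrative):

docker run --rm --gpus all \
  -v ${PWD}/tiktoken_encodings:/workspace/tiktoken_encodings \
  -e TIKTOKEN_ENCODINGS_BASE=/workspace/tiktoken_encodings \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc7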

I used the script in the playbook, which looks like it pulls them. I’ll check again. Meanwhile, I benched Llama-3.1-8B-Instruct-FP4; here are the results:

@eugr docker vllm

| model                            |            test |               t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------|----------------:|------------------:|----------------:|----------------:|----------------:|
| nvidia/Llama-3.1-8B-Instruct-FP4 |          pp2048 | 10015.95 ± 649.24 |  248.83 ± 12.80 |  205.40 ± 12.80 |  248.88 ± 12.79 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |            tg32 |      36.70 ± 0.17 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d4096 |  8433.03 ± 163.57 |  772.31 ± 14.39 |  728.88 ± 14.39 |  772.38 ± 14.40 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d4096 |      33.89 ± 0.11 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d8192 |   7353.33 ± 85.36 | 1436.27 ± 16.11 | 1392.84 ± 16.11 | 1436.33 ± 16.10 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d8192 |      30.49 ± 1.49 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d16384 |    5952.05 ± 3.48 |  3140.40 ± 1.89 |  3096.97 ± 1.89 |  3140.48 ± 1.88 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d16384 |      27.57 ± 0.05 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d32768 |    4346.65 ± 9.40 | 8053.53 ± 17.32 | 8010.11 ± 17.32 | 8053.60 ± 17.32 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d32768 |      21.89 ± 0.09 |                 |                 |                 |

tensorrt-llm:1.2.0rc6

| model                            |            test |               t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------|----------------:|------------------:|----------------:|----------------:|----------------:|
| nvidia/Llama-3.1-8B-Instruct-FP4 |          pp2048 | 14407.57 ± 430.30 |   185.58 ± 4.34 |   142.35 ± 4.34 |   185.68 ± 4.37 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |            tg32 |      20.75 ± 0.01 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d4096 |   6513.93 ± 32.10 |   986.67 ± 4.72 |   943.44 ± 4.72 |   986.72 ± 4.70 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d4096 |      20.65 ± 0.01 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d8192 |   9361.80 ± 11.61 |  1137.18 ± 1.31 |  1093.95 ± 1.31 |  1137.22 ± 1.31 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d8192 |      20.58 ± 0.04 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d16384 |   7755.59 ± 13.52 |  2419.93 ± 4.09 |  2376.70 ± 4.09 |  2419.98 ± 4.10 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d16384 |      20.00 ± 0.03 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d32768 |   5746.15 ± 43.11 | 6102.76 ± 45.50 | 6059.53 ± 45.50 | 6102.81 ± 45.50 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d32768 |      18.45 ± 0.04 |                 |                 |                 |

So it wins prompt processing against vLLM at some depths and loses at others. It’s definitely strange that speed goes up from d4096 to d8192; I wasn’t doing anything else on the system at the time.

Token generation is abysmal across the board relative to vLLM. I’m not sure what the advantage is, and I definitely don’t know what the 2.5x comparison is to.

Will look at gpt-oss script again and see if I can bench it.

Yup, I hadn’t set the TIKTOKEN_ENCODINGS_BASE source. Duh. Anyway, last update on this, with GPT-OSS-120B:

@eugr vllm

| model               |            test |             t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:--------------------|----------------:|----------------:|-----------------:|-----------------:|-----------------:|
| openai/gpt-oss-120b |          pp2048 | 4585.37 ± 11.38 |    525.33 ± 1.11 |    446.64 ± 1.11 |    617.47 ± 0.92 |
| openai/gpt-oss-120b |            tg32 |    33.53 ± 0.05 |                  |                  |                  |
| openai/gpt-oss-120b |  pp2048 @ d4096 | 3731.46 ± 17.66 |   1725.27 ± 7.77 |   1646.58 ± 7.77 |   1822.66 ± 7.08 |
| openai/gpt-oss-120b |    tg32 @ d4096 |    31.97 ± 0.01 |                  |                  |                  |
| openai/gpt-oss-120b |  pp2048 @ d8192 |  3326.12 ± 4.23 |   3157.36 ± 3.92 |   3078.67 ± 3.92 |   3257.76 ± 3.16 |
| openai/gpt-oss-120b |    tg32 @ d8192 |    30.55 ± 0.04 |                  |                  |                  |
| openai/gpt-oss-120b | pp2048 @ d16384 |  2777.63 ± 4.83 |  6714.59 ± 11.54 |  6635.90 ± 11.54 |  6822.92 ± 11.45 |
| openai/gpt-oss-120b |   tg32 @ d16384 |    28.36 ± 0.01 |                  |                  |                  |
| openai/gpt-oss-120b | pp2048 @ d32768 |  2106.94 ± 5.63 | 16603.22 ± 44.10 | 16524.53 ± 44.10 | 16727.80 ± 45.12 |
| openai/gpt-oss-120b |   tg32 @ d32768 |    24.71 ± 0.05 |                  |                  |                  |

TensorRT-LLM 1.2.0rc6

| model               |            test |             t/s |      ttfr (ms) |   est_ppt (ms) |    e2e_ttft (ms) |
|:--------------------|----------------:|----------------:|---------------:|---------------:|-----------------:|
| openai/gpt-oss-120b |          pp2048 | 5695.77 ± 26.03 |  412.05 ± 1.73 |  359.84 ± 1.73 |  1115.66 ± 59.41 |
| openai/gpt-oss-120b |            tg32 |    26.11 ± 0.95 |                |                |                  |
| openai/gpt-oss-120b |  pp2048 @ d8192 |  7021.73 ± 0.00 | 1510.69 ± 0.00 | 1458.47 ± 0.00 |   2481.45 ± 0.00 |
| openai/gpt-oss-120b |    tg32 @ d8192 |    22.05 ± 0.00 |                |                |                  |
| openai/gpt-oss-120b | pp2048 @ d16384 | 6015.25 ± 10.12 | 3116.55 ± 5.12 | 3064.33 ± 5.12 | 3915.91 ± 139.12 |
| openai/gpt-oss-120b |   tg32 @ d16384 |    23.62 ± 1.44 |                |                |                  |
| openai/gpt-oss-120b | pp2048 @ d32768 |  4494.18 ± 0.00 | 7799.34 ± 0.00 | 7747.13 ± 0.00 |   8695.32 ± 0.00 |
| openai/gpt-oss-120b |   tg32 @ d32768 |    21.80 ± 0.00 |                |                |                  |

So it looks like the boost on prefill is real. Those generation speeds, though…

1 Like

I’m not sure if you’re strictly talking about the recent announcement/livestream, but NVIDIA hosts the model nvidia/Qwen3-14B-FP4, which is NVFP4. It’s much faster than vanilla Qwen3-14B.
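
If you want to give it a spin, it serves like any other checkpoint; in vLLM something along these lines should work, assuming your build has NVFP4 (ModelOpt) support so the quantization is picked up from the checkpoint config (the port and context length are arbitrary):

vllm serve nvidia/Qwen3-14B-FP4 --host 0.0.0.0 --port 8355 --max-model-len 32768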

Is it faster than Qwen3-14B-AWQ, though? Of course a 4-bit quant will be faster than the full fp16 model.

Yeah, nice prefill speeds, but generation speeds are bad even in vLLM; they should be in the mid-to-high 50s at low context…

Did you run their Docker container or build it from source?

Their container. I didn’t feel like putting myself through whatever torture of env vars was required to get it built 🤣

1 Like

Right, that’s the misleading part, I think. Being 2.5x faster than a full-weight model with a 4-bit quant is a ridiculously low bar, particularly given the Spark’s limitations. The real comparisons are against other quants and inference providers, and it loses there.

I tried the NVFP4 quant of Flux2 dev in ComfyUI and it was twice as slow as the fp8 version, which was disappointing.

If anyone is a ComfyUI guru maybe you could critique my launch parameters:

python3 main.py \
  --listen 0.0.0.0 \
  --disable-mmap \
  --use-sage-attention \
  --supports-fp8-compute \
  --gpu-only \
  --cache-none \
  --fp16-unet \
  --fp16-vae \
  --fp16-text-enc \
  --disable-pinned-memory

Total time for a 2MP generation (20 steps) is consistently 260s. EDIT: that figure included a reference image as conditioning input; plain text-to-image is a lot faster, at 89s. By comparison, the same prompt and workflow using NVFP4 took 198s.

Which is fine for my purposes, but I’m somewhat regretting having sold the 5090 I bought at GTC to a friend in order to buy my DGX Spark instead, especially since it only cost $2,100 tax included (albeit that was a reward for standing in a long line in the dark).

And here are the results of the German Jury.

###  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc7
###  latest

llama-benchy --base-url http://localhost:8355/v1 --model nvidia/Llama-3.1-8B-Instruct-FP4 --depth 0 4096 8192 16384 32768 --latency-mode generation

| model                            |            test |               t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------|----------------:|------------------:|----------------:|----------------:|----------------:|
| nvidia/Llama-3.1-8B-Instruct-FP4 |          pp2048 | 14897.91 ± 357.73 |   185.95 ± 3.34 |   137.64 ± 3.34 |   186.06 ± 3.36 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |            tg32 |      21.13 ± 0.04 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d4096 |   6660.20 ± 28.44 |   971.03 ± 3.93 |   922.71 ± 3.93 |   971.10 ± 3.94 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d4096 |      21.06 ± 0.03 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d8192 |   9598.51 ± 41.80 |  1115.24 ± 4.61 |  1066.92 ± 4.61 |  1115.28 ± 4.61 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d8192 |      21.01 ± 0.03 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d16384 |   7951.91 ± 13.73 |  2366.38 ± 4.01 |  2318.07 ± 4.01 |  2366.43 ± 4.01 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d16384 |      20.48 ± 0.00 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d32768 |   6032.90 ± 20.29 | 5819.51 ± 19.50 | 5771.20 ± 19.50 | 5819.56 ± 19.50 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d32768 |      18.88 ± 0.01 |                 |                 |                 |

llama-benchy (0.1.1)
date: 2026-01-10 14:18:13 | latency mode: generation


###  llama.cpp -- build: 7542 (af3be131c) with GNU 13.3.0 for Linux aarch64
###  using Q4_K_M

llama-benchy --base-url http://localhost:8102/v1 --model bartowski/Meta-Llama-3.1-8B-Instruct-GGUF  --depth 0 4096 8192 16384 32768 --latency-mode generation

| model                                     |            test |              t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:------------------------------------------|----------------:|-----------------:|-------------------:|-------------------:|-------------------:|
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |          pp2048 | 3057.26 ± 132.60 |     642.70 ± 31.43 |     610.61 ± 31.43 |     642.73 ± 31.43 |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |            tg32 |     38.84 ± 1.37 |                    |                    |                    |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |  pp2048 @ d4096 | 2226.69 ± 214.98 |   2497.00 ± 273.53 |   2464.91 ± 273.53 |   2497.03 ± 273.53 |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |    tg32 @ d4096 |     31.70 ± 1.60 |                    |                    |                    |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |  pp2048 @ d8192 |  1653.04 ± 95.77 |   5531.51 ± 354.28 |   5499.42 ± 354.28 |   5531.54 ± 354.28 |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |    tg32 @ d8192 |     25.32 ± 1.09 |                    |                    |                    |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF | pp2048 @ d16384 | 1236.16 ± 102.74 | 13333.55 ± 1100.42 | 13301.46 ± 1100.42 | 13333.59 ± 1100.42 |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |   tg32 @ d16384 |     19.52 ± 1.26 |                    |                    |                    |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF | pp2048 @ d32768 |  842.02 ± 103.26 | 37106.00 ± 4553.80 | 37073.91 ± 4553.80 | 37106.02 ± 4553.80 |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |   tg32 @ d32768 |     13.24 ± 1.15 |                    |                    |                    |

llama-benchy (0.1.1)
date: 2026-01-10 16:56:28 | latency mode: generation


###   vLLM API server version 0.14.0rc1.dev135+gc3666f56f -- eugr docker recipe

llama-benchy --base-url http://localhost:8102/v1 --model stelterlab/Llama-3.1-8B-Instruct-AWQ --depth 0 4096 8192 16384 32768 --latency-mode generation

| model                                |            test |             t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:-------------------------------------|----------------:|----------------:|-----------------:|-----------------:|-----------------:|
| stelterlab/Llama-3.1-8B-Instruct-AWQ |          pp2048 | 4064.87 ± 64.52 |    543.23 ± 8.09 |    504.20 ± 8.09 |    543.30 ± 8.11 |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |            tg32 |    40.31 ± 2.87 |                  |                  |                  |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |  pp2048 @ d4096 | 3807.67 ± 10.21 |   1652.80 ± 4.29 |   1613.77 ± 4.29 |   1652.86 ± 4.30 |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |    tg32 @ d4096 |    38.80 ± 0.04 |                  |                  |                  |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |  pp2048 @ d8192 |  3604.85 ± 4.98 |   2879.93 ± 3.93 |   2840.90 ± 3.93 |   2880.00 ± 3.93 |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |    tg32 @ d8192 |    35.74 ± 0.12 |                  |                  |                  |
| stelterlab/Llama-3.1-8B-Instruct-AWQ | pp2048 @ d16384 | 3212.47 ± 47.93 |  5778.48 ± 86.39 |  5739.45 ± 86.39 |  5778.55 ± 86.38 |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |   tg32 @ d16384 |    29.60 ± 1.56 |                  |                  |                  |
| stelterlab/Llama-3.1-8B-Instruct-AWQ | pp2048 @ d32768 |  2690.97 ± 3.25 | 12977.51 ± 15.92 | 12938.48 ± 15.92 | 12977.57 ± 15.91 |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |   tg32 @ d32768 |    24.02 ± 0.04 |                  |                  |                  |

llama-benchy (0.1.1)
date: 2026-01-10 17:11:49 | latency mode: generation

I also tried to run nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4, but that failed with:

[Autotuner] Failed when profiling runner=<tensorrt_llm._torch.custom_ops.torch_custom_ops.MoERunner object at 0xfd52e065d940>, tactic=6, shapes=[torch.Size([1, 1024]), torch.Size([512, 1024, 128]), torch.Size([0]), torch.Size([512, 2048, 32]), torch.Size([0])]. Error: [TensorRT-LLM][ERROR] Assertion failed: Failed to initialize cutlass TMA WS grouped gemm. Error: Error Internal (tensorrt_llm/kernels/cutlass_kernels/cutlass_instantiations/gemm_grouped/120/cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:39

I won’t invest more time in TRT-LLM for the moment, as it is still disappointing on GB10.

I just read an interesting article, “Jensen Steps In”, from the estimable Business Insider. It claims that the NVIDIA CEO noticed the disquiet among Spark users after launch and intervened to get attention placed on improving software support. According to the article, it was complaints from well-placed customers that did it! So if any of you are well-placed, please complain!

3 Likes

Thanks for testing! Yes, the generation speeds are disappointing, but those prompt processing speeds on the other hand… At least we know it can be faster!