NVIDIA folks -- where is this promised NVFP4 speedup?

Today, NVIDIA has announced that with NVFP4 support, the DGX Spark delivers up to a 2.5x boost in the Qwen 235B model (two DGX Sparks paired).

A boost compared to what, running on a CPU? Early adopters have been patiently waiting for proper software support, which makes this marketing spin even harder to read; the campaign probably cost more than a few developers dedicated to optimizing the stack.

Others on the forum – am I wrong here?

3 Likes

The graph shows the boost for TRT-LLM for Qwen 235B on two Sparks.

Today, NVIDIA has announced that with NVFP4 support, the DGX Spark delivers up to a 2.5x boost in the Qwen 235B model (two DGX Sparks paired).

So maybe it’s time to test the latest TensorRT-LLM version again.
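
If anyone wants to try without building from source, the release containers on NGC can be pulled directly; for example (tag taken from the rc builds discussed below, and assuming you’re already logged in to nvcr.io):

docker pull nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc7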


1 Like

I’m going to do the same

I wonder why they didn’t just post the performance numbers.
I wouldn’t be surprised if they just achieved the same performance we can already get in vLLM.

If anyone tries the newest TRT-LLM before me, please post the benchmarks, preferably using llama-benchy.
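
For reference, the invocation would look roughly like this, pointed at whatever OpenAI-compatible endpoint you’re serving (the port and model id here are only placeholders):

uv run llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Llama-3.1-8B-Instruct-FP4 \
  --depth 0 4096 8192 16384 32768 \
  --latency-mode generation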

Just ran two models on it. I can bench it, but I can tell right away it’s slower than vLLM. The “2.5x” must be relative to the full-weight model on the same stack?

1 Like

Well, as expected… No wonder they pulled a tactic from Apple’s marketing playbook and didn’t give any specific performance numbers.

Well, I ran llama-benchy and it went like this:

uv run llama-benchy   --base-url http://localhost:8355/v1   --model openai/gpt-oss-120b   --depth 0 4096 8192 16384 32768   --latency-mode generation
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.1.1)
Date: 2026-01-09 14:47:15
Benchmarking model: openai/gpt-oss-120b at http://localhost:8355/v1
Loading text from cache: /home/joseph/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 140865
Warming up...
Warmup (User only) complete (no usage stats found).
Warmup (System+Empty) complete (no usage stats found).
Measuring latency using mode: generation...
Average latency (generation): 121.13 ms
Running test: pp=2048, tg=32, depth=0
Error: 400 - {"object":"error","message":"error downloading or loading vocab file: failed to download or load vocab file","type":"internal_error","param":null,"code":400}
Error: 400 - {"object":"error","message":"error downloading or loading vocab file: failed to download or load vocab file","type":"internal_error","param":null,"code":400}
Error: 400 - {"object":"error","message":"error downloading or loading vocab file: failed to download or load vocab file","type":"internal_error","param":null,"code":400}

And meanwhile TensorRT-LLM (following the playbook’s arguments verbatim, except for increasing the context length to 64K) said this:

[01/09/2026-20:45:49] [TRT-LLM] [I] get signal from executor worker

INFO:     Started server process [150]

INFO:     Waiting for application startup.

INFO:     Application startup complete.

[01/09/2026-20:47:17] [TRT-LLM] [E] Error in harmony chat completion: %s error downloading or loading vocab file: failed to download or load vocab file

INFO:     127.0.0.1:51222 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

[01/09/2026-20:47:18] [TRT-LLM] [E] Error in harmony chat completion: %s error downloading or loading vocab file: failed to download or load vocab file

INFO:     127.0.0.1:51222 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

[01/09/2026-20:47:18] [TRT-LLM] [E] Error in harmony chat completion: %s error downloading or loading vocab file: failed to download or load vocab file

INFO:     127.0.0.1:51222 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

It won’t serve tokens in Open WebUI either. It looks like some kind of template configuration problem, but it should have pulled everything it needed from either HF or the container.
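
For context, the serve command was essentially the playbook’s trtllm-serve invocation with the sequence length bumped to 64K, roughly along these lines (reproduced from memory, so treat the exact flag names as approximate):

trtllm-serve openai/gpt-oss-120b \
  --backend pytorch \
  --host 0.0.0.0 --port 8355 \
  --max_seq_len 65536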

I was able to get Llama 3 8B running on 1.2.0rc6 earlier today. I’ll try to benchmark that next.

1 Like

Do you have a TIKTOKEN_ENCODINGS_BASE path set?
In my Docker image those encodings are baked in at build time, but gpt-oss-120b needs them to work:

# fetch the tiktoken encoding files gpt-oss needs, then point the server at them
mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
export TIKTOKEN_ENCODINGS_BASE=${PWD}/tiktoken_encodings
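
If you’re running the NGC container rather than my image, the directory and the variable both need to be visible inside the container; a minimal sketch (paths are illustrative):

docker run --rm --gpus all \
  -v ${PWD}/tiktoken_encodings:/workspace/tiktoken_encodings \
  -e TIKTOKEN_ENCODINGS_BASE=/workspace/tiktoken_encodings \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc7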

I used the script in the playbook, which looks like it pulls them. I’ll check again. Meanwhile, I benched Llama-3.1-8B-Instruct-FP4; here are the results:

@eugr docker vllm

| model                            |            test |               t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------|----------------:|------------------:|----------------:|----------------:|----------------:|
| nvidia/Llama-3.1-8B-Instruct-FP4 |          pp2048 | 10015.95 ± 649.24 |  248.83 ± 12.80 |  205.40 ± 12.80 |  248.88 ± 12.79 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |            tg32 |      36.70 ± 0.17 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d4096 |  8433.03 ± 163.57 |  772.31 ± 14.39 |  728.88 ± 14.39 |  772.38 ± 14.40 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d4096 |      33.89 ± 0.11 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d8192 |   7353.33 ± 85.36 | 1436.27 ± 16.11 | 1392.84 ± 16.11 | 1436.33 ± 16.10 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d8192 |      30.49 ± 1.49 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d16384 |    5952.05 ± 3.48 |  3140.40 ± 1.89 |  3096.97 ± 1.89 |  3140.48 ± 1.88 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d16384 |      27.57 ± 0.05 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d32768 |    4346.65 ± 9.40 | 8053.53 ± 17.32 | 8010.11 ± 17.32 | 8053.60 ± 17.32 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d32768 |      21.89 ± 0.09 |                 |                 |                 |

tensorrt-llm:1.2.0rc6

| model                            |            test |               t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------|----------------:|------------------:|----------------:|----------------:|----------------:|
| nvidia/Llama-3.1-8B-Instruct-FP4 |          pp2048 | 14407.57 ± 430.30 |   185.58 ± 4.34 |   142.35 ± 4.34 |   185.68 ± 4.37 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |            tg32 |      20.75 ± 0.01 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d4096 |   6513.93 ± 32.10 |   986.67 ± 4.72 |   943.44 ± 4.72 |   986.72 ± 4.70 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d4096 |      20.65 ± 0.01 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d8192 |   9361.80 ± 11.61 |  1137.18 ± 1.31 |  1093.95 ± 1.31 |  1137.22 ± 1.31 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d8192 |      20.58 ± 0.04 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d16384 |   7755.59 ± 13.52 |  2419.93 ± 4.09 |  2376.70 ± 4.09 |  2419.98 ± 4.10 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d16384 |      20.00 ± 0.03 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d32768 |   5746.15 ± 43.11 | 6102.76 ± 45.50 | 6059.53 ± 45.50 | 6102.81 ± 45.50 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d32768 |      18.45 ± 0.04 |                 |                 |                 |

So it wins prompt processing against vLLM at some depths and loses at others. It’s definitely strange that speed goes up from d4096 to d8192; I wasn’t doing anything else on the system at the time.

Token generation is abysmal across the board relative to vLLM. I’m not sure what the advantage is, and I definitely don’t know what the 2.5x comparison is to.

Will look at gpt-oss script again and see if I can bench it.

Yup, I hadn’t set the TIKTOKEN_ENCODINGS_BASE source. Duh. Anyway, last update on this, with GPT-OSS-120B:

@eugr vllm

| model               |            test |             t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:--------------------|----------------:|----------------:|-----------------:|-----------------:|-----------------:|
| openai/gpt-oss-120b |          pp2048 | 4585.37 ± 11.38 |    525.33 ± 1.11 |    446.64 ± 1.11 |    617.47 ± 0.92 |
| openai/gpt-oss-120b |            tg32 |    33.53 ± 0.05 |                  |                  |                  |
| openai/gpt-oss-120b |  pp2048 @ d4096 | 3731.46 ± 17.66 |   1725.27 ± 7.77 |   1646.58 ± 7.77 |   1822.66 ± 7.08 |
| openai/gpt-oss-120b |    tg32 @ d4096 |    31.97 ± 0.01 |                  |                  |                  |
| openai/gpt-oss-120b |  pp2048 @ d8192 |  3326.12 ± 4.23 |   3157.36 ± 3.92 |   3078.67 ± 3.92 |   3257.76 ± 3.16 |
| openai/gpt-oss-120b |    tg32 @ d8192 |    30.55 ± 0.04 |                  |                  |                  |
| openai/gpt-oss-120b | pp2048 @ d16384 |  2777.63 ± 4.83 |  6714.59 ± 11.54 |  6635.90 ± 11.54 |  6822.92 ± 11.45 |
| openai/gpt-oss-120b |   tg32 @ d16384 |    28.36 ± 0.01 |                  |                  |                  |
| openai/gpt-oss-120b | pp2048 @ d32768 |  2106.94 ± 5.63 | 16603.22 ± 44.10 | 16524.53 ± 44.10 | 16727.80 ± 45.12 |
| openai/gpt-oss-120b |   tg32 @ d32768 |    24.71 ± 0.05 |                  |                  |                  |

TensorRT-LLM 1.2.0rc6

| model               |            test |             t/s |      ttfr (ms) |   est_ppt (ms) |    e2e_ttft (ms) |
|:--------------------|----------------:|----------------:|---------------:|---------------:|-----------------:|
| openai/gpt-oss-120b |          pp2048 | 5695.77 ± 26.03 |  412.05 ± 1.73 |  359.84 ± 1.73 |  1115.66 ± 59.41 |
| openai/gpt-oss-120b |            tg32 |    26.11 ± 0.95 |                |                |                  |
| openai/gpt-oss-120b |  pp2048 @ d8192 |  7021.73 ± 0.00 | 1510.69 ± 0.00 | 1458.47 ± 0.00 |   2481.45 ± 0.00 |
| openai/gpt-oss-120b |    tg32 @ d8192 |    22.05 ± 0.00 |                |                |                  |
| openai/gpt-oss-120b | pp2048 @ d16384 | 6015.25 ± 10.12 | 3116.55 ± 5.12 | 3064.33 ± 5.12 | 3915.91 ± 139.12 |
| openai/gpt-oss-120b |   tg32 @ d16384 |    23.62 ± 1.44 |                |                |                  |
| openai/gpt-oss-120b | pp2048 @ d32768 |  4494.18 ± 0.00 | 7799.34 ± 0.00 | 7747.13 ± 0.00 |   8695.32 ± 0.00 |
| openai/gpt-oss-120b |   tg32 @ d32768 |    21.80 ± 0.00 |                |                |                  |

So it looks like the boost on prefill is real. Those generation speeds, though…

1 Like

I’m not sure if you’re strictly talking about the recent announcement/livestream, but NVIDIA hosts the model nvidia/Qwen3-14B-FP4, which is NVFP4. It’s much faster than vanilla Qwen3-14B.
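
If you want to give it a spin, it serves like any other checkpoint; in vLLM something along these lines should work, assuming your build has NVFP4 (ModelOpt) support so the quantization is picked up from the checkpoint config (the port and context length are arbitrary):

vllm serve nvidia/Qwen3-14B-FP4 --host 0.0.0.0 --port 8355 --max-model-len 32768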

Is it faster than Qwen3-14B-AWQ, though? Of course a 4-bit quant will be faster than the full fp16 model.

Yeah, nice prefill speeds, but generation speeds are bad even in vLLM; they should be in the mid-to-high 50s at low context…

Did you run their Docker container or build it from source?

Their container. I didn’t feel like putting myself through whatever torture of env vars was required to get it built 🤣

1 Like

Right, that’s the misleading part, I think. Being 2.5x faster than a full-weight model with a 4-bit quant is a ridiculously low bar, particularly given the Spark’s limitations. The real comparisons are against other quants and inference providers, and it loses there.

I tried the NVFP4 quant of Flux2 dev in ComfyUI and it was twice as slow as the fp8 version, which was disappointing.

If anyone is a ComfyUI guru maybe you could critique my launch parameters:

python3 main.py \
  --listen 0.0.0.0 \
  --disable-mmap \
  --use-sage-attention \
  --supports-fp8-compute \
  --gpu-only \
  --cache-none \
  --fp16-unet \
  --fp16-vae \
  --fp16-text-enc \
  --disable-pinned-memory

Total time for a 2MP generation (20 steps) is consistently 260s. EDIT: that figure included a reference image as conditioning input; plain text-to-image is a lot faster, at 89s. By comparison, the same prompt and workflow using NVFP4 took 198s.

Which is fine for my purposes, but I’m somewhat regretting having sold the 5090 I bought at GTC to a friend in order to buy my DGX Spark instead, especially since it only cost $2,100 tax included (albeit that was a reward for standing in a long line in the dark).

And here are the results of the German Jury.

###  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc7
###  latest

llama-benchy --base-url http://localhost:8355/v1 --model nvidia/Llama-3.1-8B-Instruct-FP4 --depth 0 4096 8192 16384 32768 --latency-mode generation

| model                            |            test |               t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------|----------------:|------------------:|----------------:|----------------:|----------------:|
| nvidia/Llama-3.1-8B-Instruct-FP4 |          pp2048 | 14897.91 ± 357.73 |   185.95 ± 3.34 |   137.64 ± 3.34 |   186.06 ± 3.36 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |            tg32 |      21.13 ± 0.04 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d4096 |   6660.20 ± 28.44 |   971.03 ± 3.93 |   922.71 ± 3.93 |   971.10 ± 3.94 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d4096 |      21.06 ± 0.03 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |  pp2048 @ d8192 |   9598.51 ± 41.80 |  1115.24 ± 4.61 |  1066.92 ± 4.61 |  1115.28 ± 4.61 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |    tg32 @ d8192 |      21.01 ± 0.03 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d16384 |   7951.91 ± 13.73 |  2366.38 ± 4.01 |  2318.07 ± 4.01 |  2366.43 ± 4.01 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d16384 |      20.48 ± 0.00 |                 |                 |                 |
| nvidia/Llama-3.1-8B-Instruct-FP4 | pp2048 @ d32768 |   6032.90 ± 20.29 | 5819.51 ± 19.50 | 5771.20 ± 19.50 | 5819.56 ± 19.50 |
| nvidia/Llama-3.1-8B-Instruct-FP4 |   tg32 @ d32768 |      18.88 ± 0.01 |                 |                 |                 |

llama-benchy (0.1.1)
date: 2026-01-10 14:18:13 | latency mode: generation


###  llama.cpp -- build: 7542 (af3be131c) with GNU 13.3.0 for Linux aarch64
###  using Q4_K_M

llama-benchy --base-url http://localhost:8102/v1 --model bartowski/Meta-Llama-3.1-8B-Instruct-GGUF  --depth 0 4096 8192 16384 32768 --latency-mode generation

| model                                     |            test |              t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:------------------------------------------|----------------:|-----------------:|-------------------:|-------------------:|-------------------:|
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |          pp2048 | 3057.26 ± 132.60 |     642.70 ± 31.43 |     610.61 ± 31.43 |     642.73 ± 31.43 |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |            tg32 |     38.84 ± 1.37 |                    |                    |                    |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |  pp2048 @ d4096 | 2226.69 ± 214.98 |   2497.00 ± 273.53 |   2464.91 ± 273.53 |   2497.03 ± 273.53 |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |    tg32 @ d4096 |     31.70 ± 1.60 |                    |                    |                    |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |  pp2048 @ d8192 |  1653.04 ± 95.77 |   5531.51 ± 354.28 |   5499.42 ± 354.28 |   5531.54 ± 354.28 |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |    tg32 @ d8192 |     25.32 ± 1.09 |                    |                    |                    |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF | pp2048 @ d16384 | 1236.16 ± 102.74 | 13333.55 ± 1100.42 | 13301.46 ± 1100.42 | 13333.59 ± 1100.42 |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |   tg32 @ d16384 |     19.52 ± 1.26 |                    |                    |                    |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF | pp2048 @ d32768 |  842.02 ± 103.26 | 37106.00 ± 4553.80 | 37073.91 ± 4553.80 | 37106.02 ± 4553.80 |
| bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |   tg32 @ d32768 |     13.24 ± 1.15 |                    |                    |                    |

llama-benchy (0.1.1)
date: 2026-01-10 16:56:28 | latency mode: generation


###   vLLM API server version 0.14.0rc1.dev135+gc3666f56f -- eugr docker recipe

llama-benchy --base-url http://localhost:8102/v1 --model stelterlab/Llama-3.1-8B-Instruct-AWQ --depth 0 4096 8192 16384 32768 --latency-mode generation

| model                                |            test |             t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:-------------------------------------|----------------:|----------------:|-----------------:|-----------------:|-----------------:|
| stelterlab/Llama-3.1-8B-Instruct-AWQ |          pp2048 | 4064.87 ± 64.52 |    543.23 ± 8.09 |    504.20 ± 8.09 |    543.30 ± 8.11 |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |            tg32 |    40.31 ± 2.87 |                  |                  |                  |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |  pp2048 @ d4096 | 3807.67 ± 10.21 |   1652.80 ± 4.29 |   1613.77 ± 4.29 |   1652.86 ± 4.30 |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |    tg32 @ d4096 |    38.80 ± 0.04 |                  |                  |                  |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |  pp2048 @ d8192 |  3604.85 ± 4.98 |   2879.93 ± 3.93 |   2840.90 ± 3.93 |   2880.00 ± 3.93 |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |    tg32 @ d8192 |    35.74 ± 0.12 |                  |                  |                  |
| stelterlab/Llama-3.1-8B-Instruct-AWQ | pp2048 @ d16384 | 3212.47 ± 47.93 |  5778.48 ± 86.39 |  5739.45 ± 86.39 |  5778.55 ± 86.38 |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |   tg32 @ d16384 |    29.60 ± 1.56 |                  |                  |                  |
| stelterlab/Llama-3.1-8B-Instruct-AWQ | pp2048 @ d32768 |  2690.97 ± 3.25 | 12977.51 ± 15.92 | 12938.48 ± 15.92 | 12977.57 ± 15.91 |
| stelterlab/Llama-3.1-8B-Instruct-AWQ |   tg32 @ d32768 |    24.02 ± 0.04 |                  |                  |                  |

llama-benchy (0.1.1)
date: 2026-01-10 17:11:49 | latency mode: generation

I also tried to run nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4, but that failed with:

[Autotuner] Failed when profiling runner=<tensorrt_llm._torch.custom_ops.torch_custom_ops.MoERunner object at 0xfd52e065d940>, tactic=6, shapes=[torch.Size([1, 1024]), torch.Size([512, 1024, 128]), torch.Size([0]), torch.Size([512, 2048, 32]), torch.Size([0])]. Error: [TensorRT-LLM][ERROR] Assertion failed: Failed to initialize cutlass TMA WS grouped gemm. Error: Error Internal (tensorrt_llm/kernels/cutlass_kernels/cutlass_instantiations/gemm_grouped/120/cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:39

I won’t invest more time in TRT-LLM for the moment, as it is still disappointing on GB10.

I just read an interesting article, “Jensen Steps In”, from the estimable Business Insider. It claims that the NVIDIA CEO noticed the disquiet among Spark users after launch and intervened to get attention placed on improving software support. According to the article, it was complaints from well-placed customers that did it! So if any of you are well-placed, please complain!

3 Likes

Thanks for testing! Yes, the generation speeds are disappointing, but those prompt processing speeds on the other hand… At least we know it can be faster!