Benchmark Report: Qwen3.6-35B-A3B-NVFP4 on NVIDIA DGX Spark, Jetson Thor, Blackwell 6000 Pro

Hello,

I benchmarked the nvidia/Qwen3.6-35B-A3B-NVFP4 model with vLLM on three NVIDIA platforms: Jetson Thor, DGX Spark, and Blackwell 6000 Pro. All tests used the same vLLM configuration: NVFP4 quantization, the FlashInfer attention backend, the Marlin MoE backend, and MTP speculative decoding.

I installed the nightly release of vLLM using the following command:

uv pip install -U vllm \
  --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly

I tested three workloads by varying the input/output lengths (in tokens):

  • Prompt-heavy: 8000 input / 1000 output
  • Decode-heavy: 1000 input / 8000 output
  • Balanced: 1000 input / 1000 output
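
For reference, the three workloads differ only in the two length flags of the `vllm bench serve` command that appears later in this post. The following is a minimal sketch that prints the three invocations; the flag names and values are taken from that command, while the `WORKLOADS` table and the `bench_command` helper are hypothetical names introduced here for illustration.

```python
import shlex

MODEL = "nvidia/Qwen3.6-35B-A3B-NVFP4"

# (input tokens, output tokens) per workload, as listed above.
WORKLOADS = {
    "prompt-heavy": (8000, 1000),
    "decode-heavy": (1000, 8000),
    "balanced": (1000, 1000),
}


def bench_command(input_len, output_len, num_prompts=16, request_rate=10000):
    """Return the argv list for one `vllm bench serve` run (hypothetical helper)."""
    return [
        "vllm", "bench", "serve",
        "--model", MODEL,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--request-rate", str(request_rate),
        "--num-prompts", str(num_prompts),
        "--ignore-eos",
    ]


for name, (input_len, output_len) in WORKLOADS.items():
    # Printed only; nothing is executed here.
    print(f"# {name}")
    print(shlex.join(bench_command(input_len, output_len)))
```

To actually run a workload, the returned list could be passed to `subprocess.run` once the server is up.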

I launched the server with the same command on every machine:

vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --dtype auto \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --moe-backend marlin \
  --gpu-memory-utilization 0.85 \
  --max-model-len 65536 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill \
  --async-scheduling \
  --enable-prefix-caching \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'

NVIDIA DGX Spark

1. Prompt-heavy

vllm bench serve \
  --model nvidia/Qwen3.6-35B-A3B-NVFP4 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos

Output:

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  93.22     
Total input tokens:                      128000    
Total generated tokens:                  16000     
Request throughput (req/s):              0.17      
Output token throughput (tok/s):         171.64    
Peak output token throughput (tok/s):    92.00     
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1544.75   
---------------Time to First Token----------------
Mean TTFT (ms):                          42235.75  
Median TTFT (ms):                        42243.32  
P99 TTFT (ms):                           76218.08  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.76     
Median TPOT (ms):                        18.71     
P99 TPOT (ms):                           26.40     
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.44     
Median ITL (ms):                         48.03     
P99 ITL (ms):                            621.62    
---------------Speculative Decoding---------------
Acceptance rate (%):                     68.81     
Acceptance length:                       3.06      
Drafts:                                  5221      
Draft tokens:                            15663     
Accepted tokens:                         10778     
Per-position acceptance (%):
  Position 0:                            80.21     
  Position 1:                            67.96     
  Position 2:                            58.26     
==================================================
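
The speculative decoding figures in the block above can be reproduced from its three raw counters. The sketch below assumes that the acceptance rate is accepted tokens divided by draft tokens, and that the acceptance length is one plus the accepted tokens per draft. These formulas are inferred from the reported numbers, not taken from the benchmark source.

```python
# Counters copied from the DGX Spark prompt-heavy result block above.
drafts = 5221            # number of speculation rounds
draft_tokens = 15663     # tokens proposed across all rounds
accepted_tokens = 10778  # proposed tokens that were accepted

# Tokens proposed per round; matches num_speculative_tokens=3 in the server command.
tokens_per_draft = draft_tokens / drafts

# Share of proposed tokens that were accepted (assumed definition).
acceptance_rate = 100 * accepted_tokens / draft_tokens

# Tokens produced per round: the accepted draft tokens plus one token
# from the target model itself (assumed definition).
acceptance_length = 1 + accepted_tokens / drafts

# Mean ITL and mean TPOT from the same block, for comparison.
mean_itl_ms, mean_tpot_ms = 57.44, 18.76

print(f"tokens per draft:  {tokens_per_draft:.2f}")
print(f"acceptance rate:   {acceptance_rate:.2f} %")
print(f"acceptance length: {acceptance_length:.2f}")
print(f"ITL / TPOT:        {mean_itl_ms / mean_tpot_ms:.2f}")
```

Both derived values agree with the reported 68.81 % and 3.06. The ITL/TPOT ratio also comes out at about 3.06, and the same pattern appears in the other runs (for example 47.12 / 13.82 ≈ 3.41 in the decode-heavy run). That would be consistent with ITL being measured per streamed chunk of accepted tokens and TPOT per token, though this is an interpretation of the numbers only.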

2. Decode-heavy

vllm bench serve \
  --model nvidia/Qwen3.6-35B-A3B-NVFP4 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos

Output:

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  477.23    
Total input tokens:                      16000     
Total generated tokens:                  128000    
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         268.21    
Peak output token throughput (tok/s):    92.00     
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          301.74    
---------------Time to First Token----------------
Mean TTFT (ms):                          168075.33 
Median TTFT (ms):                        166140.82 
P99 TTFT (ms):                           358607.57 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.82     
Median TPOT (ms):                        13.86     
P99 TPOT (ms):                           17.26     
---------------Inter-token Latency----------------
Mean ITL (ms):                           47.12     
Median ITL (ms):                         47.73     
P99 ITL (ms):                            51.24     
---------------Speculative Decoding---------------
Acceptance rate (%):                     80.35     
Acceptance length:                       3.41      
Drafts:                                  37532     
Draft tokens:                            112596    
Accepted tokens:                         90470     
Per-position acceptance (%):
  Position 0:                            91.78     
  Position 1:                            80.40     
  Position 2:                            68.86     
==================================================

3. Balanced

vllm bench serve \
  --model nvidia/Qwen3.6-35B-A3B-NVFP4 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos

Output:

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  64.14     
Total input tokens:                      16000     
Total generated tokens:                  16000     
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         249.47    
Peak output token throughput (tok/s):    92.00     
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          498.94    
---------------Time to First Token----------------
Mean TTFT (ms):                          25433.94  
Median TTFT (ms):                        25838.86  
P99 TTFT (ms):                           53299.22  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.14     
Median TPOT (ms):                        15.13     
P99 TPOT (ms):                           18.88     
---------------Inter-token Latency----------------
Mean ITL (ms):                           47.63     
Median ITL (ms):                         47.01     
P99 ITL (ms):                            51.18     
---------------Speculative Decoding---------------
Acceptance rate (%):                     71.66     
Acceptance length:                       3.15      
Drafts:                                  5082      
Draft tokens:                            15246     
Accepted tokens:                         10926     
Per-position acceptance (%):
  Position 0:                            84.85     
  Position 1:                            71.88     
  Position 2:                            58.26     
==================================================

Blackwell 6000 Pro

1. Prompt-heavy

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  46.54     
Total input tokens:                      128000    
Total generated tokens:                  16000     
Request throughput (req/s):              0.34      
Output token throughput (tok/s):         343.81    
Peak output token throughput (tok/s):    316.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          3094.29   
---------------Time to First Token----------------
Mean TTFT (ms):                          29991.70  
Median TTFT (ms):                        30555.34  
P99 TTFT (ms):                           41558.57  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.82      
Median TPOT (ms):                        5.61      
P99 TPOT (ms):                           11.64     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.68     
Median ITL (ms):                         12.72     
P99 ITL (ms):                            143.13    
---------------Speculative Decoding---------------
Acceptance rate (%):                     56.50     
Acceptance length:                       2.69      
Drafts:                                  5936      
Draft tokens:                            17808     
Accepted tokens:                         10061     
Per-position acceptance (%):
  Position 0:                            72.29     
  Position 1:                            53.61     
  Position 2:                            43.60     
==================================================

2. Decode-heavy

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  121.59    
Total input tokens:                      16000     
Total generated tokens:                  128000    
Request throughput (req/s):              0.13      
Output token throughput (tok/s):         1052.68   
Peak output token throughput (tok/s):    324.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1184.26   
---------------Time to First Token----------------
Mean TTFT (ms):                          45762.40  
Median TTFT (ms):                        44012.42  
P99 TTFT (ms):                           95360.31  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.64      
Median TPOT (ms):                        3.65      
P99 TPOT (ms):                           4.04      
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.78     
Median ITL (ms):                         12.73     
P99 ITL (ms):                            13.69     
---------------Speculative Decoding---------------
Acceptance rate (%):                     83.87     
Acceptance length:                       3.52      
Drafts:                                  36407     
Draft tokens:                            109221    
Accepted tokens:                         91599     
Per-position acceptance (%):
  Position 0:                            92.03     
  Position 1:                            83.93     
  Position 2:                            75.64     
==================================================

3. Balanced

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  19.57     
Total input tokens:                      16000     
Total generated tokens:                  16000     
Request throughput (req/s):              0.82      
Output token throughput (tok/s):         817.52    
Peak output token throughput (tok/s):    336.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1635.03   
---------------Time to First Token----------------
Mean TTFT (ms):                          7058.61   
Median TTFT (ms):                        7091.46   
P99 TTFT (ms):                           14507.52  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.53      
Median TPOT (ms):                        4.60      
P99 TPOT (ms):                           5.25      
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.54     
Median ITL (ms):                         12.32     
P99 ITL (ms):                            13.32     
---------------Speculative Decoding---------------
Acceptance rate (%):                     59.03     
Acceptance length:                       2.77      
Drafts:                                  5771      
Draft tokens:                            17313     
Accepted tokens:                         10220     
Per-position acceptance (%):
  Position 0:                            77.09     
  Position 1:                            58.10     
  Position 2:                            41.90     
==================================================

NVIDIA Jetson Thor

1. Prompt-heavy

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  128.79    
Total input tokens:                      128000    
Total generated tokens:                  16000     
Request throughput (req/s):              0.12      
Output token throughput (tok/s):         124.23    
Peak output token throughput (tok/s):    72.00     
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1118.09   
---------------Time to First Token----------------
Mean TTFT (ms):                          63249.67  
Median TTFT (ms):                        63530.00  
P99 TTFT (ms):                           111706.64 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.94     
Median TPOT (ms):                        27.59     
P99 TPOT (ms):                           34.78     
---------------Inter-token Latency----------------
Mean ITL (ms):                           79.03     
Median ITL (ms):                         59.89     
P99 ITL (ms):                            1141.24   
---------------Speculative Decoding---------------
Acceptance rate (%):                     72.40     
Acceptance length:                       3.17      
Drafts:                                  5045      
Draft tokens:                            15135     
Accepted tokens:                         10958     
Per-position acceptance (%):
  Position 0:                            82.99     
  Position 1:                            72.53     
  Position 2:                            61.68     
==================================================

2. Decode-heavy

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  535.42    
Total input tokens:                      16000     
Total generated tokens:                  128000    
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         239.06    
Peak output token throughput (tok/s):    76.00     
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          268.95    
---------------Time to First Token----------------
Mean TTFT (ms):                          200813.77 
Median TTFT (ms):                        197900.86 
P99 TTFT (ms):                           404952.82 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.21     
Median TPOT (ms):                        16.13     
P99 TPOT (ms):                           19.33     
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.90     
Median ITL (ms):                         56.78     
P99 ITL (ms):                            62.42     
---------------Speculative Decoding---------------
Acceptance rate (%):                     83.73     
Acceptance length:                       3.51      
Drafts:                                  36451     
Draft tokens:                            109353    
Accepted tokens:                         91556     
Per-position acceptance (%):
  Position 0:                            93.40     
  Position 1:                            82.52     
  Position 2:                            75.25     
==================================================

3. Balanced

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  83.89     
Total input tokens:                      16000     
Total generated tokens:                  16000     
Request throughput (req/s):              0.19      
Output token throughput (tok/s):         190.73    
Peak output token throughput (tok/s):    84.00     
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          381.47    
---------------Time to First Token----------------
Mean TTFT (ms):                          30657.07  
Median TTFT (ms):                        30225.81  
P99 TTFT (ms):                           68555.21  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.38     
Median TPOT (ms):                        19.02     
P99 TPOT (ms):                           31.65     
---------------Inter-token Latency----------------
Mean ITL (ms):                           52.71     
Median ITL (ms):                         51.47     
P99 ITL (ms):                            56.62     
---------------Speculative Decoding---------------
Acceptance rate (%):                     57.46     
Acceptance length:                       2.72      
Drafts:                                  5876      
Draft tokens:                            17628     
Accepted tokens:                         10129     
Per-position acceptance (%):
  Position 0:                            71.15     
  Position 1:                            58.34     
  Position 2:                            42.89     
==================================================

Cool, thanks. It would be helpful to see how they compare to each other in a single chart or graph.

Oh yes, you are right. I used an LLM to write it.

Prompt-Heavy Workload (8K input / 1K output)

┌──────────────────────┬──────────────┬─────────────┬─────────────┬──────────┐
│ Platform             │ Output Tok/s │ TTFT (mean) │ TPOT (mean) │ Duration │
├──────────────────────┼──────────────┼─────────────┼─────────────┼──────────┤
│ Blackwell 6000 Pro   │       343.81 │      29.99s │      5.82ms │   46.54s │
│ DGX Spark            │       171.64 │      42.24s │     18.76ms │   93.22s │
│ Jetson Thor          │       124.23 │      63.25s │     24.94ms │  128.79s │
└──────────────────────┴──────────────┴─────────────┴─────────────┴──────────┘

Decode-Heavy Workload (1K input / 8K output)

┌──────────────────────┬──────────────┬─────────────┬─────────────┬──────────┐
│ Platform             │ Output Tok/s │ TTFT (mean) │ TPOT (mean) │ Duration │
├──────────────────────┼──────────────┼─────────────┼─────────────┼──────────┤
│ Blackwell 6000 Pro   │     1,052.68 │      45.76s │      3.64ms │  121.59s │
│ DGX Spark            │       268.21 │     168.08s │     13.82ms │  477.23s │
│ Jetson Thor          │       239.06 │     200.81s │     16.21ms │  535.42s │
└──────────────────────┴──────────────┴─────────────┴─────────────┴──────────┘

Balanced Workload (1K input / 1K output)

┌──────────────────────┬──────────────┬─────────────┬─────────────┬──────────┐
│ Platform             │ Output Tok/s │ TTFT (mean) │ TPOT (mean) │ Duration │
├──────────────────────┼──────────────┼─────────────┼─────────────┼──────────┤
│ Blackwell 6000 Pro   │       817.52 │       7.06s │      4.53ms │   19.57s │
│ DGX Spark            │       249.47 │      25.43s │     15.14ms │   64.14s │
│ Jetson Thor          │       190.73 │      30.66s │     19.38ms │   83.89s │
└──────────────────────┴──────────────┴─────────────┴─────────────┴──────────┘
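
To put the three tables on one scale, the output throughput can be expressed relative to a single platform. The sketch below uses Jetson Thor as the baseline, which is an arbitrary choice; the `speedups` helper is a hypothetical name, and the numbers are copied from the tables above.

```python
# Output token throughput (tok/s), copied from the summary tables above.
output_tok_s = {
    "prompt-heavy": {"Blackwell 6000 Pro": 343.81, "DGX Spark": 171.64, "Jetson Thor": 124.23},
    "decode-heavy": {"Blackwell 6000 Pro": 1052.68, "DGX Spark": 268.21, "Jetson Thor": 239.06},
    "balanced": {"Blackwell 6000 Pro": 817.52, "DGX Spark": 249.47, "Jetson Thor": 190.73},
}

BASELINE = "Jetson Thor"  # arbitrary reference platform


def speedups(results, baseline=BASELINE):
    """Return {platform: throughput / baseline throughput} for one workload."""
    base = results[baseline]
    return {platform: value / base for platform, value in results.items()}


for workload, results in output_tok_s.items():
    ratios = speedups(results)
    row = "  ".join(f"{platform}: {ratio:.2f}x" for platform, ratio in ratios.items())
    print(f"{workload:<13} {row}")
```

Relative to Jetson Thor, Blackwell 6000 Pro comes out at roughly 2.8x (prompt-heavy), 4.4x (decode-heavy), and 4.3x (balanced), while DGX Spark lands between about 1.1x and 1.4x.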

Overall throughput should keep increasing at higher concurrency. In my experience, diminishing returns kick in well above 16, closer to 64.
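
One caveat when testing higher concurrency: the server command in the original post sets `--max-num-seqs 4`, which limits how many sequences the server schedules at once, so that value would likely need to be raised along with the client load. As a rough check, the sketch below estimates steady-state decode throughput as four streams each emitting one token per mean TPOT, and compares it with the measured decode-heavy throughput. The estimate and the `decode_estimate` helper are assumptions made for illustration; the numbers are copied from the results above.

```python
MAX_NUM_SEQS = 4  # from the `vllm serve` command in the original post

# Decode-heavy runs: (mean TPOT in ms, measured output tok/s), copied from the results.
decode_heavy = {
    "Blackwell 6000 Pro": (3.64, 1052.68),
    "DGX Spark": (13.82, 268.21),
    "Jetson Thor": (16.21, 239.06),
}


def decode_estimate(tpot_ms, num_seqs=MAX_NUM_SEQS):
    """Aggregate tok/s if num_seqs streams each emit one token every tpot_ms."""
    return num_seqs * 1000.0 / tpot_ms


for platform, (tpot_ms, measured) in decode_heavy.items():
    estimate = decode_estimate(tpot_ms)
    print(
        f"{platform:<20} estimate {estimate:8.1f} tok/s  "
        f"measured {measured:8.2f} tok/s ({measured / estimate:.0%} of estimate)"
    )
```

On all three platforms the measured throughput sits slightly below four times the per-stream rate. That would be consistent with only four of the 16 requests decoding at any time and the rest waiting in the queue, which could also explain the long TTFT values.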

For DGX Spark, they recommend setting these environment variables:

export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_FP8_MOE_BACKEND=flashinfer_cutlass
export FLASHINFER_DISABLE_VERSION_CHECK=1
export CUTE_DSL_ARCH=sm_121a

Did you set those?