Hello,
I ran vLLM benchmarks of the nvidia/Qwen3.6-35B-A3B-NVFP4 model on three NVIDIA platforms: Jetson Thor, DGX Spark, and Blackwell 6000 Pro. All tests used the same vLLM configuration: NVFP4 quantization, the FlashInfer attention backend, the Marlin MoE backend, and MTP speculative decoding.
I installed the nightly release of vLLM with the following command:
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
I tested three workloads by varying the input/output lengths (in tokens per request):
- Prompt-heavy: 8000 input / 1000 output
- Decode-heavy: 1000 input / 8000 output
- Balanced: 1000 input / 1000 output
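As a rough sketch of how these workloads map onto the benchmark runs, the snippet below builds the `vllm bench serve` argument list for each one and computes the token totals to expect. The flags and the 16-prompt count are taken from the benchmark commands later in this post; the `WORKLOADS` dict and the `bench_args` helper are illustrative names, not part of vLLM.

```python
# Illustrative helper (not part of vLLM): map each workload onto the
# `vllm bench serve` flags used in this post.
MODEL = "nvidia/Qwen3.6-35B-A3B-NVFP4"
NUM_PROMPTS = 16  # --num-prompts value from the benchmark commands

# name -> (input tokens, output tokens) per request
WORKLOADS = {
    "prompt-heavy": (8000, 1000),
    "decode-heavy": (1000, 8000),
    "balanced": (1000, 1000),
}


def bench_args(input_len: int, output_len: int) -> list[str]:
    """Return the argv list for one benchmark run."""
    return [
        "vllm", "bench", "serve",
        "--model", MODEL,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--request-rate", "10000",
        "--num-prompts", str(NUM_PROMPTS),
        "--ignore-eos",
    ]


for name, (input_len, output_len) in WORKLOADS.items():
    total_in = input_len * NUM_PROMPTS
    total_out = output_len * NUM_PROMPTS
    print(f"{name}: expect {total_in} input / {total_out} generated tokens")
    print("  " + " ".join(bench_args(input_len, output_len)))
```

Passing one of these lists to `subprocess.run` would start a run against a server that is already up. The expected totals should match the `Total input tokens` and `Total generated tokens` lines in each report.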
I used the same serve command on every machine:
vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
--port 8000 \
--tensor-parallel-size 1 \
--trust-remote-code \
--dtype auto \
--quantization modelopt \
--kv-cache-dtype fp8 \
--attention-backend flashinfer \
--moe-backend marlin \
--gpu-memory-utilization 0.85 \
--max-model-len 65536 \
--max-num-seqs 4 \
--max-num-batched-tokens 8192 \
--enable-chunked-prefill \
--async-scheduling \
--enable-prefix-caching \
--speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'
NVIDIA DGX Spark
1. Prompt-heavy
vllm bench serve \
--model nvidia/Qwen3.6-35B-A3B-NVFP4 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
Output:
============ Serving Benchmark Result ============
Successful requests: 16
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 93.22
Total input tokens: 128000
Total generated tokens: 16000
Request throughput (req/s): 0.17
Output token throughput (tok/s): 171.64
Peak output token throughput (tok/s): 92.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 1544.75
---------------Time to First Token----------------
Mean TTFT (ms): 42235.75
Median TTFT (ms): 42243.32
P99 TTFT (ms): 76218.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 18.76
Median TPOT (ms): 18.71
P99 TPOT (ms): 26.40
---------------Inter-token Latency----------------
Mean ITL (ms): 57.44
Median ITL (ms): 48.03
P99 ITL (ms): 621.62
---------------Speculative Decoding---------------
Acceptance rate (%): 68.81
Acceptance length: 3.06
Drafts: 5221
Draft tokens: 15663
Accepted tokens: 10778
Per-position acceptance (%):
Position 0: 80.21
Position 1: 67.96
Position 2: 58.26
==================================================
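The summary lines in this report can be cross-checked against its raw counts. The sketch below recomputes throughput and the speculative-decoding statistics from the numbers above. It assumes acceptance length is counted as one verified token per draft step plus the accepted draft tokens; that definition is inferred from the figures here, not taken from the vLLM documentation.

```python
# Raw values copied from the DGX Spark prompt-heavy report above.
duration_s = 93.22
input_tokens = 128_000
generated_tokens = 16_000
drafts = 5221          # number of draft (speculation) steps
draft_tokens = 15_663  # drafts * num_speculative_tokens (3)
accepted_tokens = 10_778

output_tps = generated_tokens / duration_s
total_tps = (input_tokens + generated_tokens) / duration_s

# Share of proposed draft tokens that the target model accepted.
acceptance_rate = 100 * accepted_tokens / draft_tokens
# Assumed definition: each step yields 1 target token + accepted drafts.
acceptance_length = 1 + accepted_tokens / drafts

print(f"output tok/s:      {output_tps:.2f}")
print(f"total tok/s:       {total_tps:.2f}")
print(f"acceptance rate:   {acceptance_rate:.2f} %")
print(f"acceptance length: {acceptance_length:.2f}")
```

Small differences in the last digit against the report are expected, because the printed benchmark duration is itself rounded.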
2. Decode-heavy
vllm bench serve \
--model nvidia/Qwen3.6-35B-A3B-NVFP4 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 8000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
Output:
============ Serving Benchmark Result ============
Successful requests: 16
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 477.23
Total input tokens: 16000
Total generated tokens: 128000
Request throughput (req/s): 0.03
Output token throughput (tok/s): 268.21
Peak output token throughput (tok/s): 92.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 301.74
---------------Time to First Token----------------
Mean TTFT (ms): 168075.33
Median TTFT (ms): 166140.82
P99 TTFT (ms): 358607.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 13.82
Median TPOT (ms): 13.86
P99 TPOT (ms): 17.26
---------------Inter-token Latency----------------
Mean ITL (ms): 47.12
Median ITL (ms): 47.73
P99 ITL (ms): 51.24
---------------Speculative Decoding---------------
Acceptance rate (%): 80.35
Acceptance length: 3.41
Drafts: 37532
Draft tokens: 112596
Accepted tokens: 90470
Per-position acceptance (%):
Position 0: 91.78
Position 1: 80.40
Position 2: 68.86
==================================================
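One pattern worth noting in these reports: mean ITL is roughly mean TPOT multiplied by the acceptance length. A plausible reading, stated here as an assumption rather than a documented fact, is that with speculative decoding tokens arrive in bursts of several accepted tokens, so ITL measures the gap between bursts while TPOT averages over every token. The sketch below checks the relation against the DGX Spark prompt-heavy and decode-heavy numbers.

```python
# (mean TPOT ms, acceptance length, mean ITL ms) from the two DGX Spark
# reports above.
runs = {
    "prompt-heavy": (18.76, 3.06, 57.44),
    "decode-heavy": (13.82, 3.41, 47.12),
}

rel_errors = {}
for name, (tpot_ms, accept_len, itl_ms) in runs.items():
    # Assumption: one burst of `accept_len` tokens per ITL interval.
    predicted_itl = tpot_ms * accept_len
    rel_errors[name] = abs(predicted_itl - itl_ms) / itl_ms
    print(f"{name}: predicted ITL {predicted_itl:.2f} ms, "
          f"reported {itl_ms:.2f} ms")
```

If the assumption holds, TPOT is the more useful figure for the per-token generation speed of a single stream, while ITL reflects the cadence of decode steps.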
3. Balanced
vllm bench serve \
--model nvidia/Qwen3.6-35B-A3B-NVFP4 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
Output:
============ Serving Benchmark Result ============
Successful requests: 16
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 64.14
Total input tokens: 16000
Total generated tokens: 16000
Request throughput (req/s): 0.25
Output token throughput (tok/s): 249.47
Peak output token throughput (tok/s): 92.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 498.94
---------------Time to First Token----------------
Mean TTFT (ms): 25433.94
Median TTFT (ms): 25838.86
P99 TTFT (ms): 53299.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 15.14
Median TPOT (ms): 15.13
P99 TPOT (ms): 18.88
---------------Inter-token Latency----------------
Mean ITL (ms): 47.63
Median ITL (ms): 47.01
P99 ITL (ms): 51.18
---------------Speculative Decoding---------------
Acceptance rate (%): 71.66
Acceptance length: 3.15
Drafts: 5082
Draft tokens: 15246
Accepted tokens: 10926
Per-position acceptance (%):
Position 0: 84.85
Position 1: 71.88
Position 2: 58.26
==================================================
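The large TTFT values are probably dominated by queueing rather than prefill: the benchmark submits all 16 requests at once (`--request-rate 10000`), while the server runs at most 4 sequences concurrently (`--max-num-seqs 4`). Below is a deliberately crude model, assuming the requests run in four equal-length waves and that prefill time is negligible. `mean_ttft_estimate` is a hypothetical helper, and the fit is only approximate.

```python
import math


def mean_ttft_estimate(duration_s: float, num_prompts: int = 16,
                       max_num_seqs: int = 4) -> float:
    """Estimate mean TTFT (s), assuming equal waves and zero prefill time."""
    waves = math.ceil(num_prompts / max_num_seqs)
    wave_s = duration_s / waves
    # Request i waits for all earlier waves to finish before it starts.
    waits = [(i // max_num_seqs) * wave_s for i in range(num_prompts)]
    return sum(waits) / num_prompts


# (benchmark duration s, reported mean TTFT s) for DGX Spark
reports = {
    "prompt-heavy": (93.22, 42.24),
    "decode-heavy": (477.23, 168.08),
    "balanced": (64.14, 25.43),
}
for name, (duration_s, reported_s) in reports.items():
    est = mean_ttft_estimate(duration_s)
    print(f"{name}: estimated {est:.1f} s, reported {reported_s:.1f} s")
```

The estimates land within about 20% of the reported means, which is consistent with queueing being the main contributor. Because the model ignores prefill cost, it underestimates the prompt-heavy case the most.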
Blackwell 6000 Pro
1. Prompt-heavy
============ Serving Benchmark Result ============
Successful requests: 16
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 46.54
Total input tokens: 128000
Total generated tokens: 16000
Request throughput (req/s): 0.34
Output token throughput (tok/s): 343.81
Peak output token throughput (tok/s): 316.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 3094.29
---------------Time to First Token----------------
Mean TTFT (ms): 29991.70
Median TTFT (ms): 30555.34
P99 TTFT (ms): 41558.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 5.82
Median TPOT (ms): 5.61
P99 TPOT (ms): 11.64
---------------Inter-token Latency----------------
Mean ITL (ms): 15.68
Median ITL (ms): 12.72
P99 ITL (ms): 143.13
---------------Speculative Decoding---------------
Acceptance rate (%): 56.50
Acceptance length: 2.69
Drafts: 5936
Draft tokens: 17808
Accepted tokens: 10061
Per-position acceptance (%):
Position 0: 72.29
Position 1: 53.61
Position 2: 43.60
==================================================
2. Decode-heavy
============ Serving Benchmark Result ============
Successful requests: 16
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 121.59
Total input tokens: 16000
Total generated tokens: 128000
Request throughput (req/s): 0.13
Output token throughput (tok/s): 1052.68
Peak output token throughput (tok/s): 324.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 1184.26
---------------Time to First Token----------------
Mean TTFT (ms): 45762.40
Median TTFT (ms): 44012.42
P99 TTFT (ms): 95360.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 3.64
Median TPOT (ms): 3.65
P99 TPOT (ms): 4.04
---------------Inter-token Latency----------------
Mean ITL (ms): 12.78
Median ITL (ms): 12.73
P99 ITL (ms): 13.69
---------------Speculative Decoding---------------
Acceptance rate (%): 83.87
Acceptance length: 3.52
Drafts: 36407
Draft tokens: 109221
Accepted tokens: 91599
Per-position acceptance (%):
Position 0: 92.03
Position 1: 83.93
Position 2: 75.64
==================================================
3. Balanced
============ Serving Benchmark Result ============
Successful requests: 16
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 19.57
Total input tokens: 16000
Total generated tokens: 16000
Request throughput (req/s): 0.82
Output token throughput (tok/s): 817.52
Peak output token throughput (tok/s): 336.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 1635.03
---------------Time to First Token----------------
Mean TTFT (ms): 7058.61
Median TTFT (ms): 7091.46
P99 TTFT (ms): 14507.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.53
Median TPOT (ms): 4.60
P99 TPOT (ms): 5.25
---------------Inter-token Latency----------------
Mean ITL (ms): 12.54
Median ITL (ms): 12.32
P99 ITL (ms): 13.32
---------------Speculative Decoding---------------
Acceptance rate (%): 59.03
Acceptance length: 2.77
Drafts: 5771
Draft tokens: 17313
Accepted tokens: 10220
Per-position acceptance (%):
Position 0: 77.09
Position 1: 58.10
Position 2: 41.90
==================================================
NVIDIA Jetson Thor
1. Prompt-heavy
============ Serving Benchmark Result ============
Successful requests: 16
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 128.79
Total input tokens: 128000
Total generated tokens: 16000
Request throughput (req/s): 0.12
Output token throughput (tok/s): 124.23
Peak output token throughput (tok/s): 72.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 1118.09
---------------Time to First Token----------------
Mean TTFT (ms): 63249.67
Median TTFT (ms): 63530.00
P99 TTFT (ms): 111706.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 24.94
Median TPOT (ms): 27.59
P99 TPOT (ms): 34.78
---------------Inter-token Latency----------------
Mean ITL (ms): 79.03
Median ITL (ms): 59.89
P99 ITL (ms): 1141.24
---------------Speculative Decoding---------------
Acceptance rate (%): 72.40
Acceptance length: 3.17
Drafts: 5045
Draft tokens: 15135
Accepted tokens: 10958
Per-position acceptance (%):
Position 0: 82.99
Position 1: 72.53
Position 2: 61.68
==================================================
2. Decode-heavy
============ Serving Benchmark Result ============
Successful requests: 16
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 535.42
Total input tokens: 16000
Total generated tokens: 128000
Request throughput (req/s): 0.03
Output token throughput (tok/s): 239.06
Peak output token throughput (tok/s): 76.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 268.95
---------------Time to First Token----------------
Mean TTFT (ms): 200813.77
Median TTFT (ms): 197900.86
P99 TTFT (ms): 404952.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 16.21
Median TPOT (ms): 16.13
P99 TPOT (ms): 19.33
---------------Inter-token Latency----------------
Mean ITL (ms): 56.90
Median ITL (ms): 56.78
P99 ITL (ms): 62.42
---------------Speculative Decoding---------------
Acceptance rate (%): 83.73
Acceptance length: 3.51
Drafts: 36451
Draft tokens: 109353
Accepted tokens: 91556
Per-position acceptance (%):
Position 0: 93.40
Position 1: 82.52
Position 2: 75.25
==================================================
3. Balanced
============ Serving Benchmark Result ============
Successful requests: 16
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 83.89
Total input tokens: 16000
Total generated tokens: 16000
Request throughput (req/s): 0.19
Output token throughput (tok/s): 190.73
Peak output token throughput (tok/s): 84.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 381.47
---------------Time to First Token----------------
Mean TTFT (ms): 30657.07
Median TTFT (ms): 30225.81
P99 TTFT (ms): 68555.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 19.38
Median TPOT (ms): 19.02
P99 TPOT (ms): 31.65
---------------Inter-token Latency----------------
Mean ITL (ms): 52.71
Median ITL (ms): 51.47
P99 ITL (ms): 56.62
---------------Speculative Decoding---------------
Acceptance rate (%): 57.46
Acceptance length: 2.72
Drafts: 5876
Draft tokens: 17628
Accepted tokens: 10129
Per-position acceptance (%):
Position 0: 71.15
Position 1: 58.34
Position 2: 42.89
==================================================
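To compare the platforms at a glance, the sketch below collects the output token throughput from each report and normalizes it to DGX Spark. The numbers are copied from the results above; the dict layout is just one illustrative way to organize them.

```python
# Output token throughput (tok/s) copied from the reports above.
output_tps = {
    "DGX Spark": {
        "prompt-heavy": 171.64, "decode-heavy": 268.21, "balanced": 249.47,
    },
    "Blackwell 6000 Pro": {
        "prompt-heavy": 343.81, "decode-heavy": 1052.68, "balanced": 817.52,
    },
    "Jetson Thor": {
        "prompt-heavy": 124.23, "decode-heavy": 239.06, "balanced": 190.73,
    },
}

# Normalize every platform to the DGX Spark result for the same workload.
baseline = output_tps["DGX Spark"]
relative = {
    platform: {wl: tps / baseline[wl] for wl, tps in results.items()}
    for platform, results in output_tps.items()
}

workloads = list(baseline)
print(f"{'platform':<20}" + "".join(f"{wl:>14}" for wl in workloads))
for platform, ratios in relative.items():
    row = "".join(f"{ratios[wl]:>13.2f}x" for wl in workloads)
    print(f"{platform:<20}" + row)
```

On these runs the Blackwell 6000 Pro delivers roughly 2x to 4x the output throughput of DGX Spark, and Jetson Thor roughly 0.7x to 0.9x, with the largest gap on the decode-heavy workload.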