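Command templates (M is the model ID; X is the tensor-parallel size in the first command and the number of prompts in the second):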
docker exec -it vllm_node bash -i -c "vllm serve M --host 0.0.0.0 --trust_remote_code --gpu-memory-utilization 0.8 -pp 1 -tp X --distributed-executor-backend ray --load-format fastsafetensors --kv-cache-dtype fp8"
vllm bench serve --backend vllm --model M --host 10.20.0.4 --endpoint /v1/completions --hf-name sharegpt --num-prompts X --port 8000
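The three load levels reported below (1, 10 and 100 prompts) can be generated from the second template with a small helper. This is only a sketch: the function name `bench_command` and the idea of assembling the argument list in Python are illustrative assumptions, while the flags, host and port are copied unchanged from the template above. The script prints the commands and does not run them.

```python
import shlex


def bench_command(model: str, num_prompts: int,
                  host: str = "10.20.0.4", port: int = 8000) -> list[str]:
    """Build the `vllm bench serve` argument list from the template (not executed)."""
    return [
        "vllm", "bench", "serve",
        "--backend", "vllm",
        "--model", model,
        "--host", host,
        "--endpoint", "/v1/completions",
        "--hf-name", "sharegpt",
        "--num-prompts", str(num_prompts),
        "--port", str(port),
    ]


if __name__ == "__main__":
    # One command per load level used in the results below.
    for n in (1, 10, 100):
        print(shlex.join(bench_command("Qwen/Qwen3-VL-32B-Instruct-FP8", n)))
```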
Qwen/Qwen3-VL-32B-Instruct-FP8
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 6.68
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.15
Output token throughput (tok/s): 19.15
Peak output token throughput (tok/s): 20.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 172.39
---------------Time to First Token----------------
Mean TTFT (ms): 83.91
Median TTFT (ms): 83.91
P99 TTFT (ms): 83.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 51.95
Median TPOT (ms): 51.95
P99 TPOT (ms): 51.95
---------------Inter-token Latency----------------
Mean ITL (ms): 51.95
Median ITL (ms): 51.75
P99 ITL (ms): 55.00
==================================================
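The figures in a single-request block are tied together by simple arithmetic, which gives a quick sanity check. Assuming TPOT is the time after the first token averaged over the remaining output tokens (which is what the "excl. 1st token" label indicates), end-to-end latency is roughly TTFT + (N - 1) × TPOT, and throughput is tokens divided by duration. The helper names below are illustrative; the numbers come from the block above.

```python
def end_to_end_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate request latency: first token, then (N - 1) further tokens."""
    return ttft_ms + (output_tokens - 1) * tpot_ms


def throughput(tokens: int, duration_s: float) -> float:
    """Tokens per second over the whole benchmark run."""
    return tokens / duration_s


if __name__ == "__main__":
    # Qwen3-VL-32B-FP8, 1 request: TTFT 83.91 ms, TPOT 51.95 ms, 128 output tokens.
    latency_s = end_to_end_ms(83.91, 51.95, 128) / 1000
    print(f"estimated latency: {latency_s:.2f} s (reported duration: 6.68 s)")
    # Throughput recomputed from token counts and the rounded duration.
    print(f"output tok/s: {throughput(128, 6.68):.2f} (reported: 19.15)")
    print(f"total tok/s:  {throughput(1024 + 128, 6.68):.2f} (reported: 172.39)")
```

The recomputed throughput differs slightly from the reported values, which is consistent with the duration being rounded to two decimals in the printout.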
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 10.52
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.95
Output token throughput (tok/s): 121.64
Peak output token throughput (tok/s): 170.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 1094.79
---------------Time to First Token----------------
Mean TTFT (ms): 1693.08
Median TTFT (ms): 1731.72
P99 TTFT (ms): 2623.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 68.42
Median TPOT (ms): 68.12
P99 TPOT (ms): 76.79
---------------Inter-token Latency----------------
Mean ITL (ms): 68.42
Median ITL (ms): 62.29
P99 ITL (ms): 515.58
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 37.87
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 2.64
Output token throughput (tok/s): 338.03
Peak output token throughput (tok/s): 900.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 3042.23
---------------Time to First Token----------------
Mean TTFT (ms): 11606.98
Median TTFT (ms): 11121.38
P99 TTFT (ms): 24464.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 190.38
Median TPOT (ms): 193.39
P99 TPOT (ms): 260.85
---------------Inter-token Latency----------------
Mean ITL (ms): 190.38
Median ITL (ms): 111.25
P99 ITL (ms): 539.90
==================================================
Qwen/Qwen3-VL-235B-A22B-Instruct-FP8
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 5.76
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.17
Output token throughput (tok/s): 22.23
Peak output token throughput (tok/s): 23.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 200.04
---------------Time to First Token----------------
Mean TTFT (ms): 127.74
Median TTFT (ms): 127.74
P99 TTFT (ms): 127.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 44.34
Median TPOT (ms): 44.34
P99 TPOT (ms): 44.34
---------------Inter-token Latency----------------
Mean ITL (ms): 44.34
Median ITL (ms): 43.91
P99 ITL (ms): 47.46
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 24.28
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.41
Output token throughput (tok/s): 52.72
Peak output token throughput (tok/s): 70.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 474.52
---------------Time to First Token----------------
Mean TTFT (ms): 2665.94
Median TTFT (ms): 2717.83
P99 TTFT (ms): 4135.27
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 168.49
Median TPOT (ms): 168.25
P99 TPOT (ms): 180.70
---------------Inter-token Latency----------------
Mean ITL (ms): 168.49
Median ITL (ms): 161.96
P99 ITL (ms): 788.41
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 74.38
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 1.34
Output token throughput (tok/s): 172.08
Peak output token throughput (tok/s): 400.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 1548.71
---------------Time to First Token----------------
Mean TTFT (ms): 17847.62
Median TTFT (ms): 17271.93
P99 TTFT (ms): 38251.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 401.05
Median TPOT (ms): 405.94
P99 TPOT (ms): 489.78
---------------Inter-token Latency----------------
Mean ITL (ms): 401.05
Median ITL (ms): 301.32
P99 ITL (ms): 846.77
==================================================
QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 3.95
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.25
Output token throughput (tok/s): 32.37
Peak output token throughput (tok/s): 33.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 291.31
---------------Time to First Token----------------
Mean TTFT (ms): 88.34
Median TTFT (ms): 88.34
P99 TTFT (ms): 88.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 30.44
Median TPOT (ms): 30.44
P99 TPOT (ms): 30.44
---------------Inter-token Latency----------------
Mean ITL (ms): 30.44
Median ITL (ms): 30.26
P99 ITL (ms): 32.65
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 15.86
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.63
Output token throughput (tok/s): 80.71
Peak output token throughput (tok/s): 110.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 726.39
---------------Time to First Token----------------
Mean TTFT (ms): 2363.01
Median TTFT (ms): 2390.65
P99 TTFT (ms): 3841.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 105.40
Median TPOT (ms): 105.28
P99 TPOT (ms): 116.69
---------------Inter-token Latency----------------
Mean ITL (ms): 105.40
Median ITL (ms): 96.24
P99 ITL (ms): 660.66
==================================================
GPT-OSS-20B
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 1.50
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.67
Output token throughput (tok/s): 85.25
Peak output token throughput (tok/s): 83.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 767.26
---------------Time to First Token----------------
Mean TTFT (ms): 41.07
Median TTFT (ms): 41.07
P99 TTFT (ms): 41.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 11.50
Median TPOT (ms): 11.50
P99 TPOT (ms): 11.50
---------------Inter-token Latency----------------
Mean ITL (ms): 11.50
Median ITL (ms): 10.66
P99 ITL (ms): 20.37
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 3.70
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 2.71
Output token throughput (tok/s): 346.25
Peak output token throughput (tok/s): 460.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 3116.22
---------------Time to First Token----------------
Mean TTFT (ms): 643.37
Median TTFT (ms): 613.75
P99 TTFT (ms): 1084.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 23.79
Median TPOT (ms): 24.05
P99 TPOT (ms): 27.63
---------------Inter-token Latency----------------
Mean ITL (ms): 23.79
Median ITL (ms): 19.11
P99 ITL (ms): 179.76
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 12.81
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 7.80
Output token throughput (tok/s): 999.02
Peak output token throughput (tok/s): 2800.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 8991.14
---------------Time to First Token----------------
Mean TTFT (ms): 3979.85
Median TTFT (ms): 3904.99
P99 TTFT (ms): 8486.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 64.12
Median TPOT (ms): 65.14
P99 TPOT (ms): 87.99
---------------Inter-token Latency----------------
Mean ITL (ms): 64.12
Median ITL (ms): 35.34
P99 ITL (ms): 187.69
==================================================
GPT-OSS-120B
8 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 2.07
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.48
Output token throughput (tok/s): 61.86
Peak output token throughput (tok/s): 64.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 556.70
---------------Time to First Token----------------
Mean TTFT (ms): 47.08
Median TTFT (ms): 47.08
P99 TTFT (ms): 47.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 15.92
Median TPOT (ms): 15.92
P99 TPOT (ms): 15.92
---------------Inter-token Latency----------------
Mean ITL (ms): 15.92
Median ITL (ms): 14.10
P99 ITL (ms): 24.64
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 5.01
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 2.00
Output token throughput (tok/s): 255.50
Peak output token throughput (tok/s): 360.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 2299.53
---------------Time to First Token----------------
Mean TTFT (ms): 877.16
Median TTFT (ms): 928.99
P99 TTFT (ms): 1400.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 32.17
Median TPOT (ms): 31.83
P99 TPOT (ms): 37.86
---------------Inter-token Latency----------------
Mean ITL (ms): 32.17
Median ITL (ms): 25.62
P99 ITL (ms): 214.43
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 21.57
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 4.64
Output token throughput (tok/s): 593.55
Peak output token throughput (tok/s): 1600.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 5341.94
---------------Time to First Token----------------
Mean TTFT (ms): 6357.06
Median TTFT (ms): 6132.10
P99 TTFT (ms): 13638.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 111.45
Median TPOT (ms): 114.05
P99 TPOT (ms): 149.78
---------------Inter-token Latency----------------
Mean ITL (ms): 111.45
Median ITL (ms): 63.22
P99 ITL (ms): 306.93
==================================================
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 1.79
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.56
Output token throughput (tok/s): 71.54
Peak output token throughput (tok/s): 71.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 643.89
---------------Time to First Token----------------
Mean TTFT (ms): 37.10
Median TTFT (ms): 37.10
P99 TTFT (ms): 37.10
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 13.79
Median TPOT (ms): 13.79
P99 TPOT (ms): 13.79
---------------Inter-token Latency----------------
Mean ITL (ms): 13.79
Median ITL (ms): 13.73
P99 ITL (ms): 15.01
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 5.21
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 1.92
Output token throughput (tok/s): 245.57
Peak output token throughput (tok/s): 340.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 2210.11
---------------Time to First Token----------------
Mean TTFT (ms): 908.51
Median TTFT (ms): 962.65
P99 TTFT (ms): 1462.06
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 33.48
Median TPOT (ms): 33.10
P99 TPOT (ms): 38.47
---------------Inter-token Latency----------------
Mean ITL (ms): 33.48
Median ITL (ms): 29.98
P99 ITL (ms): 298.82
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 24.67
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 4.05
Output token throughput (tok/s): 518.78
Peak output token throughput (tok/s): 1300.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 4669.03
---------------Time to First Token----------------
Mean TTFT (ms): 6719.06
Median TTFT (ms): 6547.16
P99 TTFT (ms): 14641.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 130.26
Median TPOT (ms): 132.70
P99 TPOT (ms): 167.78
---------------Inter-token Latency----------------
Mean ITL (ms): 130.26
Median ITL (ms): 84.86
P99 ITL (ms): 424.07
==================================================
zai-org/GLM-4.6-FP8
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 8.40
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.12
Output token throughput (tok/s): 15.23
Peak output token throughput (tok/s): 16.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 137.08
---------------Time to First Token----------------
Mean TTFT (ms): 224.50
Median TTFT (ms): 224.50
P99 TTFT (ms): 224.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 64.40
Median TPOT (ms): 64.40
P99 TPOT (ms): 64.40
---------------Inter-token Latency----------------
Mean ITL (ms): 64.40
Median ITL (ms): 64.32
P99 ITL (ms): 66.00
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 40.67
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.25
Output token throughput (tok/s): 31.48
Peak output token throughput (tok/s): 40.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 283.28
---------------Time to First Token----------------
Mean TTFT (ms): 5797.18
Median TTFT (ms): 5759.51
P99 TTFT (ms): 8694.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 272.29
Median TPOT (ms): 272.72
P99 TPOT (ms): 301.66
---------------Inter-token Latency----------------
Mean ITL (ms): 272.29
Median ITL (ms): 257.01
P99 ITL (ms): 1718.29
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 142.28
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 0.70
Output token throughput (tok/s): 89.96
Peak output token throughput (tok/s): 200.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 809.67
---------------Time to First Token----------------
Mean TTFT (ms): 38421.92
Median TTFT (ms): 36869.57
P99 TTFT (ms): 81050.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 749.07
Median TPOT (ms): 761.59
P99 TPOT (ms): 962.73
---------------Inter-token Latency----------------
Mean ITL (ms): 749.07
Median ITL (ms): 514.18
P99 ITL (ms): 1774.77
==================================================
nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 4.58
Total input tokens: 1023
Total generated tokens: 128
Request throughput (req/s): 0.22
Output token throughput (tok/s): 27.93
Peak output token throughput (tok/s): 30.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 251.18
---------------Time to First Token----------------
Mean TTFT (ms): 382.88
Median TTFT (ms): 382.88
P99 TTFT (ms): 382.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 33.07
Median TPOT (ms): 33.07
P99 TPOT (ms): 33.07
---------------Inter-token Latency----------------
Mean ITL (ms): 33.07
Median ITL (ms): 30.99
P99 ITL (ms): 43.77
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 11.95
Total input tokens: 10230
Total generated tokens: 1280
Request throughput (req/s): 0.84
Output token throughput (tok/s): 107.10
Peak output token throughput (tok/s): 150.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 963.11
---------------Time to First Token----------------
Mean TTFT (ms): 1712.12
Median TTFT (ms): 1843.62
P99 TTFT (ms): 2657.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 76.48
Median TPOT (ms): 74.81
P99 TPOT (ms): 84.39
---------------Inter-token Latency----------------
Mean ITL (ms): 76.48
Median ITL (ms): 67.11
P99 ITL (ms): 394.16
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 41.58
Total input tokens: 102300
Total generated tokens: 12800
Request throughput (req/s): 2.40
Output token throughput (tok/s): 307.83
Peak output token throughput (tok/s): 800.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 2768.06
---------------Time to First Token----------------
Mean TTFT (ms): 10097.28
Median TTFT (ms): 9424.95
P99 TTFT (ms): 22795.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 222.56
Median TPOT (ms): 227.89
P99 TPOT (ms): 269.39
---------------Inter-token Latency----------------
Mean ITL (ms): 222.56
Median ITL (ms): 140.08
P99 ITL (ms): 618.16
==================================================
nvidia/Llama-3.3-70B-Instruct-NVFP4
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 10.03
Total input tokens: 1023
Total generated tokens: 128
Request throughput (req/s): 0.10
Output token throughput (tok/s): 12.77
Peak output token throughput (tok/s): 14.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 114.79
---------------Time to First Token----------------
Mean TTFT (ms): 312.52
Median TTFT (ms): 312.52
P99 TTFT (ms): 312.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 76.49
Median TPOT (ms): 76.49
P99 TPOT (ms): 76.49
---------------Inter-token Latency----------------
Mean ITL (ms): 76.49
Median ITL (ms): 73.79
P99 ITL (ms): 87.22
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 21.36
Total input tokens: 10230
Total generated tokens: 1280
Request throughput (req/s): 0.47
Output token throughput (tok/s): 59.93
Peak output token throughput (tok/s): 90.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 538.92
---------------Time to First Token----------------
Mean TTFT (ms): 3472.04
Median TTFT (ms): 3544.04
P99 TTFT (ms): 5023.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 138.56
Median TPOT (ms): 138.05
P99 TPOT (ms): 151.98
---------------Inter-token Latency----------------
Mean ITL (ms): 138.56
Median ITL (ms): 125.58
P99 ITL (ms): 839.30
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 69.30
Total input tokens: 102300
Total generated tokens: 12800
Request throughput (req/s): 1.44
Output token throughput (tok/s): 184.69
Peak output token throughput (tok/s): 600.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 1660.78
---------------Time to First Token----------------
Mean TTFT (ms): 23961.51
Median TTFT (ms): 23834.79
P99 TTFT (ms): 47879.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 331.80
Median TPOT (ms): 333.44
P99 TPOT (ms): 481.50
---------------Inter-token Latency----------------
Mean ITL (ms): 331.80
Median ITL (ms): 182.62
P99 ITL (ms): 1731.80
==================================================
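To compare runs without copying numbers by hand, the result blocks above can be parsed into dictionaries. The sketch below assumes each metric sits on its own `label: value` line, as in this listing; `parse_blocks` is a hypothetical helper, not part of vLLM.

```python
import re

HEADER = "Serving Benchmark Result"
# A metric line is "<label>: <number>"; divider and model-name lines do not match.
METRIC = re.compile(r"^(?P<label>[^:]+):\s*(?P<value>-?\d+(?:\.\d+)?)\s*$")


def parse_blocks(text: str) -> list[dict[str, float]]:
    """Split benchmark output into one {label: value} dict per result block."""
    blocks: list[dict[str, float]] = []
    for line in text.splitlines():
        line = line.strip()
        if HEADER in line:
            blocks.append({})  # a header line starts a new block
            continue
        match = METRIC.match(line)
        if match and blocks:  # skip metric-like lines that precede any header
            blocks[-1][match["label"].strip()] = float(match["value"])
    return blocks


if __name__ == "__main__":
    sample = """\
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 6.68
---------------Time to First Token----------------
Mean TTFT (ms): 83.91
==================================================
"""
    for block in parse_blocks(sample):
        print(block)
```

The model name and node count lines are not captured by this sketch, so each parsed block still has to be matched to its model by position.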
Summary of Results
- GPT-OSS-20B is the throughput leader. It delivers the highest throughput at every concurrency level tested, hitting ~9k tok/s total at 100 concurrent requests with the lowest TPOT at that load (64ms). Single-request latency is excellent (TTFT 41ms), and mean TTFT at 100 requests (≈4.0s) is the lowest of any model.
- GPT-OSS-120B offers the best balance for a large model. On 4 nodes it has the lowest single-request TTFT of any run (37ms) and reaches ~4.7k tok/s total at 100 concurrent, with TPOT staying controlled (130ms at 100 reqs) compared to other big models. On 8 nodes, throughput at 100 concurrent rises to ~5.3k tok/s (TPOT 111ms), but the single-request case is slightly slower (TTFT 47ms, TPOT 15.9ms vs 13.8ms).
- Qwen3-VL-32B-FP8 is solid for moderate workloads. Single-request latency is acceptable (TTFT 84ms), and it reaches ~3k tok/s total at 100 concurrent. However, TTFT climbs significantly under load (≈11.6s at 100 reqs), making it feel sluggish for interactive use at high concurrency.
- Llama-4-Scout-17B-16E-NVFP4 performs similarly to Qwen3-VL-32B under load. Scaling behavior is comparable (TTFT ≈10s at 100 reqs, ~2.8k tok/s total), though single-request TTFT is higher (383ms), possibly due to MoE routing overhead.
- Qwen3-VL-235B-A22B-AWQ improves significantly over the FP8 variant at low concurrency. Single-request TPOT drops from 44ms to 30ms, and TTFT from 128ms to 88ms. At 10 concurrent, it’s still faster (TPOT 105ms vs 168ms), making AWQ worthwhile for latency-sensitive deployments of this model. No 100-request run is listed for the AWQ variant.
- Qwen3-VL-235B-A22B-FP8 degrades sharply under load. It is acceptable for single requests, but TTFT explodes with concurrency (≈17.8s at 100 reqs) and TPOT becomes very high (401ms). Throughput tops out at ~1.5k tok/s total.
- Llama-3.3-70B-NVFP4 is slow for its size. Despite having fewer total parameters than GPT-OSS-120B, it’s slower across the board: higher TTFT, worse TPOT, and lower throughput (~1.7k tok/s total at 100 concurrent). FP4 quantization overhead may contribute, but Llama-3.3-70B is also a dense model, so all 70B parameters are active for every token, whereas GPT-OSS-120B is a mixture-of-experts model that activates only a small fraction of its weights per token.
- GLM-4.6-FP8 degrades the hardest under load. TTFT becomes extreme (≈38s at 100 reqs) and TPOT balloons to 749ms. Not suitable for interactive or high-concurrency serving.
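The comparisons above can be cross-checked against the 100-request runs. The sketch below hard-codes the mean figures from the result blocks (total token throughput, mean TTFT, mean TPOT) and sorts by throughput; the `RUNS` table and its labels are just a convenient arrangement for illustration. Qwen3-VL-235B-A22B-AWQ is omitted because no 100-request run is listed for it.

```python
# (label, total tok/s, mean TTFT ms, mean TPOT ms) at 100 concurrent requests,
# copied from the result blocks in this post.
RUNS = [
    ("Qwen3-VL-32B-FP8 (4 nodes)", 3042.23, 11606.98, 190.38),
    ("Qwen3-VL-235B-A22B-FP8 (4 nodes)", 1548.71, 17847.62, 401.05),
    ("GPT-OSS-20B (4 nodes)", 8991.14, 3979.85, 64.12),
    ("GPT-OSS-120B (8 nodes)", 5341.94, 6357.06, 111.45),
    ("GPT-OSS-120B (4 nodes)", 4669.03, 6719.06, 130.26),
    ("GLM-4.6-FP8 (4 nodes)", 809.67, 38421.92, 749.07),
    ("Llama-4-Scout-NVFP4 (4 nodes)", 2768.06, 10097.28, 222.56),
    ("Llama-3.3-70B-NVFP4 (4 nodes)", 1660.78, 23961.51, 331.80),
]


def ranked(runs):
    """Sort runs by total token throughput, highest first."""
    return sorted(runs, key=lambda run: run[1], reverse=True)


if __name__ == "__main__":
    print(f"{'model':<34}{'tok/s':>10}{'TTFT (s)':>10}{'TPOT (ms)':>11}")
    for label, tok_s, ttft_ms, tpot_ms in ranked(RUNS):
        print(f"{label:<34}{tok_s:>10.0f}{ttft_ms / 1000:>10.1f}{tpot_ms:>11.0f}")
```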