6x Spark setup

Command templates:

docker exec -it vllm_node bash -i -c "vllm serve M --host 0.0.0.0 --trust_remote_code --gpu-memory-utilization 0.8 -pp 1 -tp X --distributed-executor-backend ray --load-format fastsafetensors --kv-cache-dtype fp8"
vllm bench serve --backend vllm --model M --host 10.20.0.4 --endpoint /v1/completions --hf-name sharegpt --num-prompts X --port 8000
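
The benchmark template can be filled in programmatically for a sweep over prompt counts. This is a minimal sketch, assuming the server from the first template is already running and reachable at 10.20.0.4:8000. The helper name bench_command and the SWEEP tuple are hypothetical; only flags that appear in the template above are used, and 1/10/100 are the prompt counts of the runs recorded below.

```python
import shlex

# Prompt counts used for the runs recorded in these notes.
SWEEP = (1, 10, 100)


def bench_command(model, num_prompts, host="10.20.0.4", port=8000):
    """Fill the benchmark template with a model name (M) and a prompt count (X)."""
    return [
        "vllm", "bench", "serve",
        "--backend", "vllm",
        "--model", model,
        "--host", host,
        "--endpoint", "/v1/completions",
        "--hf-name", "sharegpt",
        "--num-prompts", str(num_prompts),
        "--port", str(port),
    ]


if __name__ == "__main__":
    for n in SWEEP:
        # Print instead of executing, so the sketch runs without a vLLM install.
        print(shlex.join(bench_command("Qwen/Qwen3-VL-32B-Instruct-FP8", n)))
```

To actually launch a run, the returned list could be passed to subprocess.run on a machine where the vllm CLI is installed.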

Qwen/Qwen3-VL-32B-Instruct-FP8

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  6.68      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.15      
Output token throughput (tok/s):         19.15     
Peak output token throughput (tok/s):    20.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          172.39    
---------------Time to First Token----------------
Mean TTFT (ms):                          83.91     
Median TTFT (ms):                        83.91     
P99 TTFT (ms):                           83.91     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.95     
Median TPOT (ms):                        51.95     
P99 TPOT (ms):                           51.95     
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.95     
Median ITL (ms):                         51.75     
P99 ITL (ms):                            55.00     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  10.52     
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              0.95      
Output token throughput (tok/s):         121.64    
Peak output token throughput (tok/s):    170.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          1094.79   
---------------Time to First Token----------------
Mean TTFT (ms):                          1693.08   
Median TTFT (ms):                        1731.72   
P99 TTFT (ms):                           2623.26   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.42     
Median TPOT (ms):                        68.12     
P99 TPOT (ms):                           76.79     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.42     
Median ITL (ms):                         62.29     
P99 ITL (ms):                            515.58    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  37.87     
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              2.64      
Output token throughput (tok/s):         338.03    
Peak output token throughput (tok/s):    900.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          3042.23   
---------------Time to First Token----------------
Mean TTFT (ms):                          11606.98  
Median TTFT (ms):                        11121.38  
P99 TTFT (ms):                           24464.04  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          190.38    
Median TPOT (ms):                        193.39    
P99 TPOT (ms):                           260.85    
---------------Inter-token Latency----------------
Mean ITL (ms):                           190.38    
Median ITL (ms):                         111.25    
P99 ITL (ms):                            539.90    
==================================================
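
The throughput lines in each block follow from the token counts and the benchmark duration, which gives a quick sanity check when copying results around. This is a minimal sketch, assuming each metric sits on its own "Label: value" line as in the blocks above. The names parse_block and recomputed_throughput are hypothetical, and SAMPLE is an abridged copy of the single-request block for this model.

```python
import re

# Abridged copy of the single-request block for Qwen/Qwen3-VL-32B-Instruct-FP8.
SAMPLE = """\
Successful requests:                     1
Benchmark duration (s):                  6.68
Total input tokens:                      1024
Total generated tokens:                  128
Output token throughput (tok/s):         19.15
Total token throughput (tok/s):          172.39
"""


def parse_block(text):
    """Turn 'Label:   value' lines into a {label: float} mapping."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?):\s+([0-9.]+)\s*$", line)
        if match:  # separator lines such as '-----' or '=====' are skipped
            metrics[match.group(1)] = float(match.group(2))
    return metrics


def recomputed_throughput(metrics):
    """Return (output tok/s, total tok/s) derived from token counts and duration."""
    duration = metrics["Benchmark duration (s)"]
    generated = metrics["Total generated tokens"]
    total = metrics["Total input tokens"] + generated
    return generated / duration, total / duration


if __name__ == "__main__":
    parsed = parse_block(SAMPLE)
    output_tps, total_tps = recomputed_throughput(parsed)
    print(f"output: {output_tps:.2f} tok/s (reported {parsed['Output token throughput (tok/s)']})")
    print(f"total:  {total_tps:.2f} tok/s (reported {parsed['Total token throughput (tok/s)']})")
```

The recomputed values differ slightly from the reported ones because the printed duration is rounded to two decimals.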

Qwen/Qwen3-VL-235B-A22B-Instruct-FP8

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  5.76      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.17      
Output token throughput (tok/s):         22.23     
Peak output token throughput (tok/s):    23.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          200.04    
---------------Time to First Token----------------
Mean TTFT (ms):                          127.74    
Median TTFT (ms):                        127.74    
P99 TTFT (ms):                           127.74    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.34     
Median TPOT (ms):                        44.34     
P99 TPOT (ms):                           44.34     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.34     
Median ITL (ms):                         43.91     
P99 ITL (ms):                            47.46     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  24.28     
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              0.41      
Output token throughput (tok/s):         52.72     
Peak output token throughput (tok/s):    70.00     
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          474.52    
---------------Time to First Token----------------
Mean TTFT (ms):                          2665.94   
Median TTFT (ms):                        2717.83   
P99 TTFT (ms):                           4135.27   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          168.49    
Median TPOT (ms):                        168.25    
P99 TPOT (ms):                           180.70    
---------------Inter-token Latency----------------
Mean ITL (ms):                           168.49    
Median ITL (ms):                         161.96    
P99 ITL (ms):                            788.41    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  74.38     
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              1.34      
Output token throughput (tok/s):         172.08    
Peak output token throughput (tok/s):    400.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          1548.71   
---------------Time to First Token----------------
Mean TTFT (ms):                          17847.62  
Median TTFT (ms):                        17271.93  
P99 TTFT (ms):                           38251.50  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          401.05    
Median TPOT (ms):                        405.94    
P99 TPOT (ms):                           489.78    
---------------Inter-token Latency----------------
Mean ITL (ms):                           401.05    
Median ITL (ms):                         301.32    
P99 ITL (ms):                            846.77    
==================================================

QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  3.95      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         32.37     
Peak output token throughput (tok/s):    33.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          291.31    
---------------Time to First Token----------------
Mean TTFT (ms):                          88.34     
Median TTFT (ms):                        88.34     
P99 TTFT (ms):                           88.34     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          30.44     
Median TPOT (ms):                        30.44     
P99 TPOT (ms):                           30.44     
---------------Inter-token Latency----------------
Mean ITL (ms):                           30.44     
Median ITL (ms):                         30.26     
P99 ITL (ms):                            32.65     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  15.86     
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              0.63      
Output token throughput (tok/s):         80.71     
Peak output token throughput (tok/s):    110.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          726.39    
---------------Time to First Token----------------
Mean TTFT (ms):                          2363.01   
Median TTFT (ms):                        2390.65   
P99 TTFT (ms):                           3841.38   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          105.40    
Median TPOT (ms):                        105.28    
P99 TPOT (ms):                           116.69    
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.40    
Median ITL (ms):                         96.24     
P99 ITL (ms):                            660.66    
==================================================
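
The AWQ and FP8 runs of Qwen3-VL-235B-A22B can be compared directly at 1 and 10 prompts. This is a minimal sketch using the mean TTFT and TPOT values copied from the two sets of result blocks above; the names FP8, AWQ and reduction_pct are hypothetical. No 100-prompt AWQ run is recorded in these notes, so that row is absent.

```python
# Mean TTFT and TPOT (ms) copied from the Qwen3-VL-235B-A22B result blocks,
# keyed by number of prompts. AWQ has no 100-prompt run in these notes.
FP8 = {1: {"ttft": 127.74, "tpot": 44.34}, 10: {"ttft": 2665.94, "tpot": 168.49}}
AWQ = {1: {"ttft": 88.34, "tpot": 30.44}, 10: {"ttft": 2363.01, "tpot": 105.40}}


def reduction_pct(baseline, candidate):
    """Percentage by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline * 100.0


if __name__ == "__main__":
    for prompts in sorted(FP8):
        for metric in ("ttft", "tpot"):
            pct = reduction_pct(FP8[prompts][metric], AWQ[prompts][metric])
            print(f"{prompts:>3} prompts  {metric.upper()}: {pct:.1f}% lower with AWQ")
```

On these figures, mean TPOT is roughly 31% lower with AWQ for a single request and roughly 37% lower at 10 prompts.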

GPT-OSS-20B

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  1.50      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.67      
Output token throughput (tok/s):         85.25     
Peak output token throughput (tok/s):    83.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          767.26    
---------------Time to First Token----------------
Mean TTFT (ms):                          41.07     
Median TTFT (ms):                        41.07     
P99 TTFT (ms):                           41.07     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.50     
Median TPOT (ms):                        11.50     
P99 TPOT (ms):                           11.50     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.50     
Median ITL (ms):                         10.66     
P99 ITL (ms):                            20.37     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  3.70      
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              2.71      
Output token throughput (tok/s):         346.25    
Peak output token throughput (tok/s):    460.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          3116.22   
---------------Time to First Token----------------
Mean TTFT (ms):                          643.37    
Median TTFT (ms):                        613.75    
P99 TTFT (ms):                           1084.12   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.79     
Median TPOT (ms):                        24.05     
P99 TPOT (ms):                           27.63     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.79     
Median ITL (ms):                         19.11     
P99 ITL (ms):                            179.76    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  12.81     
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              7.80      
Output token throughput (tok/s):         999.02    
Peak output token throughput (tok/s):    2800.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          8991.14   
---------------Time to First Token----------------
Mean TTFT (ms):                          3979.85   
Median TTFT (ms):                        3904.99   
P99 TTFT (ms):                           8486.11   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          64.12     
Median TPOT (ms):                        65.14     
P99 TPOT (ms):                           87.99     
---------------Inter-token Latency----------------
Mean ITL (ms):                           64.12     
Median ITL (ms):                         35.34     
P99 ITL (ms):                            187.69    
==================================================

GPT-OSS-120B

8 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  2.07      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.48      
Output token throughput (tok/s):         61.86     
Peak output token throughput (tok/s):    64.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          556.70    
---------------Time to First Token----------------
Mean TTFT (ms):                          47.08     
Median TTFT (ms):                        47.08     
P99 TTFT (ms):                           47.08     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.92     
Median TPOT (ms):                        15.92     
P99 TPOT (ms):                           15.92     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.92     
Median ITL (ms):                         14.10     
P99 ITL (ms):                            24.64     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  5.01      
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              2.00      
Output token throughput (tok/s):         255.50    
Peak output token throughput (tok/s):    360.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          2299.53   
---------------Time to First Token----------------
Mean TTFT (ms):                          877.16    
Median TTFT (ms):                        928.99    
P99 TTFT (ms):                           1400.55   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.17     
Median TPOT (ms):                        31.83     
P99 TPOT (ms):                           37.86     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.17     
Median ITL (ms):                         25.62     
P99 ITL (ms):                            214.43    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  21.57     
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              4.64      
Output token throughput (tok/s):         593.55    
Peak output token throughput (tok/s):    1600.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          5341.94   
---------------Time to First Token----------------
Mean TTFT (ms):                          6357.06   
Median TTFT (ms):                        6132.10   
P99 TTFT (ms):                           13638.07  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          111.45    
Median TPOT (ms):                        114.05    
P99 TPOT (ms):                           149.78    
---------------Inter-token Latency----------------
Mean ITL (ms):                           111.45    
Median ITL (ms):                         63.22     
P99 ITL (ms):                            306.93    
==================================================

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  1.79      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.56      
Output token throughput (tok/s):         71.54     
Peak output token throughput (tok/s):    71.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          643.89    
---------------Time to First Token----------------
Mean TTFT (ms):                          37.10     
Median TTFT (ms):                        37.10     
P99 TTFT (ms):                           37.10     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.79     
Median TPOT (ms):                        13.79     
P99 TPOT (ms):                           13.79     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.79     
Median ITL (ms):                         13.73     
P99 ITL (ms):                            15.01     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  5.21      
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              1.92      
Output token throughput (tok/s):         245.57    
Peak output token throughput (tok/s):    340.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          2210.11   
---------------Time to First Token----------------
Mean TTFT (ms):                          908.51    
Median TTFT (ms):                        962.65    
P99 TTFT (ms):                           1462.06   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.48     
Median TPOT (ms):                        33.10     
P99 TPOT (ms):                           38.47     
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.48     
Median ITL (ms):                         29.98     
P99 ITL (ms):                            298.82    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  24.67     
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              4.05      
Output token throughput (tok/s):         518.78    
Peak output token throughput (tok/s):    1300.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4669.03   
---------------Time to First Token----------------
Mean TTFT (ms):                          6719.06   
Median TTFT (ms):                        6547.16   
P99 TTFT (ms):                           14641.29  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          130.26    
Median TPOT (ms):                        132.70    
P99 TPOT (ms):                           167.78    
---------------Inter-token Latency----------------
Mean ITL (ms):                           130.26    
Median ITL (ms):                         84.86     
P99 ITL (ms):                            424.07    
==================================================
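
The two GPT-OSS-120B configurations can be compared side by side. This is a minimal sketch that takes the node labels above at face value and uses the total token throughput and mean TPOT copied from the result blocks; the names EIGHT_NODE, FOUR_NODE and throughput_ratio are hypothetical.

```python
# Figures copied from the GPT-OSS-120B result blocks, keyed by prompt count:
# (total token throughput in tok/s, mean TPOT in ms).
EIGHT_NODE = {1: (556.70, 15.92), 10: (2299.53, 32.17), 100: (5341.94, 111.45)}
FOUR_NODE = {1: (643.89, 13.79), 10: (2210.11, 33.48), 100: (4669.03, 130.26)}


def throughput_ratio(prompts):
    """8-node total throughput divided by 4-node total throughput."""
    return EIGHT_NODE[prompts][0] / FOUR_NODE[prompts][0]


if __name__ == "__main__":
    for prompts in sorted(EIGHT_NODE):
        ratio = throughput_ratio(prompts)
        tpot_8, tpot_4 = EIGHT_NODE[prompts][1], FOUR_NODE[prompts][1]
        print(f"{prompts:>3} prompts: 8-node/4-node throughput {ratio:.2f}x, "
              f"TPOT {tpot_8:.2f} ms vs {tpot_4:.2f} ms")
```

On these numbers the 8-node run is slower for a single request and only about 14% faster in total throughput at 100 prompts, so doubling the node count is far from doubling throughput here.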

zai-org/GLM-4.6-FP8

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  8.40      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.12      
Output token throughput (tok/s):         15.23     
Peak output token throughput (tok/s):    16.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          137.08    
---------------Time to First Token----------------
Mean TTFT (ms):                          224.50    
Median TTFT (ms):                        224.50    
P99 TTFT (ms):                           224.50    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          64.40     
Median TPOT (ms):                        64.40     
P99 TPOT (ms):                           64.40     
---------------Inter-token Latency----------------
Mean ITL (ms):                           64.40     
Median ITL (ms):                         64.32     
P99 ITL (ms):                            66.00     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  40.67     
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         31.48     
Peak output token throughput (tok/s):    40.00     
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          283.28    
---------------Time to First Token----------------
Mean TTFT (ms):                          5797.18   
Median TTFT (ms):                        5759.51   
P99 TTFT (ms):                           8694.79   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          272.29    
Median TPOT (ms):                        272.72    
P99 TPOT (ms):                           301.66    
---------------Inter-token Latency----------------
Mean ITL (ms):                           272.29    
Median ITL (ms):                         257.01    
P99 ITL (ms):                            1718.29   
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  142.28    
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              0.70      
Output token throughput (tok/s):         89.96     
Peak output token throughput (tok/s):    200.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          809.67    
---------------Time to First Token----------------
Mean TTFT (ms):                          38421.92  
Median TTFT (ms):                        36869.57  
P99 TTFT (ms):                           81050.88  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          749.07    
Median TPOT (ms):                        761.59    
P99 TPOT (ms):                           962.73    
---------------Inter-token Latency----------------
Mean ITL (ms):                           749.07    
Median ITL (ms):                         514.18    
P99 ITL (ms):                            1774.77   
==================================================

nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  4.58      
Total input tokens:                      1023      
Total generated tokens:                  128       
Request throughput (req/s):              0.22      
Output token throughput (tok/s):         27.93     
Peak output token throughput (tok/s):    30.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          251.18    
---------------Time to First Token----------------
Mean TTFT (ms):                          382.88    
Median TTFT (ms):                        382.88    
P99 TTFT (ms):                           382.88    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.07     
Median TPOT (ms):                        33.07     
P99 TPOT (ms):                           33.07     
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.07     
Median ITL (ms):                         30.99     
P99 ITL (ms):                            43.77     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  11.95     
Total input tokens:                      10230     
Total generated tokens:                  1280      
Request throughput (req/s):              0.84      
Output token throughput (tok/s):         107.10    
Peak output token throughput (tok/s):    150.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          963.11    
---------------Time to First Token----------------
Mean TTFT (ms):                          1712.12   
Median TTFT (ms):                        1843.62   
P99 TTFT (ms):                           2657.49   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.48     
Median TPOT (ms):                        74.81     
P99 TPOT (ms):                           84.39     
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.48     
Median ITL (ms):                         67.11     
P99 ITL (ms):                            394.16    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  41.58     
Total input tokens:                      102300    
Total generated tokens:                  12800     
Request throughput (req/s):              2.40      
Output token throughput (tok/s):         307.83    
Peak output token throughput (tok/s):    800.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          2768.06   
---------------Time to First Token----------------
Mean TTFT (ms):                          10097.28  
Median TTFT (ms):                        9424.95   
P99 TTFT (ms):                           22795.64  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          222.56    
Median TPOT (ms):                        227.89    
P99 TPOT (ms):                           269.39    
---------------Inter-token Latency----------------
Mean ITL (ms):                           222.56    
Median ITL (ms):                         140.08    
P99 ITL (ms):                            618.16    
==================================================

nvidia/Llama-3.3-70B-Instruct-NVFP4

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  10.03     
Total input tokens:                      1023      
Total generated tokens:                  128       
Request throughput (req/s):              0.10      
Output token throughput (tok/s):         12.77     
Peak output token throughput (tok/s):    14.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          114.79    
---------------Time to First Token----------------
Mean TTFT (ms):                          312.52    
Median TTFT (ms):                        312.52    
P99 TTFT (ms):                           312.52    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.49     
Median TPOT (ms):                        76.49     
P99 TPOT (ms):                           76.49     
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.49     
Median ITL (ms):                         73.79     
P99 ITL (ms):                            87.22     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  21.36     
Total input tokens:                      10230     
Total generated tokens:                  1280      
Request throughput (req/s):              0.47      
Output token throughput (tok/s):         59.93     
Peak output token throughput (tok/s):    90.00     
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          538.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          3472.04   
Median TTFT (ms):                        3544.04   
P99 TTFT (ms):                           5023.57   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          138.56    
Median TPOT (ms):                        138.05    
P99 TPOT (ms):                           151.98    
---------------Inter-token Latency----------------
Mean ITL (ms):                           138.56    
Median ITL (ms):                         125.58    
P99 ITL (ms):                            839.30    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  69.30     
Total input tokens:                      102300    
Total generated tokens:                  12800     
Request throughput (req/s):              1.44      
Output token throughput (tok/s):         184.69    
Peak output token throughput (tok/s):    600.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          1660.78   
---------------Time to First Token----------------
Mean TTFT (ms):                          23961.51  
Median TTFT (ms):                        23834.79  
P99 TTFT (ms):                           47879.51  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          331.80    
Median TPOT (ms):                        333.44    
P99 TPOT (ms):                           481.50    
---------------Inter-token Latency----------------
Mean ITL (ms):                           331.80    
Median ITL (ms):                         182.62    
P99 ITL (ms):                            1731.80   
==================================================

Summary of Results

  • GPT-OSS-20B is the throughput champion. It delivers the highest raw performance across all concurrency levels, hitting ~9k tok/s total at 100 concurrent requests with remarkably low TPOT (64ms). Single-request latency is excellent (TTFT 41ms), and it scales gracefully under load.

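  • GPT-OSS-120B offers the best balance for a large model. In the 4-node run it has the snappiest single-request behavior of any model tested (TTFT 37ms) and maintains reasonable latency at scale, reaching ~4.7k tok/s total at 100 concurrent with TPOT held to 130ms, well below the other big models. The 8-node run is slightly slower for a single request (TTFT 47ms, TPOT 16ms vs 14ms) but reaches ~5.3k tok/s total and a TPOT of 111ms at 100 concurrent.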

  • Qwen3-VL-32B-FP8 is solid for moderate workloads. Single-request latency is acceptable (TTFT 84ms), and it reaches ~3k tok/s total at 100 concurrent. However, TTFT climbs significantly under load (≈11.6s at 100 reqs), making it feel sluggish for interactive use at high concurrency.

  • Llama-4-Scout-17B-16E-NVFP4 performs similarly to Qwen3-VL-32B under load. Scaling behavior is comparable (TTFT ≈10s at 100 reqs, ~2.8k tok/s total), though single-request TTFT is higher (383ms). MoE routing overhead is one possible cause, but these runs do not isolate it.

  • Qwen3-VL-235B-A22B-AWQ improves significantly over the FP8 variant at low concurrency. Single-request TPOT drops from 44ms to 30ms, and TTFT from 128ms to 88ms. At 10 concurrent, it’s still faster (TPOT 105ms vs 168ms), making AWQ worthwhile for latency-sensitive deployments of this model.

  • Qwen3-VL-235B-A22B-FP8 is strongly latency-bound. Acceptable at single requests, but TTFT explodes with concurrency (≈17.8s at 100 reqs) and TPOT becomes very high (401ms). Throughput caps around ~1.5k tok/s total.

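  • Llama-3.3-70B-NVFP4 is slow for its size. Despite having fewer parameters than GPT-OSS-120B, it’s slower across the board: higher TTFT, worse TPOT, and lower throughput (~1.7k tok/s total at 100 concurrent). FP4 quantization overhead may contribute, but these runs do not isolate the cause. Llama 3.3 70B is also a dense model, whereas GPT-OSS-120B is a mixture-of-experts model that activates only a fraction of its parameters per token.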

  • GLM-4.6-FP8 degrades the hardest under load. TTFT becomes extreme (≈38s at 100 reqs) and TPOT balloons to 749ms. Not suitable for interactive or high-concurrency serving.
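
The 100-prompt figures quoted in the bullets above can be lined up in one table. This is a minimal sketch with values copied from the 100-prompt result blocks; the names RUNS_100 and rank_by_throughput are hypothetical. QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ is omitted because no 100-prompt run is recorded for it.

```python
# (label, total tok/s, mean TTFT in ms, mean TPOT in ms) from the 100-prompt blocks.
RUNS_100 = [
    ("Qwen/Qwen3-VL-32B-Instruct-FP8", 3042.23, 11606.98, 190.38),
    ("Qwen/Qwen3-VL-235B-A22B-Instruct-FP8", 1548.71, 17847.62, 401.05),
    ("GPT-OSS-20B", 8991.14, 3979.85, 64.12),
    ("GPT-OSS-120B (8 node)", 5341.94, 6357.06, 111.45),
    ("GPT-OSS-120B (4 node)", 4669.03, 6719.06, 130.26),
    ("zai-org/GLM-4.6-FP8", 809.67, 38421.92, 749.07),
    ("nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4", 2768.06, 10097.28, 222.56),
    ("nvidia/Llama-3.3-70B-Instruct-NVFP4", 1660.78, 23961.51, 331.80),
]


def rank_by_throughput(runs):
    """Sort runs from highest to lowest total token throughput."""
    return sorted(runs, key=lambda run: run[1], reverse=True)


if __name__ == "__main__":
    print(f"{'model':<46}{'tok/s':>10}{'TTFT (s)':>10}{'TPOT (ms)':>11}")
    for label, tput, ttft_ms, tpot_ms in rank_by_throughput(RUNS_100):
        print(f"{label:<46}{tput:>10.0f}{ttft_ms / 1000:>10.1f}{tpot_ms:>11.0f}")
```

Sorting by total throughput puts GPT-OSS-20B first and GLM-4.6-FP8 last, matching the ordering described in the summary.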