6x Spark setup

I have 4x clustered currently. Anything anyone is curious to try or see? I’ll be expanding it by 2 more tomorrow.


Pictures. Post some pictures please!

A bit messy currently. Not sure where 5 and 6 will be going quite yet.


Would be good to see if NanoChat could be trained within 24 hours. And if you would cross 100k tokens processed per second during training.

Recipe for two nodes: Train nanochat on 2 NVIDIA DGX Sparks.md · GitHub

Precisely one of the experiments I want to check out. Including doing a from scratch pre-train etc.

They’re rather good training devices.


Would be good to see the stats of the Mikrotik switch during the session as well. Your test results will be very important for some of the decisions I need to take in the near term about my home lab expansion.


I’m really interested to see how well vllm scales tensor parallel beyond 2 sparks.
What switch are you using?

Can you run inference on a few models with tensor-parallel=8?

If possible, these ones, so I could compare to my dual setup:

  • Qwen/Qwen3-VL-32B-Instruct-FP8
  • GPT-OSS-120B
  • QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ
  • Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 (I can’t run this one though)
  • zai-org/GLM-4.6-FP8 (I can’t run this either, only AWQ 4-bit quant)
  • any dense model > 100B (like the new Devstral 2)
  • And, of course, Deepseek 3.2 :)

I believe it’s the MikroTik CRS812 DDQ


Correct. Using two 400→200 Gbps splitters plus two of the 200 Gbps slots, for a total of 6 machines connected via the IB fabric. The bandwidth tests hit more or less the same numbers as direct connect.

Should be TP 6, since there are 6 boxes in this case, yes? Do you have preferred benchmarks or a vLLM image to use? I build my own nightlies, so I’m not sure how it’ll align with expectations.

Yes, you are right, for 6 it would be 6 tp. I don’t know why I read 6 as 8 units :)
I’m using nightly builds too with Triton, Torch and Flashinfer from main branch and cu130 wheels, so no problem there.

For benchmarks, just use vllm bench serve like this:

vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-VL-32B-Instruct-FP8 \
  --host spark \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1 \
  --port 8888

For num prompts 1, 10 and 100 (in this order).
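
If it helps, the sweep above can be scripted. This is only a sketch: the flags are copied from the command in this post, while the helper name `build_bench_cmd` and its defaults are my own placeholders. It builds the three command lines and prints them without launching anything.

```python
import shlex


def build_bench_cmd(num_prompts, model="Qwen/Qwen3-VL-32B-Instruct-FP8",
                    host="spark", port=8888):
    """Return the argv list for one `vllm bench serve` run (hypothetical helper)."""
    return [
        "vllm", "bench", "serve",
        "--backend", "vllm",
        "--model", model,
        "--host", host,
        "--endpoint", "/v1/completions",
        "--dataset-name", "sharegpt",
        "--dataset-path", "ShareGPT_V3_unfiltered_cleaned_split.json",
        "--num-prompts", str(num_prompts),
        "--port", str(port),
    ]


# Sweep in the requested order: 1, then 10, then 100 prompts.
commands = [build_bench_cmd(n) for n in (1, 10, 100)]

for cmd in commands:
    # Print a copy-pasteable shell line for each run.
    print(shlex.join(cmd))
    # To actually execute it: subprocess.run(cmd, check=True)
```

Swap `model`, `host` and `port` for whatever your server uses; running the commands for real needs `subprocess` and a live vLLM endpoint.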


Only pp works on most models. Guess I need 8 :)

Are you going to combine 50Gbps ports with cable like this 1m (3ft) 200G QSFP56 to 4X50G SFP56 Passive DAC Breakout Cable for NVIDIA/Mellanox 30AWG - NADDOD or replace the switch?

Oh, I forgot about this quirk. Can you at least run on 4 Sparks, with -tp 4? :)


I am probably going to add another switch and bridge them.

edit: also to note, it’s much easier to split the DACs than to combine them. If you have an idea for a switch that would let me easily add 2 more nodes (without having to bridge or costing a billion dollars), I’m all ears.

here is what some traffic looks like during the nanochat training session, the TX is maxing at 24.6 Gbps, but that is somewhat to be expected since the traffic is quite bursty:

sorry for the late updates, but having to do work things with them first :)

that said, here is the output from a slightly modified nanochat run for the interested:

6 nodes, batch size 21

root@buttercup:/workspace# torchrun --nnodes=6 --nproc_per_node=1 --node_rank=0 --master_addr=$MASTER_ADDR --master_port=29500 -m scripts.base_train -- --max_seq_len=2048 --device_batch_size=21 --total_batch_size=516096

                                                   █████                █████
                                                  ░░███                ░░███
 ████████    ██████   ████████    ██████   ██████  ░███████    ██████  ███████
░░███░░███  ░░░░░███ ░░███░░███  ███░░███ ███░░███ ░███░░███  ░░░░░███░░░███░
 ░███ ░███   ███████  ░███ ░███ ░███ ░███░███ ░░░  ░███ ░███   ███████  ░███
 ░███ ░███  ███░░███  ░███ ░███ ░███ ░███░███  ███ ░███ ░███  ███░░███  ░███ ███
 ████ █████░░████████ ████ █████░░██████ ░░██████  ████ █████░░███████  ░░█████
░░░░ ░░░░░  ░░░░░░░░ ░░░░ ░░░░░  ░░░░░░   ░░░░░░  ░░░░ ░░░░░  ░░░░░░░░   ░░░░░

Overriding: max_seq_len = 2048
Overriding: device_batch_size = 21
Overriding: total_batch_size = 516096
Autodetected device type: cuda
/usr/local/lib/python3.12/dist-packages/torch/__init__.py:1614: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see 
 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/Context.cpp:45.)
_C._set_float32_matmul_precision(precision)
2025-12-18 01:59:35,110 - nanochat.common - INFO - Distributed world size: 6
Vocab size: 65,536
num_layers: 20
model_dim: 1280
num_heads: 10
num_kv_heads: 10
Tokens / micro-batch / rank: 21 x 2048 = 43,008
Tokens / micro-batch: 258,048
Total batch size 516,096 => gradient accumulation steps: 2
Number of parameters: 560,988,160
Estimated FLOPs per token: 3.491758e+09
Calculated number of iterations from target data:param ratio: 21,739
Total number of training tokens: 11,219,410,944
Tokens : Params ratio: 20.00
Total training FLOPs estimate: 3.917547e+19
Scaling the LR for the AdamW parameters ∝1/√(1280/768) = 0.774597
AdamW optimizer: torch.optim.AdamW (world_size=6)
Muon: Grouping 80 params of shape torch.Size([1280, 1280]), device cuda:0, dtype torch.float32
Muon: Grouping 20 params of shape torch.Size([1280, 5120]), device cuda:0, dtype torch.float32
Muon: Grouping 20 params of shape torch.Size([5120, 1280]), device cuda:0, dtype torch.float32
Step 00000 | Validation bpb: 3.3005
step 00000/21739 (0.00%) | loss: 11.090356 | grad norm: 0.4254 | lrm: 1.00 | dt: 11600.72ms | tok/sec: 44,488 | mfu: 2.62 | total time: 0.00m
step 00001/21739 (0.00%) | loss: 10.856434 | grad norm: 12.2922 | lrm: 1.00 | dt: 5202.11ms | tok/sec: 99,208 | mfu: 5.84 | total time: 0.00m
step 00002/21739 (0.01%) | loss: 10.255845 | grad norm: 4.9579 | lrm: 1.00 | dt: 5238.49ms | tok/sec: 98,519 | mfu: 5.80 | total time: 0.00m
step 00003/21739 (0.01%) | loss: 9.647144 | grad norm: 4.5589 | lrm: 1.00 | dt: 5221.48ms | tok/sec: 98,840 | mfu: 5.82 | total time: 0.00m
step 00004/21739 (0.02%) | loss: 9.104345 | grad norm: 6.5869 | lrm: 1.00 | dt: 5257.93ms | tok/sec: 98,155 | mfu: 5.78 | total time: 0.00m
step 00005/21739 (0.02%) | loss: 8.712114 | grad norm: 5.1133 | lrm: 1.00 | dt: 5219.17ms | tok/sec: 98,884 | mfu: 5.82 | total time: 0.00m
step 00006/21739 (0.03%) | loss: 8.421516 | grad norm: 5.5131 | lrm: 1.00 | dt: 5254.20ms | tok/sec: 98,225 | mfu: 5.78 | total time: 0.00m
step 00007/21739 (0.03%) | loss: 8.210005 | grad norm: 5.9095 | lrm: 1.00 | dt: 5228.70ms | tok/sec: 98,704 | mfu: 5.81 | total time: 0.00m
step 00008/21739 (0.04%) | loss: 8.025025 | grad norm: 6.3944 | lrm: 1.00 | dt: 5294.74ms | tok/sec: 97,473 | mfu: 5.74 | total time: 0.00m
step 00009/21739 (0.04%) | loss: 7.857136 | grad norm: 1.9207 | lrm: 1.00 | dt: 5203.78ms | tok/sec: 99,177 | mfu: 5.84 | total time: 0.00m
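
For anyone who wants to sanity-check the header of that log, the batch arithmetic can be reproduced from the values it prints. A small sketch: all inputs are copied from the output above, the variable names are mine, and the iteration formula (target tokens divided by total batch size, rounded down) is an assumption that happens to match the logged 21,739.

```python
# Inputs copied from the log above.
seq_len = 2048
device_batch_size = 21
world_size = 6
total_batch_size = 516_096
num_params = 560_988_160
target_ratio = 20  # "Tokens : Params ratio" target

# Tokens processed per forward/backward pass.
tokens_per_rank = device_batch_size * seq_len           # per micro-batch, per rank
tokens_per_micro_batch = tokens_per_rank * world_size   # across all 6 nodes

# Micro-batches accumulated before each optimizer step.
grad_accum_steps = total_batch_size // tokens_per_micro_batch

# Assumed formula: floor(target tokens / tokens per step).
num_iterations = (target_ratio * num_params) // total_batch_size
total_tokens = num_iterations * total_batch_size


def tok_per_sec(dt_ms):
    """Throughput implied by one optimizer step taking dt_ms milliseconds."""
    return total_batch_size / (dt_ms / 1000.0)


print(f"tokens / micro-batch / rank: {tokens_per_rank:,}")
print(f"tokens / micro-batch:        {tokens_per_micro_batch:,}")
print(f"grad accumulation steps:     {grad_accum_steps}")
print(f"iterations:                  {num_iterations:,}")
print(f"total training tokens:       {total_tokens:,}")
print(f"tok/sec at dt=5202.11ms:     {tok_per_sec(5202.11):,.0f}")
```

The takeaway is that tok/sec is just the total batch divided by the step time, so a larger device batch only helps if the step time grows less than proportionally.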

6 nodes, batch size 32.

step 00000/14266 (0.00%) | loss: 11.090355 | grad norm: 0.4273 | lrm: 1.00 | dt: 26923.54ms | tok/sec: 29,209 | mfu: 1.72 | total time: 0.00m
step 00001/14266 (0.01%) | loss: 10.832258 | grad norm: 12.1549 | lrm: 1.00 | dt: 7812.14ms | tok/sec: 100,667 | mfu: 5.92 | total time: 0.00m
step 00002/14266 (0.01%) | loss: 10.247421 | grad norm: 5.1460 | lrm: 1.00 | dt: 7806.38ms | tok/sec: 100,742 | mfu: 5.93 | total time: 0.00m
step 00003/14266 (0.02%) | loss: 9.577996 | grad norm: 4.4216 | lrm: 1.00 | dt: 7871.57ms | tok/sec: 99,907 | mfu: 5.88 | total time: 0.00m
step 00004/14266 (0.03%) | loss: 9.017444 | grad norm: 5.9643 | lrm: 1.00 | dt: 7838.93ms | tok/sec: 100,323 | mfu: 5.90 | total time: 0.00m
step 00005/14266 (0.04%) | loss: 8.618053 | grad norm: 5.0371 | lrm: 1.00 | dt: 7839.13ms | tok/sec: 100,321 | mfu: 5.90 | total time: 0.00m
step 00006/14266 (0.04%) | loss: 8.354537 | grad norm: 4.9139 | lrm: 1.00 | dt: 7892.11ms | tok/sec: 99,647 | mfu: 5.86 | total time: 0.00m
step 00007/14266 (0.05%) | loss: 8.156961 | grad norm: 5.1475 | lrm: 1.00 | dt: 7870.68ms | tok/sec: 99,919 | mfu: 5.88 | total time: 0.00m
step 00008/14266 (0.06%) | loss: 7.977603 | grad norm: 4.9869 | lrm: 1.00 | dt: 7929.55ms | tok/sec: 99,177 | mfu: 5.84 | total time: 0.00m
step 00009/14266 (0.06%) | loss: 7.811369 | grad norm: 2.6385 | lrm: 1.00 | dt: 7891.02ms | tok/sec: 99,661 | mfu: 5.86 | total time: 0.00m
step 00010/14266 (0.07%) | loss: 7.678523 | grad norm: 2.2132 | lrm: 1.00 | dt: 7895.18ms | tok/sec: 99,609 | mfu: 5.86 | total time: 0.00m
step 00011/14266 (0.08%) | loss: 7.566596 | grad norm: 1.9411 | lrm: 1.00 | dt: 7973.75ms | tok/sec: 98,627 | mfu: 5.80 | total time: 0.13m
step 00012/14266 (0.08%) | loss: 7.478057 | grad norm: 3.3588 | lrm: 1.00 | dt: 7997.09ms | tok/sec: 98,339 | mfu: 5.79 | total time: 0.27m

6 nodes, batch size 40:

Step 00000 | Validation bpb: 3.3005
step 00000/11413 (0.00%) | loss: 11.090355 | grad norm: 0.4369 | lrm: 1.00 | dt: 31846.23ms | tok/sec: 30,868 | mfu: 1.82 | total time: 0.00m
step 00001/11413 (0.01%) | loss: 10.837876 | grad norm: 11.9723 | lrm: 1.00 | dt: 10966.73ms | tok/sec: 89,638 | mfu: 5.27 | total time: 0.00m
step 00002/11413 (0.02%) | loss: 10.234440 | grad norm: 5.1827 | lrm: 1.00 | dt: 10978.53ms | tok/sec: 89,542 | mfu: 5.27 | total time: 0.00m
step 00003/11413 (0.03%) | loss: 9.564955 | grad norm: 4.5415 | lrm: 1.00 | dt: 10987.01ms | tok/sec: 89,472 | mfu: 5.26 | total time: 0.00m
step 00004/11413 (0.04%) | loss: 8.999896 | grad norm: 5.9772 | lrm: 1.00 | dt: 11036.76ms | tok/sec: 89,069 | mfu: 5.24 | total time: 0.00m
step 00005/11413 (0.04%) | loss: 8.636850 | grad norm: 5.7050 | lrm: 1.00 | dt: 10912.49ms | tok/sec: 90,083 | mfu: 5.30 | total time: 0.00m
step 00006/11413 (0.05%) | loss: 8.344168 | grad norm: 3.5539 | lrm: 1.00 | dt: 10990.30ms | tok/sec: 89,446 | mfu: 5.26 | total time: 0.00m
step 00007/11413 (0.06%) | loss: 8.109464 | grad norm: 4.5131 | lrm: 1.00 | dt: 10978.89ms | tok/sec: 89,539 | mfu: 5.27 | total time: 0.00m
step 00008/11413 (0.07%) | loss: 7.942097 | grad norm: 6.6981 | lrm: 1.00 | dt: 11064.19ms | tok/sec: 88,848 | mfu: 5.23 | total time: 0.00m
step 00009/11413 (0.08%) | loss: 7.762931 | grad norm: 1.8954 | lrm: 1.00 | dt: 10984.50ms | tok/sec: 89,493 | mfu: 5.27 | total time: 0.00m
step 00010/11413 (0.09%) | loss: 7.614245 | grad norm: 2.3076 | lrm: 1.00 | dt: 10977.75ms | tok/sec: 89,548 | mfu: 5.27 | total time: 0.00m
step 00011/11413 (0.10%) | loss: 7.505912 | grad norm: 1.7908 | lrm: 1.00 | dt: 11068.57ms | tok/sec: 88,813 | mfu: 5.23 | total time: 0.18m
step 00012/11413 (0.11%) | loss: 7.403605 | grad norm: 3.6414 | lrm: 1.00 | dt: 11041.89ms | tok/sec: 89,028 | mfu: 5.24 | total time: 0.37m
step 00013/11413 (0.11%) | loss: 7.300447 | grad norm: 1.6861 | lrm: 1.00 | dt: 11145.53ms | tok/sec: 88,200 | mfu: 5.19 | total time: 0.55m
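
A note on reading the mfu column in these three runs. The logged values are reproduced almost exactly if the denominator is 989e12 FLOP/s per rank, which is the H100 dense BF16 peak figure. That is an inference from the numbers in the logs, not something the output states, so treat the constant below as an assumption. If it holds, the percentages are relative to an H100, not to the Spark's own peak.

```python
FLOPS_PER_TOKEN = 3.491758e9    # "Estimated FLOPs per token" from the log
WORLD_SIZE = 6                  # "Distributed world size" from the log
ASSUMED_PEAK_PER_RANK = 989e12  # ASSUMPTION: H100 dense BF16 peak, FLOP/s


def mfu_percent(tok_per_sec):
    """MFU in percent, relative to the assumed per-rank peak."""
    achieved_flops = FLOPS_PER_TOKEN * tok_per_sec
    return 100.0 * achieved_flops / (ASSUMED_PEAK_PER_RANK * WORLD_SIZE)


# (label, tok/sec at step 00001, mfu logged on that same line)
samples = [
    ("batch 21", 99_208, 5.84),
    ("batch 32", 100_667, 5.92),
    ("batch 40", 89_638, 5.27),
]

for label, tps, logged in samples:
    print(f"{label}: computed {mfu_percent(tps):.2f}% vs logged {logged:.2f}%")
```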

edit: added a nice little nvitop view. Training pulls more power than you really see anywhere else, and the same goes for temps. Will need to do something about that tomorrow.


looks like we get just under 100k tps
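
On the 24-hour question from earlier in the thread, a rough projection from the three logs is possible. This sketch multiplies the logged step count by the step time at step 00009 of each run. It assumes throughput stays flat and ignores evaluation and checkpoint overhead, so it is a lower-bound estimate and not a measured result.

```python
# device batch size -> (total steps, dt in ms at step 00009), copied from the logs
runs = {
    21: (21_739, 5203.78),
    32: (14_266, 7891.02),
    40: (11_413, 10984.50),
}


def projected_hours(steps, dt_ms):
    """Wall-clock hours if every step took dt_ms (no eval/checkpoint overhead)."""
    return steps * dt_ms / 1000.0 / 3600.0


estimates = {bs: projected_hours(steps, dt) for bs, (steps, dt) in runs.items()}

for bs, hours in estimates.items():
    print(f"device batch size {bs}: ~{hours:.1f} h")
```

On these numbers all three configurations project to somewhere above 30 hours, so a 24-hour run would need either more nodes or a shorter token budget.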


Command templates:

docker exec -it vllm_node bash -i -c "vllm serve M --host 0.0.0.0 --trust_remote_code --gpu-memory-utilization 0.8 -pp 1 -tp X --distributed-executor-backend ray --load-format fastsafetensors --kv-cache-dtype fp8"
vllm bench serve --backend vllm --model M --host 10.20.0.4 --endpoint /v1/completions --hf-name sharegpt --num-prompts X --port 8000

Qwen/Qwen3-VL-32B-Instruct-FP8

4 nodes (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  6.68      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.15      
Output token throughput (tok/s):         19.15     
Peak output token throughput (tok/s):    20.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          172.39    
---------------Time to First Token----------------
Mean TTFT (ms):                          83.91     
Median TTFT (ms):                        83.91     
P99 TTFT (ms):                           83.91     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.95     
Median TPOT (ms):                        51.95     
P99 TPOT (ms):                           51.95     
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.95     
Median ITL (ms):                         51.75     
P99 ITL (ms):                            55.00     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  10.52     
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              0.95      
Output token throughput (tok/s):         121.64    
Peak output token throughput (tok/s):    170.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          1094.79   
---------------Time to First Token----------------
Mean TTFT (ms):                          1693.08   
Median TTFT (ms):                        1731.72   
P99 TTFT (ms):                           2623.26   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.42     
Median TPOT (ms):                        68.12     
P99 TPOT (ms):                           76.79     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.42     
Median ITL (ms):                         62.29     
P99 ITL (ms):                            515.58    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  37.87     
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              2.64      
Output token throughput (tok/s):         338.03    
Peak output token throughput (tok/s):    900.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          3042.23   
---------------Time to First Token----------------
Mean TTFT (ms):                          11606.98  
Median TTFT (ms):                        11121.38  
P99 TTFT (ms):                           24464.04  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          190.38    
Median TPOT (ms):                        193.39    
P99 TPOT (ms):                           260.85    
---------------Inter-token Latency----------------
Mean ITL (ms):                           190.38    
Median ITL (ms):                         111.25    
P99 ITL (ms):                            539.90    
==================================================
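
To compare these result blocks against your own runs, it can help to parse them into numbers. The sketch below is one way to do it, assuming the `Name:   value` layout shown above. The parser and its names are mine and are not part of vLLM. The sample is a trimmed copy of the single-prompt block for this model, and the last lines check that the reported figures are internally consistent.

```python
import re

# Trimmed copy of the 1-prompt result block above.
SAMPLE = """\
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  6.68
Total input tokens:                      1024
Total generated tokens:                  128
Output token throughput (tok/s):         19.15
Total token throughput (tok/s):          172.39
---------------Time to First Token----------------
Mean TTFT (ms):                          83.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.95
==================================================
"""

# A metric line is "<name>: <number>"; section rules have no colon and are skipped.
METRIC_LINE = re.compile(r"^(?P<name>[^:=]+):\s+(?P<value>[\d.]+)\s*$")


def parse_result(text):
    """Return {metric name: float value} for one benchmark result block."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line)
        if match:
            metrics[match.group("name").strip()] = float(match.group("value"))
    return metrics


m = parse_result(SAMPLE)

# Output throughput should be generated tokens divided by duration.
implied_output_tps = m["Total generated tokens"] / m["Benchmark duration (s)"]

# For a single request, duration is roughly TTFT plus TPOT for the remaining tokens.
implied_duration_s = (
    m["Mean TTFT (ms)"] + (m["Total generated tokens"] - 1) * m["Mean TPOT (ms)"]
) / 1000.0

print(f"implied output tok/s: {implied_output_tps:.2f}")
print(f"implied duration (s): {implied_duration_s:.2f}")
```

The small gap between implied and reported throughput comes from the duration being rounded to two decimals in the printout.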

Qwen/Qwen3-VL-235B-A22B-Instruct-FP8

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  5.76      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.17      
Output token throughput (tok/s):         22.23     
Peak output token throughput (tok/s):    23.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          200.04    
---------------Time to First Token----------------
Mean TTFT (ms):                          127.74    
Median TTFT (ms):                        127.74    
P99 TTFT (ms):                           127.74    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.34     
Median TPOT (ms):                        44.34     
P99 TPOT (ms):                           44.34     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.34     
Median ITL (ms):                         43.91     
P99 ITL (ms):                            47.46     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  24.28     
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              0.41      
Output token throughput (tok/s):         52.72     
Peak output token throughput (tok/s):    70.00     
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          474.52    
---------------Time to First Token----------------
Mean TTFT (ms):                          2665.94   
Median TTFT (ms):                        2717.83   
P99 TTFT (ms):                           4135.27   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          168.49    
Median TPOT (ms):                        168.25    
P99 TPOT (ms):                           180.70    
---------------Inter-token Latency----------------
Mean ITL (ms):                           168.49    
Median ITL (ms):                         161.96    
P99 ITL (ms):                            788.41    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  74.38     
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              1.34      
Output token throughput (tok/s):         172.08    
Peak output token throughput (tok/s):    400.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          1548.71   
---------------Time to First Token----------------
Mean TTFT (ms):                          17847.62  
Median TTFT (ms):                        17271.93  
P99 TTFT (ms):                           38251.50  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          401.05    
Median TPOT (ms):                        405.94    
P99 TPOT (ms):                           489.78    
---------------Inter-token Latency----------------
Mean ITL (ms):                           401.05    
Median ITL (ms):                         301.32    
P99 ITL (ms):                            846.77    
==================================================

QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  3.95      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         32.37     
Peak output token throughput (tok/s):    33.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          291.31    
---------------Time to First Token----------------
Mean TTFT (ms):                          88.34     
Median TTFT (ms):                        88.34     
P99 TTFT (ms):                           88.34     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          30.44     
Median TPOT (ms):                        30.44     
P99 TPOT (ms):                           30.44     
---------------Inter-token Latency----------------
Mean ITL (ms):                           30.44     
Median ITL (ms):                         30.26     
P99 ITL (ms):                            32.65     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  15.86     
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              0.63      
Output token throughput (tok/s):         80.71     
Peak output token throughput (tok/s):    110.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          726.39    
---------------Time to First Token----------------
Mean TTFT (ms):                          2363.01   
Median TTFT (ms):                        2390.65   
P99 TTFT (ms):                           3841.38   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          105.40    
Median TPOT (ms):                        105.28    
P99 TPOT (ms):                           116.69    
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.40    
Median ITL (ms):                         96.24     
P99 ITL (ms):                            660.66    
==================================================

GPT-OSS-20B

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  1.50      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.67      
Output token throughput (tok/s):         85.25     
Peak output token throughput (tok/s):    83.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          767.26    
---------------Time to First Token----------------
Mean TTFT (ms):                          41.07     
Median TTFT (ms):                        41.07     
P99 TTFT (ms):                           41.07     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.50     
Median TPOT (ms):                        11.50     
P99 TPOT (ms):                           11.50     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.50     
Median ITL (ms):                         10.66     
P99 ITL (ms):                            20.37     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  3.70      
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              2.71      
Output token throughput (tok/s):         346.25    
Peak output token throughput (tok/s):    460.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          3116.22   
---------------Time to First Token----------------
Mean TTFT (ms):                          643.37    
Median TTFT (ms):                        613.75    
P99 TTFT (ms):                           1084.12   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.79     
Median TPOT (ms):                        24.05     
P99 TPOT (ms):                           27.63     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.79     
Median ITL (ms):                         19.11     
P99 ITL (ms):                            179.76    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  12.81     
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              7.80      
Output token throughput (tok/s):         999.02    
Peak output token throughput (tok/s):    2800.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          8991.14   
---------------Time to First Token----------------
Mean TTFT (ms):                          3979.85   
Median TTFT (ms):                        3904.99   
P99 TTFT (ms):                           8486.11   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          64.12     
Median TPOT (ms):                        65.14     
P99 TPOT (ms):                           87.99     
---------------Inter-token Latency----------------
Mean ITL (ms):                           64.12     
Median ITL (ms):                         35.34     
P99 ITL (ms):                            187.69    
==================================================

GPT-OSS-120B

8 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  2.07      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.48      
Output token throughput (tok/s):         61.86     
Peak output token throughput (tok/s):    64.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          556.70    
---------------Time to First Token----------------
Mean TTFT (ms):                          47.08     
Median TTFT (ms):                        47.08     
P99 TTFT (ms):                           47.08     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.92     
Median TPOT (ms):                        15.92     
P99 TPOT (ms):                           15.92     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.92     
Median ITL (ms):                         14.10     
P99 ITL (ms):                            24.64     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  5.01      
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              2.00      
Output token throughput (tok/s):         255.50    
Peak output token throughput (tok/s):    360.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          2299.53   
---------------Time to First Token----------------
Mean TTFT (ms):                          877.16    
Median TTFT (ms):                        928.99    
P99 TTFT (ms):                           1400.55   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.17     
Median TPOT (ms):                        31.83     
P99 TPOT (ms):                           37.86     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.17     
Median ITL (ms):                         25.62     
P99 ITL (ms):                            214.43    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  21.57     
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              4.64      
Output token throughput (tok/s):         593.55    
Peak output token throughput (tok/s):    1600.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          5341.94   
---------------Time to First Token----------------
Mean TTFT (ms):                          6357.06   
Median TTFT (ms):                        6132.10   
P99 TTFT (ms):                           13638.07  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          111.45    
Median TPOT (ms):                        114.05    
P99 TPOT (ms):                           149.78    
---------------Inter-token Latency----------------
Mean ITL (ms):                           111.45    
Median ITL (ms):                         63.22     
P99 ITL (ms):                            306.93    
==================================================

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  1.79      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.56      
Output token throughput (tok/s):         71.54     
Peak output token throughput (tok/s):    71.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          643.89    
---------------Time to First Token----------------
Mean TTFT (ms):                          37.10     
Median TTFT (ms):                        37.10     
P99 TTFT (ms):                           37.10     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.79     
Median TPOT (ms):                        13.79     
P99 TPOT (ms):                           13.79     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.79     
Median ITL (ms):                         13.73     
P99 ITL (ms):                            15.01     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  5.21      
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              1.92      
Output token throughput (tok/s):         245.57    
Peak output token throughput (tok/s):    340.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          2210.11   
---------------Time to First Token----------------
Mean TTFT (ms):                          908.51    
Median TTFT (ms):                        962.65    
P99 TTFT (ms):                           1462.06   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.48     
Median TPOT (ms):                        33.10     
P99 TPOT (ms):                           38.47     
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.48     
Median ITL (ms):                         29.98     
P99 ITL (ms):                            298.82    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  24.67     
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              4.05      
Output token throughput (tok/s):         518.78    
Peak output token throughput (tok/s):    1300.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4669.03   
---------------Time to First Token----------------
Mean TTFT (ms):                          6719.06   
Median TTFT (ms):                        6547.16   
P99 TTFT (ms):                           14641.29  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          130.26    
Median TPOT (ms):                        132.70    
P99 TPOT (ms):                           167.78    
---------------Inter-token Latency----------------
Mean ITL (ms):                           130.26    
Median ITL (ms):                         84.86     
P99 ITL (ms):                            424.07    
==================================================
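
As a sanity check on how to read these reports, the headline throughput numbers can be reproduced from the token counts and the benchmark duration. The sketch below assumes the usual definitions (throughput = tokens / wall-clock duration, TPOT = (end-to-end latency - TTFT) / (output tokens - 1)) and, for the single-request run, treats the benchmark duration as that request's end-to-end latency. The function names are made up for illustration.

```python
def throughput(duration_s, input_tokens, output_tokens, num_requests):
    """Recompute the headline rates from raw counts and wall-clock duration."""
    return {
        "req_per_s": num_requests / duration_s,
        "output_tok_per_s": output_tokens / duration_s,
        "total_tok_per_s": (input_tokens + output_tokens) / duration_s,
    }


def tpot_ms(e2e_latency_ms, ttft_ms, output_tokens):
    """Time per output token, excluding the first token."""
    return (e2e_latency_ms - ttft_ms) / (output_tokens - 1)


# Numbers copied from the 4-node, single-request report directly above.
single = throughput(duration_s=1.79, input_tokens=1024, output_tokens=128, num_requests=1)
# Assumption: with one request, end-to-end latency is roughly the benchmark duration.
single_tpot = tpot_ms(e2e_latency_ms=1790.0, ttft_ms=37.10, output_tokens=128)

# Numbers copied from the 4-node, 100-request report directly above.
loaded = throughput(duration_s=24.67, input_tokens=102400, output_tokens=12800, num_requests=100)

print(single)
print(round(single_tpot, 2))
print(loaded)
```

The recomputed values land within about 0.1% of the reported ones (643.89 tok/s total, 13.79 ms TPOT, 4669.03 tok/s total). The small gap is presumably because the report rounds the duration to two decimals.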

zai-org/GLM-4.6-FP8

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  8.40      
Total input tokens:                      1024      
Total generated tokens:                  128       
Request throughput (req/s):              0.12      
Output token throughput (tok/s):         15.23     
Peak output token throughput (tok/s):    16.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          137.08    
---------------Time to First Token----------------
Mean TTFT (ms):                          224.50    
Median TTFT (ms):                        224.50    
P99 TTFT (ms):                           224.50    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          64.40     
Median TPOT (ms):                        64.40     
P99 TPOT (ms):                           64.40     
---------------Inter-token Latency----------------
Mean ITL (ms):                           64.40     
Median ITL (ms):                         64.32     
P99 ITL (ms):                            66.00     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  40.67     
Total input tokens:                      10240     
Total generated tokens:                  1280      
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         31.48     
Peak output token throughput (tok/s):    40.00     
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          283.28    
---------------Time to First Token----------------
Mean TTFT (ms):                          5797.18   
Median TTFT (ms):                        5759.51   
P99 TTFT (ms):                           8694.79   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          272.29    
Median TPOT (ms):                        272.72    
P99 TPOT (ms):                           301.66    
---------------Inter-token Latency----------------
Mean ITL (ms):                           272.29    
Median ITL (ms):                         257.01    
P99 ITL (ms):                            1718.29   
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  142.28    
Total input tokens:                      102400    
Total generated tokens:                  12800     
Request throughput (req/s):              0.70      
Output token throughput (tok/s):         89.96     
Peak output token throughput (tok/s):    200.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          809.67    
---------------Time to First Token----------------
Mean TTFT (ms):                          38421.92  
Median TTFT (ms):                        36869.57  
P99 TTFT (ms):                           81050.88  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          749.07    
Median TPOT (ms):                        761.59    
P99 TPOT (ms):                           962.73    
---------------Inter-token Latency----------------
Mean ITL (ms):                           749.07    
Median ITL (ms):                         514.18    
P99 ITL (ms):                            1774.77   
==================================================

nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  4.58      
Total input tokens:                      1023      
Total generated tokens:                  128       
Request throughput (req/s):              0.22      
Output token throughput (tok/s):         27.93     
Peak output token throughput (tok/s):    30.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          251.18    
---------------Time to First Token----------------
Mean TTFT (ms):                          382.88    
Median TTFT (ms):                        382.88    
P99 TTFT (ms):                           382.88    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.07     
Median TPOT (ms):                        33.07     
P99 TPOT (ms):                           33.07     
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.07     
Median ITL (ms):                         30.99     
P99 ITL (ms):                            43.77     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  11.95     
Total input tokens:                      10230     
Total generated tokens:                  1280      
Request throughput (req/s):              0.84      
Output token throughput (tok/s):         107.10    
Peak output token throughput (tok/s):    150.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          963.11    
---------------Time to First Token----------------
Mean TTFT (ms):                          1712.12   
Median TTFT (ms):                        1843.62   
P99 TTFT (ms):                           2657.49   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.48     
Median TPOT (ms):                        74.81     
P99 TPOT (ms):                           84.39     
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.48     
Median ITL (ms):                         67.11     
P99 ITL (ms):                            394.16    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  41.58     
Total input tokens:                      102300    
Total generated tokens:                  12800     
Request throughput (req/s):              2.40      
Output token throughput (tok/s):         307.83    
Peak output token throughput (tok/s):    800.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          2768.06   
---------------Time to First Token----------------
Mean TTFT (ms):                          10097.28  
Median TTFT (ms):                        9424.95   
P99 TTFT (ms):                           22795.64  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          222.56    
Median TPOT (ms):                        227.89    
P99 TPOT (ms):                           269.39    
---------------Inter-token Latency----------------
Mean ITL (ms):                           222.56    
Median ITL (ms):                         140.08    
P99 ITL (ms):                            618.16    
==================================================

nvidia/Llama-3.3-70B-Instruct-NVFP4

4 node (tp)

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  10.03     
Total input tokens:                      1023      
Total generated tokens:                  128       
Request throughput (req/s):              0.10      
Output token throughput (tok/s):         12.77     
Peak output token throughput (tok/s):    14.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          114.79    
---------------Time to First Token----------------
Mean TTFT (ms):                          312.52    
Median TTFT (ms):                        312.52    
P99 TTFT (ms):                           312.52    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.49     
Median TPOT (ms):                        76.49     
P99 TPOT (ms):                           76.49     
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.49     
Median ITL (ms):                         73.79     
P99 ITL (ms):                            87.22     
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  21.36     
Total input tokens:                      10230     
Total generated tokens:                  1280      
Request throughput (req/s):              0.47      
Output token throughput (tok/s):         59.93     
Peak output token throughput (tok/s):    90.00     
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          538.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          3472.04   
Median TTFT (ms):                        3544.04   
P99 TTFT (ms):                           5023.57   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          138.56    
Median TPOT (ms):                        138.05    
P99 TPOT (ms):                           151.98    
---------------Inter-token Latency----------------
Mean ITL (ms):                           138.56    
Median ITL (ms):                         125.58    
P99 ITL (ms):                            839.30    
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  69.30     
Total input tokens:                      102300    
Total generated tokens:                  12800     
Request throughput (req/s):              1.44      
Output token throughput (tok/s):         184.69    
Peak output token throughput (tok/s):    600.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          1660.78   
---------------Time to First Token----------------
Mean TTFT (ms):                          23961.51  
Median TTFT (ms):                        23834.79  
P99 TTFT (ms):                           47879.51  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          331.80    
Median TPOT (ms):                        333.44    
P99 TPOT (ms):                           481.50    
---------------Inter-token Latency----------------
Mean ITL (ms):                           331.80    
Median ITL (ms):                         182.62    
P99 ITL (ms):                            1731.80   
==================================================

Summary of Results

  • GPT-OSS-20B is the throughput champion. It delivers the highest raw performance across all concurrency levels, hitting ~9k tok/s total at 100 concurrent requests with remarkably low TPOT (64ms). Single-request latency is excellent (TTFT 41ms), and it scales gracefully under load.

  • GPT-OSS-120B offers the best balance for a large model. It has the snappiest single-request behavior (TTFT 37ms) and maintains reasonable latency even at scale, reaching ~4.7k tok/s total at 100 concurrent. TPOT stays controlled (130ms at 100 reqs) compared to other big models.

  • Qwen3-VL-32B-FP8 is solid for moderate workloads. Single-request latency is acceptable (TTFT 84ms), and it reaches ~3k tok/s total at 100 concurrent. However, TTFT climbs significantly under load (≈11.6s at 100 reqs), making it feel sluggish for interactive use at high concurrency.

  • Llama-4-Scout-17B-16E-NVFP4 performs similarly to Qwen3-VL-32B under load. Comparable scaling behavior (TTFT ≈10s at 100 reqs, ~2.8k tok/s total), though single-request TTFT is higher (383ms), possibly due to MoE routing overhead.

  • Qwen3-VL-235B-A22B-AWQ improves significantly over the FP8 variant at low concurrency. Single-request TPOT drops from 44ms to 30ms, and TTFT from 128ms to 88ms. At 10 concurrent, it’s still faster (TPOT 105ms vs 168ms), making AWQ worthwhile for latency-sensitive deployments of this model.

  • Qwen3-VL-235B-A22B-FP8 is strongly latency-bound. Acceptable at single requests, but TTFT explodes with concurrency (≈17.8s at 100 reqs) and TPOT becomes very high (401ms). Throughput caps around ~1.5k tok/s total.

  • Llama-3.3-70B-NVFP4 struggles with the FP4 quantization overhead. Despite being smaller than GPT-OSS-120B, it’s slower across the board: higher TTFT, worse TPOT, and lower throughput (~1.7k tok/s at 100 concurrent).

  • GLM-4.6-FP8 degrades the hardest under load. TTFT becomes extreme (≈38s at 100 reqs) and TPOT balloons to 749ms. Not suitable for interactive or high-concurrency serving.
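
For anyone who wants to build their own comparison from the raw dumps rather than reading numbers off by eye, here is a minimal parsing sketch. It assumes each report starts with a "Serving Benchmark Result" banner and that metrics are "name: value" lines, as in the output above. `parse_reports` is a made-up helper name, and the sample is a trimmed copy of two of the GLM-4.6-FP8 reports.

```python
import re

HEADER = "Serving Benchmark Result"
# "Mean TTFT (ms):      224.50" -> ("Mean TTFT (ms)", "224.50")
METRIC = re.compile(r"^(.+?):\s+([0-9.]+)$")


def parse_reports(text):
    """Split a dump into one dict of metric name -> float per report."""
    reports = []
    for line in text.splitlines():
        line = line.strip()
        if HEADER in line:
            reports.append({})  # a banner line starts a new report
            continue
        match = METRIC.match(line)
        if match and reports:
            reports[-1][match.group(1)] = float(match.group(2))
    return reports


sample = """
============ Serving Benchmark Result ============
Successful requests:                     1
Total token throughput (tok/s):          137.08
---------------Time to First Token----------------
Mean TTFT (ms):                          224.50
Mean TPOT (ms):                          64.40
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100
Total token throughput (tok/s):          809.67
---------------Time to First Token----------------
Mean TTFT (ms):                          38421.92
Mean TPOT (ms):                          749.07
==================================================
"""

reports = parse_reports(sample)
for r in reports:
    print(
        int(r["Successful requests"]),
        r["Total token throughput (tok/s)"],
        r["Mean TTFT (ms)"],
        r["Mean TPOT (ms)"],
    )
```

Section separators such as the "Time to First Token" rule have no "name: value" shape, so the pattern skips them without special handling.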

Now we need NVIDIA to release a DGX desktop 200GbE switch for handling 4 Sparks.

Nice! So, it still scales well with 4 nodes. It pretty much doubles the performance compared to a 2-node cluster for dense models and still scales nicely for “faster” ones.
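
To put a number on "scales well", one common approach is to divide the measured speedup by the ideal (linear) speedup from the added nodes. A rough sketch follows. `scaling_efficiency` is a hypothetical helper, and the inputs in the example are made-up round numbers, not measurements from this thread.

```python
def scaling_efficiency(base_throughput, base_nodes, new_throughput, new_nodes):
    """Measured speedup divided by the ideal (linear) speedup."""
    if min(base_throughput, base_nodes, new_throughput, new_nodes) <= 0:
        raise ValueError("throughputs and node counts must be positive")
    speedup = new_throughput / base_throughput
    ideal = new_nodes / base_nodes
    return speedup / ideal


# Made-up round numbers for illustration only, NOT benchmark results.
perfect = scaling_efficiency(100.0, 2, 200.0, 4)  # doubling nodes doubles throughput
partial = scaling_efficiency(100.0, 2, 150.0, 4)  # doubling nodes gives only 1.5x
print(perfect, partial)
```

A value near 1.0 means close to linear scaling. Plugging in the same metric (for example total tok/s at the same concurrency) from a 2-node and a 4-node run of the same model gives a directly comparable figure.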
