I have 4x clustered currently. Anything anyone is curious to try or see? I'll be expanding it by 2 more tomorrow.
Pictures. Post some pictures please!
Would be good to see if NanoChat could be trained within 24 hours, and whether you'd cross 100k tokens processed per second during training.
Recipe for two nodes: Train nanochat on 2 NVIDIA DGX Sparks.md · GitHub
Precisely one of the experiments I want to check out, including doing a from-scratch pre-train etc.
They're rather good training devices.
Would be good to see the stats of the Mikrotik switch during the session as well. Your test results will be very important for some of the decisions I need to take in the near term about my home lab expansion.
I'm really interested to see how well vLLM scales tensor parallel beyond 2 Sparks.
What switch are you using?
Can you run inference on a few models with tensor-parallel=8?
If possible, these ones, so I could compare to my dual setup:
- Qwen/Qwen3-VL-32B-Instruct-FP8
- GPT-OSS-120B
- QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ
- Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 (I can't run this one though)
- zai-org/GLM-4.6-FP8 (I can't run this either, only the AWQ 4-bit quant)
- any dense model > 100B (like the new Devstral 2)
- And, of course, Deepseek 3.2 :)
I believe it's the MikroTik CRS812 DDQ.
Correct. Using two 400-to-200 Gbps splitters, plus 2 of the 200 Gbps ports, for a total of 6 machines connected via the IB fabric. The bandwidth tests hit more or less the same numbers as direct connect.
Should be 6 TP since it's 6 boxes in this case, yes? Do you have preferred benchmarks or a vLLM image to use? I build my own nightlies, so I'm not sure how it'll align with expectations.
Yes, you are right, for 6 it would be 6 tp. I don't know why I read 6 as 8 units :)
I'm using nightly builds too, with Triton, Torch and FlashInfer from the main branch and cu130 wheels, so no problem there.
For benchmarks, just use vllm bench serve like this:
vllm bench serve \
--backend vllm \
--model Qwen/Qwen3-VL-32B-Instruct-FP8 \
--host spark \
--endpoint /v1/completions \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1 \
--port 8888
For --num-prompts 1, 10 and 100 (in that order).
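To sweep all three concurrency levels in one go, the command above can be wrapped in a small script. `bench_cmd` is a hypothetical helper; the flags simply mirror the `vllm bench serve` invocation quoted above.

```python
# Hypothetical helper to sweep `vllm bench serve` over several --num-prompts
# values. Model, host and port mirror the command quoted in the thread.
import shlex
import subprocess

def bench_cmd(num_prompts: int,
              model: str = "Qwen/Qwen3-VL-32B-Instruct-FP8",
              host: str = "spark", port: int = 8888) -> list[str]:
    """Build the argv for one vllm bench serve run."""
    return [
        "vllm", "bench", "serve",
        "--backend", "vllm",
        "--model", model,
        "--host", host,
        "--endpoint", "/v1/completions",
        "--dataset-name", "sharegpt",
        "--dataset-path", "ShareGPT_V3_unfiltered_cleaned_split.json",
        "--num-prompts", str(num_prompts),
        "--port", str(port),
    ]

if __name__ == "__main__":
    for n in (1, 10, 100):  # the requested order
        cmd = bench_cmd(n)
        print(shlex.join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment to actually run
```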
Only pp works on most models. Guess I need 8 :)
Are you going to combine 50 Gbps ports with a cable like this (1m (3ft) 200G QSFP56 to 4X50G SFP56 Passive DAC Breakout Cable for NVIDIA/Mellanox 30AWG - NADDOD), or replace the switch?
Oh, I forgot about this quirk. Can you at least run on 4 Sparks, with -tp 4? :)
I am probably going to add another switch and bridge them.
edit: also to note, it's much easier to split the DACs than to combine them. If you have an idea for a switch that would let me easily add 2 more nodes (without having to bridge, or costing a billion dollars), I'm all ears.
here is what some traffic looks like during the nanochat training session, the TX is maxing at 24.6 Gbps, but that is somewhat to be expected since the traffic is quite bursty:
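A very rough sanity check on that 24.6 Gbps peak, assuming plain DDP-style bf16 gradient all-reduce over a ring (nanochat's actual Muon/sharded optimizer comms will change the exact volume). The parameter count and step time come from the training log further down the thread:

```python
# Back-of-the-envelope estimate of per-node gradient-sync traffic,
# assuming bf16 (2-byte) gradients and a ring all-reduce every step.
params = 560_988_160   # nanochat model parameters (from the log)
bytes_per_grad = 2     # bf16
world = 6              # nodes
step_s = 5.2           # observed step time at batch size 21

grad_bytes = params * bytes_per_grad
# a ring all-reduce moves ~2*(n-1)/n of the buffer per rank
ring_bytes = grad_bytes * 2 * (world - 1) / world
avg_gbps = ring_bytes * 8 / step_s / 1e9
print(f"~{grad_bytes/1e9:.2f} GB of gradients, "
      f"~{ring_bytes/1e9:.2f} GB on the wire per rank, "
      f"avg ~{avg_gbps:.2f} Gb/s")
```

The average is only a few Gb/s, so a 24.6 Gbps burst during the actual sync window is consistent with the traffic being very bursty rather than link-limited.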
sorry for the late updates, but having to do work things with them first :)
that said, here is the output from a slightly modified nanochat run for the interested:
6 nodes, batch size 21
root@buttercup:/workspace# torchrun --nnodes=6 --nproc_per_node=1 --node_rank=0 --master_addr=$MASTER_ADDR --master_port=29500 -m scripts.base_train -- --max_seq_len=2048 --device_batch_size=21 --total_batch_size=516096
[nanochat ASCII-art banner]
Overriding: max_seq_len = 2048
Overriding: device_batch_size = 21
Overriding: total_batch_size = 516096
Autodetected device type: cuda
/usr/local/lib/python3.12/dist-packages/torch/__init__.py:1614: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = "tf32" or torch.backends.cuda.matmul.fp32_precision = "ieee". Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see
(Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/Context.cpp:45.)
_C._set_float32_matmul_precision(precision)
2025-12-18 01:59:35,110 - nanochat.common - INFO - Distributed world size: 6
Vocab size: 65,536
num_layers: 20
model_dim: 1280
num_heads: 10
num_kv_heads: 10
Tokens / micro-batch / rank: 21 x 2048 = 43,008
Tokens / micro-batch: 258,048
Total batch size 516,096 => gradient accumulation steps: 2
Number of parameters: 560,988,160
Estimated FLOPs per token: 3.491758e+09
Calculated number of iterations from target data:param ratio: 21,739
Total number of training tokens: 11,219,410,944
Tokens : Params ratio: 20.00
Total training FLOPs estimate: 3.917547e+19
Scaling the LR for the AdamW parameters ∝ 1/√(1280/768) = 0.774597
AdamW optimizer: torch.optim.AdamW (world_size=6)
Muon: Grouping 80 params of shape torch.Size([1280, 1280]), device cuda:0, dtype torch.float32
Muon: Grouping 20 params of shape torch.Size([1280, 5120]), device cuda:0, dtype torch.float32
Muon: Grouping 20 params of shape torch.Size([5120, 1280]), device cuda:0, dtype torch.float32
Step 00000 | Validation bpb: 3.3005
step 00000/21739 (0.00%) | loss: 11.090356 | grad norm: 0.4254 | lrm: 1.00 | dt: 11600.72ms | tok/sec: 44,488 | mfu: 2.62 | total time: 0.00m
step 00001/21739 (0.00%) | loss: 10.856434 | grad norm: 12.2922 | lrm: 1.00 | dt: 5202.11ms | tok/sec: 99,208 | mfu: 5.84 | total time: 0.00m
step 00002/21739 (0.01%) | loss: 10.255845 | grad norm: 4.9579 | lrm: 1.00 | dt: 5238.49ms | tok/sec: 98,519 | mfu: 5.80 | total time: 0.00m
step 00003/21739 (0.01%) | loss: 9.647144 | grad norm: 4.5589 | lrm: 1.00 | dt: 5221.48ms | tok/sec: 98,840 | mfu: 5.82 | total time: 0.00m
step 00004/21739 (0.02%) | loss: 9.104345 | grad norm: 6.5869 | lrm: 1.00 | dt: 5257.93ms | tok/sec: 98,155 | mfu: 5.78 | total time: 0.00m
step 00005/21739 (0.02%) | loss: 8.712114 | grad norm: 5.1133 | lrm: 1.00 | dt: 5219.17ms | tok/sec: 98,884 | mfu: 5.82 | total time: 0.00m
step 00006/21739 (0.03%) | loss: 8.421516 | grad norm: 5.5131 | lrm: 1.00 | dt: 5254.20ms | tok/sec: 98,225 | mfu: 5.78 | total time: 0.00m
step 00007/21739 (0.03%) | loss: 8.210005 | grad norm: 5.9095 | lrm: 1.00 | dt: 5228.70ms | tok/sec: 98,704 | mfu: 5.81 | total time: 0.00m
step 00008/21739 (0.04%) | loss: 8.025025 | grad norm: 6.3944 | lrm: 1.00 | dt: 5294.74ms | tok/sec: 97,473 | mfu: 5.74 | total time: 0.00m
step 00009/21739 (0.04%) | loss: 7.857136 | grad norm: 1.9207 | lrm: 1.00 | dt: 5203.78ms | tok/sec: 99,177 | mfu: 5.84 | total time: 0.00m
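The batch-size bookkeeping and the MFU figure in the log above can be reproduced. The per-device peak here is an assumption (NVIDIA quotes roughly 1 PFLOP for the Spark; nanochat's internal constant may differ slightly, which is why this lands near but not exactly on the logged 5.84):

```python
# Reproduce the batch-size arithmetic and estimate MFU from the log.
seq_len = 2048
device_bs = 21
world = 6
total_bs = 516_096
flops_per_tok = 3.491758e9       # from the log
tok_per_s = 99_208               # steady-state tok/sec from the log
peak_flops_per_dev = 1e15        # ASSUMED per-Spark peak

tok_per_rank = device_bs * seq_len        # tokens / micro-batch / rank
tok_per_micro = tok_per_rank * world      # tokens / micro-batch
grad_accum = total_bs // tok_per_micro    # gradient accumulation steps
mfu = tok_per_s * flops_per_tok / (world * peak_flops_per_dev)
print(tok_per_rank, tok_per_micro, grad_accum, f"{mfu:.1%}")
```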
6 nodes, batch size 32.
step 00000/14266 (0.00%) | loss: 11.090355 | grad norm: 0.4273 | lrm: 1.00 | dt: 26923.54ms | tok/sec: 29,209 | mfu: 1.72 | total time: 0.00m
step 00001/14266 (0.01%) | loss: 10.832258 | grad norm: 12.1549 | lrm: 1.00 | dt: 7812.14ms | tok/sec: 100,667 | mfu: 5.92 | total time: 0.00m
step 00002/14266 (0.01%) | loss: 10.247421 | grad norm: 5.1460 | lrm: 1.00 | dt: 7806.38ms | tok/sec: 100,742 | mfu: 5.93 | total time: 0.00m
step 00003/14266 (0.02%) | loss: 9.577996 | grad norm: 4.4216 | lrm: 1.00 | dt: 7871.57ms | tok/sec: 99,907 | mfu: 5.88 | total time: 0.00m
step 00004/14266 (0.03%) | loss: 9.017444 | grad norm: 5.9643 | lrm: 1.00 | dt: 7838.93ms | tok/sec: 100,323 | mfu: 5.90 | total time: 0.00m
step 00005/14266 (0.04%) | loss: 8.618053 | grad norm: 5.0371 | lrm: 1.00 | dt: 7839.13ms | tok/sec: 100,321 | mfu: 5.90 | total time: 0.00m
step 00006/14266 (0.04%) | loss: 8.354537 | grad norm: 4.9139 | lrm: 1.00 | dt: 7892.11ms | tok/sec: 99,647 | mfu: 5.86 | total time: 0.00m
step 00007/14266 (0.05%) | loss: 8.156961 | grad norm: 5.1475 | lrm: 1.00 | dt: 7870.68ms | tok/sec: 99,919 | mfu: 5.88 | total time: 0.00m
step 00008/14266 (0.06%) | loss: 7.977603 | grad norm: 4.9869 | lrm: 1.00 | dt: 7929.55ms | tok/sec: 99,177 | mfu: 5.84 | total time: 0.00m
step 00009/14266 (0.06%) | loss: 7.811369 | grad norm: 2.6385 | lrm: 1.00 | dt: 7891.02ms | tok/sec: 99,661 | mfu: 5.86 | total time: 0.00m
step 00010/14266 (0.07%) | loss: 7.678523 | grad norm: 2.2132 | lrm: 1.00 | dt: 7895.18ms | tok/sec: 99,609 | mfu: 5.86 | total time: 0.00m
step 00011/14266 (0.08%) | loss: 7.566596 | grad norm: 1.9411 | lrm: 1.00 | dt: 7973.75ms | tok/sec: 98,627 | mfu: 5.80 | total time: 0.13m
step 00012/14266 (0.08%) | loss: 7.478057 | grad norm: 3.3588 | lrm: 1.00 | dt: 7997.09ms | tok/sec: 98,339 | mfu: 5.79 | total time: 0.27m
6 nodes, batch size 40:
Step 00000 | Validation bpb: 3.3005
step 00000/11413 (0.00%) | loss: 11.090355 | grad norm: 0.4369 | lrm: 1.00 | dt: 31846.23ms | tok/sec: 30,868 | mfu: 1.82 | total time: 0.00m
step 00001/11413 (0.01%) | loss: 10.837876 | grad norm: 11.9723 | lrm: 1.00 | dt: 10966.73ms | tok/sec: 89,638 | mfu: 5.27 | total time: 0.00m
step 00002/11413 (0.02%) | loss: 10.234440 | grad norm: 5.1827 | lrm: 1.00 | dt: 10978.53ms | tok/sec: 89,542 | mfu: 5.27 | total time: 0.00m
step 00003/11413 (0.03%) | loss: 9.564955 | grad norm: 4.5415 | lrm: 1.00 | dt: 10987.01ms | tok/sec: 89,472 | mfu: 5.26 | total time: 0.00m
step 00004/11413 (0.04%) | loss: 8.999896 | grad norm: 5.9772 | lrm: 1.00 | dt: 11036.76ms | tok/sec: 89,069 | mfu: 5.24 | total time: 0.00m
step 00005/11413 (0.04%) | loss: 8.636850 | grad norm: 5.7050 | lrm: 1.00 | dt: 10912.49ms | tok/sec: 90,083 | mfu: 5.30 | total time: 0.00m
step 00006/11413 (0.05%) | loss: 8.344168 | grad norm: 3.5539 | lrm: 1.00 | dt: 10990.30ms | tok/sec: 89,446 | mfu: 5.26 | total time: 0.00m
step 00007/11413 (0.06%) | loss: 8.109464 | grad norm: 4.5131 | lrm: 1.00 | dt: 10978.89ms | tok/sec: 89,539 | mfu: 5.27 | total time: 0.00m
step 00008/11413 (0.07%) | loss: 7.942097 | grad norm: 6.6981 | lrm: 1.00 | dt: 11064.19ms | tok/sec: 88,848 | mfu: 5.23 | total time: 0.00m
step 00009/11413 (0.08%) | loss: 7.762931 | grad norm: 1.8954 | lrm: 1.00 | dt: 10984.50ms | tok/sec: 89,493 | mfu: 5.27 | total time: 0.00m
step 00010/11413 (0.09%) | loss: 7.614245 | grad norm: 2.3076 | lrm: 1.00 | dt: 10977.75ms | tok/sec: 89,548 | mfu: 5.27 | total time: 0.00m
step 00011/11413 (0.10%) | loss: 7.505912 | grad norm: 1.7908 | lrm: 1.00 | dt: 11068.57ms | tok/sec: 88,813 | mfu: 5.23 | total time: 0.18m
step 00012/11413 (0.11%) | loss: 7.403605 | grad norm: 3.6414 | lrm: 1.00 | dt: 11041.89ms | tok/sec: 89,028 | mfu: 5.24 | total time: 0.37m
step 00013/11413 (0.11%) | loss: 7.300447 | grad norm: 1.6861 | lrm: 1.00 | dt: 11145.53ms | tok/sec: 88,200 | mfu: 5.19 | total time: 0.55m
edit: added a nice little nvitop view. Training gets the power draw higher than you really see it anywhere else, same with the temps. Will need to do something about that tomorrow.
looks like we get just under 100k tps
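That also answers the 24-hour question from earlier in the thread, at least for this config. At just under 100k tok/s, the full token budget from the log works out to a bit over a day:

```python
# Estimate wall-clock time for the full pre-train at the observed rate.
total_tokens = 11_219_410_944   # "Total number of training tokens" from the log
tok_per_s = 99_000              # just under 100k tok/s
hours = total_tokens / tok_per_s / 3600
print(f"~{hours:.1f} h")        # ~31.5 h
```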
Command templates:
docker exec -it vllm_node bash -i -c "vllm serve M --host 0.0.0.0 --trust_remote_code --gpu-memory-utilization 0.8 -pp 1 -tp X --distributed-executor-backend ray --load-format fastsafetensors --kv-cache-dtype fp8"
vllm bench serve --backend vllm --model M --host 10.20.0.4 --endpoint /v1/completions --hf-name sharegpt --num-prompts X --port 8000
Qwen/Qwen3-VL-32B-Instruct-FP8
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 6.68
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.15
Output token throughput (tok/s): 19.15
Peak output token throughput (tok/s): 20.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 172.39
---------------Time to First Token----------------
Mean TTFT (ms): 83.91
Median TTFT (ms): 83.91
P99 TTFT (ms): 83.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 51.95
Median TPOT (ms): 51.95
P99 TPOT (ms): 51.95
---------------Inter-token Latency----------------
Mean ITL (ms): 51.95
Median ITL (ms): 51.75
P99 ITL (ms): 55.00
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 10.52
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.95
Output token throughput (tok/s): 121.64
Peak output token throughput (tok/s): 170.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 1094.79
---------------Time to First Token----------------
Mean TTFT (ms): 1693.08
Median TTFT (ms): 1731.72
P99 TTFT (ms): 2623.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 68.42
Median TPOT (ms): 68.12
P99 TPOT (ms): 76.79
---------------Inter-token Latency----------------
Mean ITL (ms): 68.42
Median ITL (ms): 62.29
P99 ITL (ms): 515.58
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 37.87
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 2.64
Output token throughput (tok/s): 338.03
Peak output token throughput (tok/s): 900.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 3042.23
---------------Time to First Token----------------
Mean TTFT (ms): 11606.98
Median TTFT (ms): 11121.38
P99 TTFT (ms): 24464.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 190.38
Median TPOT (ms): 193.39
P99 TPOT (ms): 260.85
---------------Inter-token Latency----------------
Mean ITL (ms): 190.38
Median ITL (ms): 111.25
P99 ITL (ms): 539.90
==================================================
Qwen/Qwen3-VL-235B-A22B-Instruct-FP8
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 5.76
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.17
Output token throughput (tok/s): 22.23
Peak output token throughput (tok/s): 23.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 200.04
---------------Time to First Token----------------
Mean TTFT (ms): 127.74
Median TTFT (ms): 127.74
P99 TTFT (ms): 127.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 44.34
Median TPOT (ms): 44.34
P99 TPOT (ms): 44.34
---------------Inter-token Latency----------------
Mean ITL (ms): 44.34
Median ITL (ms): 43.91
P99 ITL (ms): 47.46
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 24.28
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.41
Output token throughput (tok/s): 52.72
Peak output token throughput (tok/s): 70.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 474.52
---------------Time to First Token----------------
Mean TTFT (ms): 2665.94
Median TTFT (ms): 2717.83
P99 TTFT (ms): 4135.27
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 168.49
Median TPOT (ms): 168.25
P99 TPOT (ms): 180.70
---------------Inter-token Latency----------------
Mean ITL (ms): 168.49
Median ITL (ms): 161.96
P99 ITL (ms): 788.41
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 74.38
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 1.34
Output token throughput (tok/s): 172.08
Peak output token throughput (tok/s): 400.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 1548.71
---------------Time to First Token----------------
Mean TTFT (ms): 17847.62
Median TTFT (ms): 17271.93
P99 TTFT (ms): 38251.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 401.05
Median TPOT (ms): 405.94
P99 TPOT (ms): 489.78
---------------Inter-token Latency----------------
Mean ITL (ms): 401.05
Median ITL (ms): 301.32
P99 ITL (ms): 846.77
==================================================
QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 3.95
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.25
Output token throughput (tok/s): 32.37
Peak output token throughput (tok/s): 33.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 291.31
---------------Time to First Token----------------
Mean TTFT (ms): 88.34
Median TTFT (ms): 88.34
P99 TTFT (ms): 88.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 30.44
Median TPOT (ms): 30.44
P99 TPOT (ms): 30.44
---------------Inter-token Latency----------------
Mean ITL (ms): 30.44
Median ITL (ms): 30.26
P99 ITL (ms): 32.65
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 15.86
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.63
Output token throughput (tok/s): 80.71
Peak output token throughput (tok/s): 110.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 726.39
---------------Time to First Token----------------
Mean TTFT (ms): 2363.01
Median TTFT (ms): 2390.65
P99 TTFT (ms): 3841.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 105.40
Median TPOT (ms): 105.28
P99 TPOT (ms): 116.69
---------------Inter-token Latency----------------
Mean ITL (ms): 105.40
Median ITL (ms): 96.24
P99 ITL (ms): 660.66
==================================================
GPT-OSS-20B
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 1.50
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.67
Output token throughput (tok/s): 85.25
Peak output token throughput (tok/s): 83.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 767.26
---------------Time to First Token----------------
Mean TTFT (ms): 41.07
Median TTFT (ms): 41.07
P99 TTFT (ms): 41.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 11.50
Median TPOT (ms): 11.50
P99 TPOT (ms): 11.50
---------------Inter-token Latency----------------
Mean ITL (ms): 11.50
Median ITL (ms): 10.66
P99 ITL (ms): 20.37
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 3.70
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 2.71
Output token throughput (tok/s): 346.25
Peak output token throughput (tok/s): 460.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 3116.22
---------------Time to First Token----------------
Mean TTFT (ms): 643.37
Median TTFT (ms): 613.75
P99 TTFT (ms): 1084.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 23.79
Median TPOT (ms): 24.05
P99 TPOT (ms): 27.63
---------------Inter-token Latency----------------
Mean ITL (ms): 23.79
Median ITL (ms): 19.11
P99 ITL (ms): 179.76
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 12.81
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 7.80
Output token throughput (tok/s): 999.02
Peak output token throughput (tok/s): 2800.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 8991.14
---------------Time to First Token----------------
Mean TTFT (ms): 3979.85
Median TTFT (ms): 3904.99
P99 TTFT (ms): 8486.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 64.12
Median TPOT (ms): 65.14
P99 TPOT (ms): 87.99
---------------Inter-token Latency----------------
Mean ITL (ms): 64.12
Median ITL (ms): 35.34
P99 ITL (ms): 187.69
==================================================
GPT-OSS-120B
8 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 2.07
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.48
Output token throughput (tok/s): 61.86
Peak output token throughput (tok/s): 64.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 556.70
---------------Time to First Token----------------
Mean TTFT (ms): 47.08
Median TTFT (ms): 47.08
P99 TTFT (ms): 47.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 15.92
Median TPOT (ms): 15.92
P99 TPOT (ms): 15.92
---------------Inter-token Latency----------------
Mean ITL (ms): 15.92
Median ITL (ms): 14.10
P99 ITL (ms): 24.64
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 5.01
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 2.00
Output token throughput (tok/s): 255.50
Peak output token throughput (tok/s): 360.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 2299.53
---------------Time to First Token----------------
Mean TTFT (ms): 877.16
Median TTFT (ms): 928.99
P99 TTFT (ms): 1400.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 32.17
Median TPOT (ms): 31.83
P99 TPOT (ms): 37.86
---------------Inter-token Latency----------------
Mean ITL (ms): 32.17
Median ITL (ms): 25.62
P99 ITL (ms): 214.43
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 21.57
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 4.64
Output token throughput (tok/s): 593.55
Peak output token throughput (tok/s): 1600.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 5341.94
---------------Time to First Token----------------
Mean TTFT (ms): 6357.06
Median TTFT (ms): 6132.10
P99 TTFT (ms): 13638.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 111.45
Median TPOT (ms): 114.05
P99 TPOT (ms): 149.78
---------------Inter-token Latency----------------
Mean ITL (ms): 111.45
Median ITL (ms): 63.22
P99 ITL (ms): 306.93
==================================================
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 1.79
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.56
Output token throughput (tok/s): 71.54
Peak output token throughput (tok/s): 71.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 643.89
---------------Time to First Token----------------
Mean TTFT (ms): 37.10
Median TTFT (ms): 37.10
P99 TTFT (ms): 37.10
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 13.79
Median TPOT (ms): 13.79
P99 TPOT (ms): 13.79
---------------Inter-token Latency----------------
Mean ITL (ms): 13.79
Median ITL (ms): 13.73
P99 ITL (ms): 15.01
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 5.21
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 1.92
Output token throughput (tok/s): 245.57
Peak output token throughput (tok/s): 340.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 2210.11
---------------Time to First Token----------------
Mean TTFT (ms): 908.51
Median TTFT (ms): 962.65
P99 TTFT (ms): 1462.06
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 33.48
Median TPOT (ms): 33.10
P99 TPOT (ms): 38.47
---------------Inter-token Latency----------------
Mean ITL (ms): 33.48
Median ITL (ms): 29.98
P99 ITL (ms): 298.82
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 24.67
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 4.05
Output token throughput (tok/s): 518.78
Peak output token throughput (tok/s): 1300.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 4669.03
---------------Time to First Token----------------
Mean TTFT (ms): 6719.06
Median TTFT (ms): 6547.16
P99 TTFT (ms): 14641.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 130.26
Median TPOT (ms): 132.70
P99 TPOT (ms): 167.78
---------------Inter-token Latency----------------
Mean ITL (ms): 130.26
Median ITL (ms): 84.86
P99 ITL (ms): 424.07
==================================================
zai-org/GLM-4.6-FP8
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 8.40
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.12
Output token throughput (tok/s): 15.23
Peak output token throughput (tok/s): 16.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 137.08
---------------Time to First Token----------------
Mean TTFT (ms): 224.50
Median TTFT (ms): 224.50
P99 TTFT (ms): 224.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 64.40
Median TPOT (ms): 64.40
P99 TPOT (ms): 64.40
---------------Inter-token Latency----------------
Mean ITL (ms): 64.40
Median ITL (ms): 64.32
P99 ITL (ms): 66.00
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 40.67
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.25
Output token throughput (tok/s): 31.48
Peak output token throughput (tok/s): 40.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 283.28
---------------Time to First Token----------------
Mean TTFT (ms): 5797.18
Median TTFT (ms): 5759.51
P99 TTFT (ms): 8694.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 272.29
Median TPOT (ms): 272.72
P99 TPOT (ms): 301.66
---------------Inter-token Latency----------------
Mean ITL (ms): 272.29
Median ITL (ms): 257.01
P99 ITL (ms): 1718.29
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 142.28
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 0.70
Output token throughput (tok/s): 89.96
Peak output token throughput (tok/s): 200.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 809.67
---------------Time to First Token----------------
Mean TTFT (ms): 38421.92
Median TTFT (ms): 36869.57
P99 TTFT (ms): 81050.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 749.07
Median TPOT (ms): 761.59
P99 TPOT (ms): 962.73
---------------Inter-token Latency----------------
Mean ITL (ms): 749.07
Median ITL (ms): 514.18
P99 ITL (ms): 1774.77
==================================================
nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 4.58
Total input tokens: 1023
Total generated tokens: 128
Request throughput (req/s): 0.22
Output token throughput (tok/s): 27.93
Peak output token throughput (tok/s): 30.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 251.18
---------------Time to First Token----------------
Mean TTFT (ms): 382.88
Median TTFT (ms): 382.88
P99 TTFT (ms): 382.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 33.07
Median TPOT (ms): 33.07
P99 TPOT (ms): 33.07
---------------Inter-token Latency----------------
Mean ITL (ms): 33.07
Median ITL (ms): 30.99
P99 ITL (ms): 43.77
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 11.95
Total input tokens: 10230
Total generated tokens: 1280
Request throughput (req/s): 0.84
Output token throughput (tok/s): 107.10
Peak output token throughput (tok/s): 150.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 963.11
---------------Time to First Token----------------
Mean TTFT (ms): 1712.12
Median TTFT (ms): 1843.62
P99 TTFT (ms): 2657.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 76.48
Median TPOT (ms): 74.81
P99 TPOT (ms): 84.39
---------------Inter-token Latency----------------
Mean ITL (ms): 76.48
Median ITL (ms): 67.11
P99 ITL (ms): 394.16
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 41.58
Total input tokens: 102300
Total generated tokens: 12800
Request throughput (req/s): 2.40
Output token throughput (tok/s): 307.83
Peak output token throughput (tok/s): 800.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 2768.06
---------------Time to First Token----------------
Mean TTFT (ms): 10097.28
Median TTFT (ms): 9424.95
P99 TTFT (ms): 22795.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 222.56
Median TPOT (ms): 227.89
P99 TPOT (ms): 269.39
---------------Inter-token Latency----------------
Mean ITL (ms): 222.56
Median ITL (ms): 140.08
P99 ITL (ms): 618.16
==================================================
nvidia/Llama-3.3-70B-Instruct-NVFP4
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 10.03
Total input tokens: 1023
Total generated tokens: 128
Request throughput (req/s): 0.10
Output token throughput (tok/s): 12.77
Peak output token throughput (tok/s): 14.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 114.79
---------------Time to First Token----------------
Mean TTFT (ms): 312.52
Median TTFT (ms): 312.52
P99 TTFT (ms): 312.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 76.49
Median TPOT (ms): 76.49
P99 TPOT (ms): 76.49
---------------Inter-token Latency----------------
Mean ITL (ms): 76.49
Median ITL (ms): 73.79
P99 ITL (ms): 87.22
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 21.36
Total input tokens: 10230
Total generated tokens: 1280
Request throughput (req/s): 0.47
Output token throughput (tok/s): 59.93
Peak output token throughput (tok/s): 90.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 538.92
---------------Time to First Token----------------
Mean TTFT (ms): 3472.04
Median TTFT (ms): 3544.04
P99 TTFT (ms): 5023.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 138.56
Median TPOT (ms): 138.05
P99 TPOT (ms): 151.98
---------------Inter-token Latency----------------
Mean ITL (ms): 138.56
Median ITL (ms): 125.58
P99 ITL (ms): 839.30
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 69.30
Total input tokens: 102300
Total generated tokens: 12800
Request throughput (req/s): 1.44
Output token throughput (tok/s): 184.69
Peak output token throughput (tok/s): 600.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 1660.78
---------------Time to First Token----------------
Mean TTFT (ms): 23961.51
Median TTFT (ms): 23834.79
P99 TTFT (ms): 47879.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 331.80
Median TPOT (ms): 333.44
P99 TPOT (ms): 481.50
---------------Inter-token Latency----------------
Mean ITL (ms): 331.80
Median ITL (ms): 182.62
P99 ITL (ms): 1731.80
==================================================
Summary of Results
- GPT-OSS-20B is the throughput champion. It delivers the highest raw performance across all concurrency levels, hitting ~9k tok/s total at 100 concurrent requests with remarkably low TPOT (64 ms). Single-request latency is excellent (TTFT 41 ms), and it scales gracefully under load.
- GPT-OSS-120B offers the best balance for a large model. It has the snappiest single-request behavior (TTFT 37 ms) and maintains reasonable latency even at scale, reaching ~4.7k tok/s total at 100 concurrent. TPOT stays controlled (130 ms at 100 reqs) compared to the other big models.
- Qwen3-VL-32B-FP8 is solid for moderate workloads. Single-request latency is acceptable (TTFT 84 ms), and it reaches ~3k tok/s total at 100 concurrent. However, TTFT climbs significantly under load (~11.6 s at 100 reqs), making it feel sluggish for interactive use at high concurrency.
- Llama-4-Scout-17B-16E-NVFP4 performs similarly to Qwen3-VL-32B under load, with comparable scaling behavior (TTFT ~10 s at 100 reqs, ~2.8k tok/s total), though single-request TTFT is higher (383 ms), likely due to MoE routing overhead.
- Qwen3-VL-235B-A22B-AWQ improves significantly over the FP8 variant at low concurrency. Single-request TPOT drops from 44 ms to 30 ms, and TTFT from 128 ms to 88 ms. At 10 concurrent it's still faster (TPOT 105 ms vs 168 ms), making AWQ worthwhile for latency-sensitive deployments of this model.
- Qwen3-VL-235B-A22B-FP8 is strongly latency-bound. Acceptable at single requests, but TTFT explodes with concurrency (~17.8 s at 100 reqs) and TPOT becomes very high (401 ms). Throughput caps around ~1.5k tok/s total.
- Llama-3.3-70B-NVFP4 struggles with the FP4 quantization overhead. Despite being smaller than GPT-OSS-120B, it's slower across the board: higher TTFT, worse TPOT, and lower throughput (~1.7k tok/s at 100 concurrent).
- GLM-4.6-FP8 degrades the hardest under load. TTFT becomes extreme (~38 s at 100 reqs) and TPOT balloons to 749 ms. Not suitable for interactive or high-concurrency serving.
Now we need NVIDIA to release a DGX desktop 200 GbE switch for handling 4 Sparks.
Nice! So it still scales well with 4 nodes. Pretty much doubles the performance compared to a 2-node cluster for dense models, and still scales nicely for "faster" ones.
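One caveat worth quantifying from the GPT-OSS-120B numbers above: going from 4 to 8 nodes at 100 concurrent requests buys only a modest gain in total token throughput, so the sweet spot for this MoE model may be smaller clusters:

```python
# Scaling check from the GPT-OSS-120B benchmarks above:
# total token throughput at 100 concurrent requests.
tp4 = 4669.03   # tok/s, 4-node tp
tp8 = 5341.94   # tok/s, 8-node tp
speedup = tp8 / tp4
efficiency = speedup / (8 / 4)  # fraction of ideal linear scaling
print(f"speedup {speedup:.2f}x, scaling efficiency {efficiency:.0%}")
```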


