I have 4x clustered currently. Anything anyone is curious to try or see? I'll be expanding it by 2 more tomorrow.
Pictures. Post some pictures please!
Would be good to see if NanoChat could be trained within 24 hours, and whether you'd cross 100k tokens processed per second during training.
Recipe for two nodes: Train nanochat on 2 NVIDIA DGX Sparks.md · GitHub
Precisely one of the experiments I want to check out, including doing a from-scratch pre-train etc.
They're rather good training devices.
Would be good to see the stats of the Mikrotik switch during the session as well. Your test results will be very important for some of the decisions I need to take in the near term about my home lab expansion.
I'm really interested to see how well vLLM scales tensor parallel beyond 2 Sparks.
What switch are you using?
Can you run inference on a few models with tensor-parallel=8?
If possible, these ones, so I could compare to my dual setup:
- Qwen/Qwen3-VL-32B-Instruct-FP8
- GPT-OSS-120B
- QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ
- Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 (I can't run this one though)
- zai-org/GLM-4.6-FP8 (I can't run this either, only the AWQ 4-bit quant)
- any dense model > 100B (like the new Devstral 2)
- And, of course, Deepseek 3.2 :)
I believe it's the MikroTik CRS812 DDQ.
Correct. Using two 400-to-200 Gbps splitters, plus 2 of the 200 Gbps ports, for a total of 6 machines connected via the IB fabric. The bandwidth tests hit more or less the same numbers as direct connect.
Should be 6 TP since it's 6 boxes in this case, yes? Do you have preferred benchmarks or a vLLM image to use? I build my own nightlies, so I'm not sure how it'll align with expectations.
Yes, you are right, for 6 it would be 6 tp. I don't know why I read 6 as 8 units :)
I'm using nightly builds too, with Triton, Torch and FlashInfer from the main branch and cu130 wheels, so no problem there.
For benchmarks, just use vllm bench serve like this:
vllm bench serve \
--backend vllm \
--model Qwen/Qwen3-VL-32B-Instruct-FP8 \
--host spark \
--endpoint /v1/completions \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1 \
--port 8888
For --num-prompts 1, 10 and 100 (in that order).
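To sweep all three concurrency levels in one go, the command above can be wrapped in a small script. `bench_cmd` is a hypothetical helper; the flags simply mirror the `vllm bench serve` invocation quoted above.

```python
# Hypothetical helper to sweep `vllm bench serve` over several --num-prompts
# values. Model, host and port mirror the command quoted in the thread.
import shlex
import subprocess

def bench_cmd(num_prompts: int,
              model: str = "Qwen/Qwen3-VL-32B-Instruct-FP8",
              host: str = "spark", port: int = 8888) -> list[str]:
    """Build the argv for one vllm bench serve run."""
    return [
        "vllm", "bench", "serve",
        "--backend", "vllm",
        "--model", model,
        "--host", host,
        "--endpoint", "/v1/completions",
        "--dataset-name", "sharegpt",
        "--dataset-path", "ShareGPT_V3_unfiltered_cleaned_split.json",
        "--num-prompts", str(num_prompts),
        "--port", str(port),
    ]

if __name__ == "__main__":
    for n in (1, 10, 100):  # the requested order
        cmd = bench_cmd(n)
        print(shlex.join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment to actually run
```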
Only pp works on most models. Guess I need 8 :)
Are you going to combine 50 Gbps ports with a cable like this (1m (3ft) 200G QSFP56 to 4X50G SFP56 Passive DAC Breakout Cable for NVIDIA/Mellanox 30AWG - NADDOD), or replace the switch?
Oh, I forgot about this quirk. Can you at least run on 4 Sparks, with -tp 4? :)
I am probably going to add another switch and bridge them.
edit: also to note, it's much easier to split the DACs than to combine them. If you have an idea for a switch that would let me easily add 2 more nodes (without having to bridge, or costing a billion dollars), I'm all ears.
here is what some traffic looks like during the nanochat training session, the TX is maxing at 24.6 Gbps, but that is somewhat to be expected since the traffic is quite bursty:
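A very rough sanity check on that 24.6 Gbps peak, assuming plain DDP-style bf16 gradient all-reduce over a ring (nanochat's actual Muon/sharded optimizer comms will change the exact volume). The parameter count and step time come from the training log further down the thread:

```python
# Back-of-the-envelope estimate of per-node gradient-sync traffic,
# assuming bf16 (2-byte) gradients and a ring all-reduce every step.
params = 560_988_160   # nanochat model parameters (from the log)
bytes_per_grad = 2     # bf16
world = 6              # nodes
step_s = 5.2           # observed step time at batch size 21

grad_bytes = params * bytes_per_grad
# a ring all-reduce moves ~2*(n-1)/n of the buffer per rank
ring_bytes = grad_bytes * 2 * (world - 1) / world
avg_gbps = ring_bytes * 8 / step_s / 1e9
print(f"~{grad_bytes/1e9:.2f} GB of gradients, "
      f"~{ring_bytes/1e9:.2f} GB on the wire per rank, "
      f"avg ~{avg_gbps:.2f} Gb/s")
```

The average is only a few Gb/s, so a 24.6 Gbps burst during the actual sync window is consistent with the traffic being very bursty rather than link-limited.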
sorry for the late updates, but having to do work things with them first :)
that said, here is the output from a slightly modified nanochat run for the interested:
6 nodes, batch size 21
root@buttercup:/workspace# torchrun --nnodes=6 --nproc_per_node=1 --node_rank=0 --master_addr=$MASTER_ADDR --master_port=29500 -m scripts.base_train -- --max_seq_len=2048 --device_batch_size=21 --total_batch_size=516096
[nanochat ASCII-art banner]
Overriding: max_seq_len = 2048
Overriding: device_batch_size = 21
Overriding: total_batch_size = 516096
Autodetected device type: cuda
/usr/local/lib/python3.12/dist-packages/torch/__init__.py:1614: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = "tf32" or torch.backends.cuda.matmul.fp32_precision = "ieee". Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see
(Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/Context.cpp:45.)
_C._set_float32_matmul_precision(precision)
2025-12-18 01:59:35,110 - nanochat.common - INFO - Distributed world size: 6
Vocab size: 65,536
num_layers: 20
model_dim: 1280
num_heads: 10
num_kv_heads: 10
Tokens / micro-batch / rank: 21 x 2048 = 43,008
Tokens / micro-batch: 258,048
Total batch size 516,096 => gradient accumulation steps: 2
Number of parameters: 560,988,160
Estimated FLOPs per token: 3.491758e+09
Calculated number of iterations from target data:param ratio: 21,739
Total number of training tokens: 11,219,410,944
Tokens : Params ratio: 20.00
Total training FLOPs estimate: 3.917547e+19
Scaling the LR for the AdamW parameters ∝ 1/√(1280/768) = 0.774597
AdamW optimizer: torch.optim.AdamW (world_size=6)
Muon: Grouping 80 params of shape torch.Size([1280, 1280]), device cuda:0, dtype torch.float32
Muon: Grouping 20 params of shape torch.Size([1280, 5120]), device cuda:0, dtype torch.float32
Muon: Grouping 20 params of shape torch.Size([5120, 1280]), device cuda:0, dtype torch.float32
Step 00000 | Validation bpb: 3.3005
step 00000/21739 (0.00%) | loss: 11.090356 | grad norm: 0.4254 | lrm: 1.00 | dt: 11600.72ms | tok/sec: 44,488 | mfu: 2.62 | total time: 0.00m
step 00001/21739 (0.00%) | loss: 10.856434 | grad norm: 12.2922 | lrm: 1.00 | dt: 5202.11ms | tok/sec: 99,208 | mfu: 5.84 | total time: 0.00m
step 00002/21739 (0.01%) | loss: 10.255845 | grad norm: 4.9579 | lrm: 1.00 | dt: 5238.49ms | tok/sec: 98,519 | mfu: 5.80 | total time: 0.00m
step 00003/21739 (0.01%) | loss: 9.647144 | grad norm: 4.5589 | lrm: 1.00 | dt: 5221.48ms | tok/sec: 98,840 | mfu: 5.82 | total time: 0.00m
step 00004/21739 (0.02%) | loss: 9.104345 | grad norm: 6.5869 | lrm: 1.00 | dt: 5257.93ms | tok/sec: 98,155 | mfu: 5.78 | total time: 0.00m
step 00005/21739 (0.02%) | loss: 8.712114 | grad norm: 5.1133 | lrm: 1.00 | dt: 5219.17ms | tok/sec: 98,884 | mfu: 5.82 | total time: 0.00m
step 00006/21739 (0.03%) | loss: 8.421516 | grad norm: 5.5131 | lrm: 1.00 | dt: 5254.20ms | tok/sec: 98,225 | mfu: 5.78 | total time: 0.00m
step 00007/21739 (0.03%) | loss: 8.210005 | grad norm: 5.9095 | lrm: 1.00 | dt: 5228.70ms | tok/sec: 98,704 | mfu: 5.81 | total time: 0.00m
step 00008/21739 (0.04%) | loss: 8.025025 | grad norm: 6.3944 | lrm: 1.00 | dt: 5294.74ms | tok/sec: 97,473 | mfu: 5.74 | total time: 0.00m
step 00009/21739 (0.04%) | loss: 7.857136 | grad norm: 1.9207 | lrm: 1.00 | dt: 5203.78ms | tok/sec: 99,177 | mfu: 5.84 | total time: 0.00m
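The batch-size bookkeeping and the MFU figure in the log above can be reproduced. The per-device peak here is an assumption (NVIDIA quotes roughly 1 PFLOP for the Spark; nanochat's internal constant may differ slightly, which is why this lands near but not exactly on the logged 5.84):

```python
# Reproduce the batch-size arithmetic and estimate MFU from the log.
seq_len = 2048
device_bs = 21
world = 6
total_bs = 516_096
flops_per_tok = 3.491758e9       # from the log
tok_per_s = 99_208               # steady-state tok/sec from the log
peak_flops_per_dev = 1e15        # ASSUMED per-Spark peak

tok_per_rank = device_bs * seq_len        # tokens / micro-batch / rank
tok_per_micro = tok_per_rank * world      # tokens / micro-batch
grad_accum = total_bs // tok_per_micro    # gradient accumulation steps
mfu = tok_per_s * flops_per_tok / (world * peak_flops_per_dev)
print(tok_per_rank, tok_per_micro, grad_accum, f"{mfu:.1%}")
```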
6 nodes, batch size 32.
step 00000/14266 (0.00%) | loss: 11.090355 | grad norm: 0.4273 | lrm: 1.00 | dt: 26923.54ms | tok/sec: 29,209 | mfu: 1.72 | total time: 0.00m
step 00001/14266 (0.01%) | loss: 10.832258 | grad norm: 12.1549 | lrm: 1.00 | dt: 7812.14ms | tok/sec: 100,667 | mfu: 5.92 | total time: 0.00m
step 00002/14266 (0.01%) | loss: 10.247421 | grad norm: 5.1460 | lrm: 1.00 | dt: 7806.38ms | tok/sec: 100,742 | mfu: 5.93 | total time: 0.00m
step 00003/14266 (0.02%) | loss: 9.577996 | grad norm: 4.4216 | lrm: 1.00 | dt: 7871.57ms | tok/sec: 99,907 | mfu: 5.88 | total time: 0.00m
step 00004/14266 (0.03%) | loss: 9.017444 | grad norm: 5.9643 | lrm: 1.00 | dt: 7838.93ms | tok/sec: 100,323 | mfu: 5.90 | total time: 0.00m
step 00005/14266 (0.04%) | loss: 8.618053 | grad norm: 5.0371 | lrm: 1.00 | dt: 7839.13ms | tok/sec: 100,321 | mfu: 5.90 | total time: 0.00m
step 00006/14266 (0.04%) | loss: 8.354537 | grad norm: 4.9139 | lrm: 1.00 | dt: 7892.11ms | tok/sec: 99,647 | mfu: 5.86 | total time: 0.00m
step 00007/14266 (0.05%) | loss: 8.156961 | grad norm: 5.1475 | lrm: 1.00 | dt: 7870.68ms | tok/sec: 99,919 | mfu: 5.88 | total time: 0.00m
step 00008/14266 (0.06%) | loss: 7.977603 | grad norm: 4.9869 | lrm: 1.00 | dt: 7929.55ms | tok/sec: 99,177 | mfu: 5.84 | total time: 0.00m
step 00009/14266 (0.06%) | loss: 7.811369 | grad norm: 2.6385 | lrm: 1.00 | dt: 7891.02ms | tok/sec: 99,661 | mfu: 5.86 | total time: 0.00m
step 00010/14266 (0.07%) | loss: 7.678523 | grad norm: 2.2132 | lrm: 1.00 | dt: 7895.18ms | tok/sec: 99,609 | mfu: 5.86 | total time: 0.00m
step 00011/14266 (0.08%) | loss: 7.566596 | grad norm: 1.9411 | lrm: 1.00 | dt: 7973.75ms | tok/sec: 98,627 | mfu: 5.80 | total time: 0.13m
step 00012/14266 (0.08%) | loss: 7.478057 | grad norm: 3.3588 | lrm: 1.00 | dt: 7997.09ms | tok/sec: 98,339 | mfu: 5.79 | total time: 0.27m
6 nodes, batch size 40:
Step 00000 | Validation bpb: 3.3005
step 00000/11413 (0.00%) | loss: 11.090355 | grad norm: 0.4369 | lrm: 1.00 | dt: 31846.23ms | tok/sec: 30,868 | mfu: 1.82 | total time: 0.00m
step 00001/11413 (0.01%) | loss: 10.837876 | grad norm: 11.9723 | lrm: 1.00 | dt: 10966.73ms | tok/sec: 89,638 | mfu: 5.27 | total time: 0.00m
step 00002/11413 (0.02%) | loss: 10.234440 | grad norm: 5.1827 | lrm: 1.00 | dt: 10978.53ms | tok/sec: 89,542 | mfu: 5.27 | total time: 0.00m
step 00003/11413 (0.03%) | loss: 9.564955 | grad norm: 4.5415 | lrm: 1.00 | dt: 10987.01ms | tok/sec: 89,472 | mfu: 5.26 | total time: 0.00m
step 00004/11413 (0.04%) | loss: 8.999896 | grad norm: 5.9772 | lrm: 1.00 | dt: 11036.76ms | tok/sec: 89,069 | mfu: 5.24 | total time: 0.00m
step 00005/11413 (0.04%) | loss: 8.636850 | grad norm: 5.7050 | lrm: 1.00 | dt: 10912.49ms | tok/sec: 90,083 | mfu: 5.30 | total time: 0.00m
step 00006/11413 (0.05%) | loss: 8.344168 | grad norm: 3.5539 | lrm: 1.00 | dt: 10990.30ms | tok/sec: 89,446 | mfu: 5.26 | total time: 0.00m
step 00007/11413 (0.06%) | loss: 8.109464 | grad norm: 4.5131 | lrm: 1.00 | dt: 10978.89ms | tok/sec: 89,539 | mfu: 5.27 | total time: 0.00m
step 00008/11413 (0.07%) | loss: 7.942097 | grad norm: 6.6981 | lrm: 1.00 | dt: 11064.19ms | tok/sec: 88,848 | mfu: 5.23 | total time: 0.00m
step 00009/11413 (0.08%) | loss: 7.762931 | grad norm: 1.8954 | lrm: 1.00 | dt: 10984.50ms | tok/sec: 89,493 | mfu: 5.27 | total time: 0.00m
step 00010/11413 (0.09%) | loss: 7.614245 | grad norm: 2.3076 | lrm: 1.00 | dt: 10977.75ms | tok/sec: 89,548 | mfu: 5.27 | total time: 0.00m
step 00011/11413 (0.10%) | loss: 7.505912 | grad norm: 1.7908 | lrm: 1.00 | dt: 11068.57ms | tok/sec: 88,813 | mfu: 5.23 | total time: 0.18m
step 00012/11413 (0.11%) | loss: 7.403605 | grad norm: 3.6414 | lrm: 1.00 | dt: 11041.89ms | tok/sec: 89,028 | mfu: 5.24 | total time: 0.37m
step 00013/11413 (0.11%) | loss: 7.300447 | grad norm: 1.6861 | lrm: 1.00 | dt: 11145.53ms | tok/sec: 88,200 | mfu: 5.19 | total time: 0.55m
edit: added a nice little nvitop view. Training gets the power draw higher than you really see it anywhere else, same with the temps. Will need to do something about that tomorrow.
looks like we get just under 100k tps
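That also answers the 24-hour question from earlier in the thread, at least for this config. At just under 100k tok/s, the full token budget from the log works out to a bit over a day:

```python
# Estimate wall-clock time for the full pre-train at the observed rate.
total_tokens = 11_219_410_944   # "Total number of training tokens" from the log
tok_per_s = 99_000              # just under 100k tok/s
hours = total_tokens / tok_per_s / 3600
print(f"~{hours:.1f} h")        # ~31.5 h
```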
Command templates:
docker exec -it vllm_node bash -i -c "vllm serve M --host 0.0.0.0 --trust_remote_code --gpu-memory-utilization 0.8 -pp 1 -tp X --distributed-executor-backend ray --load-format fastsafetensors --kv-cache-dtype fp8"
vllm bench serve --backend vllm --model M --host 10.20.0.4 --endpoint /v1/completions --hf-name sharegpt --num-prompts X --port 8000
Qwen/Qwen3-VL-32B-Instruct-FP8
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 6.68
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.15
Output token throughput (tok/s): 19.15
Peak output token throughput (tok/s): 20.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 172.39
---------------Time to First Token----------------
Mean TTFT (ms): 83.91
Median TTFT (ms): 83.91
P99 TTFT (ms): 83.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 51.95
Median TPOT (ms): 51.95
P99 TPOT (ms): 51.95
---------------Inter-token Latency----------------
Mean ITL (ms): 51.95
Median ITL (ms): 51.75
P99 ITL (ms): 55.00
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 10.52
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.95
Output token throughput (tok/s): 121.64
Peak output token throughput (tok/s): 170.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 1094.79
---------------Time to First Token----------------
Mean TTFT (ms): 1693.08
Median TTFT (ms): 1731.72
P99 TTFT (ms): 2623.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 68.42
Median TPOT (ms): 68.12
P99 TPOT (ms): 76.79
---------------Inter-token Latency----------------
Mean ITL (ms): 68.42
Median ITL (ms): 62.29
P99 ITL (ms): 515.58
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 37.87
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 2.64
Output token throughput (tok/s): 338.03
Peak output token throughput (tok/s): 900.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 3042.23
---------------Time to First Token----------------
Mean TTFT (ms): 11606.98
Median TTFT (ms): 11121.38
P99 TTFT (ms): 24464.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 190.38
Median TPOT (ms): 193.39
P99 TPOT (ms): 260.85
---------------Inter-token Latency----------------
Mean ITL (ms): 190.38
Median ITL (ms): 111.25
P99 ITL (ms): 539.90
==================================================
Qwen/Qwen3-VL-235B-A22B-Instruct-FP8
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 5.76
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.17
Output token throughput (tok/s): 22.23
Peak output token throughput (tok/s): 23.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 200.04
---------------Time to First Token----------------
Mean TTFT (ms): 127.74
Median TTFT (ms): 127.74
P99 TTFT (ms): 127.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 44.34
Median TPOT (ms): 44.34
P99 TPOT (ms): 44.34
---------------Inter-token Latency----------------
Mean ITL (ms): 44.34
Median ITL (ms): 43.91
P99 ITL (ms): 47.46
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 24.28
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.41
Output token throughput (tok/s): 52.72
Peak output token throughput (tok/s): 70.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 474.52
---------------Time to First Token----------------
Mean TTFT (ms): 2665.94
Median TTFT (ms): 2717.83
P99 TTFT (ms): 4135.27
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 168.49
Median TPOT (ms): 168.25
P99 TPOT (ms): 180.70
---------------Inter-token Latency----------------
Mean ITL (ms): 168.49
Median ITL (ms): 161.96
P99 ITL (ms): 788.41
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 74.38
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 1.34
Output token throughput (tok/s): 172.08
Peak output token throughput (tok/s): 400.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 1548.71
---------------Time to First Token----------------
Mean TTFT (ms): 17847.62
Median TTFT (ms): 17271.93
P99 TTFT (ms): 38251.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 401.05
Median TPOT (ms): 405.94
P99 TPOT (ms): 489.78
---------------Inter-token Latency----------------
Mean ITL (ms): 401.05
Median ITL (ms): 301.32
P99 ITL (ms): 846.77
==================================================
QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 3.95
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.25
Output token throughput (tok/s): 32.37
Peak output token throughput (tok/s): 33.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 291.31
---------------Time to First Token----------------
Mean TTFT (ms): 88.34
Median TTFT (ms): 88.34
P99 TTFT (ms): 88.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 30.44
Median TPOT (ms): 30.44
P99 TPOT (ms): 30.44
---------------Inter-token Latency----------------
Mean ITL (ms): 30.44
Median ITL (ms): 30.26
P99 ITL (ms): 32.65
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 15.86
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.63
Output token throughput (tok/s): 80.71
Peak output token throughput (tok/s): 110.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 726.39
---------------Time to First Token----------------
Mean TTFT (ms): 2363.01
Median TTFT (ms): 2390.65
P99 TTFT (ms): 3841.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 105.40
Median TPOT (ms): 105.28
P99 TPOT (ms): 116.69
---------------Inter-token Latency----------------
Mean ITL (ms): 105.40
Median ITL (ms): 96.24
P99 ITL (ms): 660.66
==================================================
GPT-OSS-20B
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 1.50
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.67
Output token throughput (tok/s): 85.25
Peak output token throughput (tok/s): 83.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 767.26
---------------Time to First Token----------------
Mean TTFT (ms): 41.07
Median TTFT (ms): 41.07
P99 TTFT (ms): 41.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 11.50
Median TPOT (ms): 11.50
P99 TPOT (ms): 11.50
---------------Inter-token Latency----------------
Mean ITL (ms): 11.50
Median ITL (ms): 10.66
P99 ITL (ms): 20.37
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 3.70
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 2.71
Output token throughput (tok/s): 346.25
Peak output token throughput (tok/s): 460.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 3116.22
---------------Time to First Token----------------
Mean TTFT (ms): 643.37
Median TTFT (ms): 613.75
P99 TTFT (ms): 1084.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 23.79
Median TPOT (ms): 24.05
P99 TPOT (ms): 27.63
---------------Inter-token Latency----------------
Mean ITL (ms): 23.79
Median ITL (ms): 19.11
P99 ITL (ms): 179.76
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 12.81
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 7.80
Output token throughput (tok/s): 999.02
Peak output token throughput (tok/s): 2800.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 8991.14
---------------Time to First Token----------------
Mean TTFT (ms): 3979.85
Median TTFT (ms): 3904.99
P99 TTFT (ms): 8486.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 64.12
Median TPOT (ms): 65.14
P99 TPOT (ms): 87.99
---------------Inter-token Latency----------------
Mean ITL (ms): 64.12
Median ITL (ms): 35.34
P99 ITL (ms): 187.69
==================================================
GPT-OSS-120B
8 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 2.07
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.48
Output token throughput (tok/s): 61.86
Peak output token throughput (tok/s): 64.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 556.70
---------------Time to First Token----------------
Mean TTFT (ms): 47.08
Median TTFT (ms): 47.08
P99 TTFT (ms): 47.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 15.92
Median TPOT (ms): 15.92
P99 TPOT (ms): 15.92
---------------Inter-token Latency----------------
Mean ITL (ms): 15.92
Median ITL (ms): 14.10
P99 ITL (ms): 24.64
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 5.01
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 2.00
Output token throughput (tok/s): 255.50
Peak output token throughput (tok/s): 360.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 2299.53
---------------Time to First Token----------------
Mean TTFT (ms): 877.16
Median TTFT (ms): 928.99
P99 TTFT (ms): 1400.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 32.17
Median TPOT (ms): 31.83
P99 TPOT (ms): 37.86
---------------Inter-token Latency----------------
Mean ITL (ms): 32.17
Median ITL (ms): 25.62
P99 ITL (ms): 214.43
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 21.57
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 4.64
Output token throughput (tok/s): 593.55
Peak output token throughput (tok/s): 1600.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 5341.94
---------------Time to First Token----------------
Mean TTFT (ms): 6357.06
Median TTFT (ms): 6132.10
P99 TTFT (ms): 13638.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 111.45
Median TPOT (ms): 114.05
P99 TPOT (ms): 149.78
---------------Inter-token Latency----------------
Mean ITL (ms): 111.45
Median ITL (ms): 63.22
P99 ITL (ms): 306.93
==================================================
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 1.79
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.56
Output token throughput (tok/s): 71.54
Peak output token throughput (tok/s): 71.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 643.89
---------------Time to First Token----------------
Mean TTFT (ms): 37.10
Median TTFT (ms): 37.10
P99 TTFT (ms): 37.10
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 13.79
Median TPOT (ms): 13.79
P99 TPOT (ms): 13.79
---------------Inter-token Latency----------------
Mean ITL (ms): 13.79
Median ITL (ms): 13.73
P99 ITL (ms): 15.01
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 5.21
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 1.92
Output token throughput (tok/s): 245.57
Peak output token throughput (tok/s): 340.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 2210.11
---------------Time to First Token----------------
Mean TTFT (ms): 908.51
Median TTFT (ms): 962.65
P99 TTFT (ms): 1462.06
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 33.48
Median TPOT (ms): 33.10
P99 TPOT (ms): 38.47
---------------Inter-token Latency----------------
Mean ITL (ms): 33.48
Median ITL (ms): 29.98
P99 ITL (ms): 298.82
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 24.67
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 4.05
Output token throughput (tok/s): 518.78
Peak output token throughput (tok/s): 1300.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 4669.03
---------------Time to First Token----------------
Mean TTFT (ms): 6719.06
Median TTFT (ms): 6547.16
P99 TTFT (ms): 14641.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 130.26
Median TPOT (ms): 132.70
P99 TPOT (ms): 167.78
---------------Inter-token Latency----------------
Mean ITL (ms): 130.26
Median ITL (ms): 84.86
P99 ITL (ms): 424.07
==================================================
zai-org/GLM-4.6-FP8
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 8.40
Total input tokens: 1024
Total generated tokens: 128
Request throughput (req/s): 0.12
Output token throughput (tok/s): 15.23
Peak output token throughput (tok/s): 16.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 137.08
---------------Time to First Token----------------
Mean TTFT (ms): 224.50
Median TTFT (ms): 224.50
P99 TTFT (ms): 224.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 64.40
Median TPOT (ms): 64.40
P99 TPOT (ms): 64.40
---------------Inter-token Latency----------------
Mean ITL (ms): 64.40
Median ITL (ms): 64.32
P99 ITL (ms): 66.00
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 40.67
Total input tokens: 10240
Total generated tokens: 1280
Request throughput (req/s): 0.25
Output token throughput (tok/s): 31.48
Peak output token throughput (tok/s): 40.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 283.28
---------------Time to First Token----------------
Mean TTFT (ms): 5797.18
Median TTFT (ms): 5759.51
P99 TTFT (ms): 8694.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 272.29
Median TPOT (ms): 272.72
P99 TPOT (ms): 301.66
---------------Inter-token Latency----------------
Mean ITL (ms): 272.29
Median ITL (ms): 257.01
P99 ITL (ms): 1718.29
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 142.28
Total input tokens: 102400
Total generated tokens: 12800
Request throughput (req/s): 0.70
Output token throughput (tok/s): 89.96
Peak output token throughput (tok/s): 200.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 809.67
---------------Time to First Token----------------
Mean TTFT (ms): 38421.92
Median TTFT (ms): 36869.57
P99 TTFT (ms): 81050.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 749.07
Median TPOT (ms): 761.59
P99 TPOT (ms): 962.73
---------------Inter-token Latency----------------
Mean ITL (ms): 749.07
Median ITL (ms): 514.18
P99 ITL (ms): 1774.77
==================================================
nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 4.58
Total input tokens: 1023
Total generated tokens: 128
Request throughput (req/s): 0.22
Output token throughput (tok/s): 27.93
Peak output token throughput (tok/s): 30.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 251.18
---------------Time to First Token----------------
Mean TTFT (ms): 382.88
Median TTFT (ms): 382.88
P99 TTFT (ms): 382.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 33.07
Median TPOT (ms): 33.07
P99 TPOT (ms): 33.07
---------------Inter-token Latency----------------
Mean ITL (ms): 33.07
Median ITL (ms): 30.99
P99 ITL (ms): 43.77
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 11.95
Total input tokens: 10230
Total generated tokens: 1280
Request throughput (req/s): 0.84
Output token throughput (tok/s): 107.10
Peak output token throughput (tok/s): 150.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 963.11
---------------Time to First Token----------------
Mean TTFT (ms): 1712.12
Median TTFT (ms): 1843.62
P99 TTFT (ms): 2657.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 76.48
Median TPOT (ms): 74.81
P99 TPOT (ms): 84.39
---------------Inter-token Latency----------------
Mean ITL (ms): 76.48
Median ITL (ms): 67.11
P99 ITL (ms): 394.16
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 41.58
Total input tokens: 102300
Total generated tokens: 12800
Request throughput (req/s): 2.40
Output token throughput (tok/s): 307.83
Peak output token throughput (tok/s): 800.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 2768.06
---------------Time to First Token----------------
Mean TTFT (ms): 10097.28
Median TTFT (ms): 9424.95
P99 TTFT (ms): 22795.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 222.56
Median TPOT (ms): 227.89
P99 TPOT (ms): 269.39
---------------Inter-token Latency----------------
Mean ITL (ms): 222.56
Median ITL (ms): 140.08
P99 ITL (ms): 618.16
==================================================
nvidia/Llama-3.3-70B-Instruct-NVFP4
4 node (tp)
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 10.03
Total input tokens: 1023
Total generated tokens: 128
Request throughput (req/s): 0.10
Output token throughput (tok/s): 12.77
Peak output token throughput (tok/s): 14.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 114.79
---------------Time to First Token----------------
Mean TTFT (ms): 312.52
Median TTFT (ms): 312.52
P99 TTFT (ms): 312.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 76.49
Median TPOT (ms): 76.49
P99 TPOT (ms): 76.49
---------------Inter-token Latency----------------
Mean ITL (ms): 76.49
Median ITL (ms): 73.79
P99 ITL (ms): 87.22
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 21.36
Total input tokens: 10230
Total generated tokens: 1280
Request throughput (req/s): 0.47
Output token throughput (tok/s): 59.93
Peak output token throughput (tok/s): 90.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 538.92
---------------Time to First Token----------------
Mean TTFT (ms): 3472.04
Median TTFT (ms): 3544.04
P99 TTFT (ms): 5023.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 138.56
Median TPOT (ms): 138.05
P99 TPOT (ms): 151.98
---------------Inter-token Latency----------------
Mean ITL (ms): 138.56
Median ITL (ms): 125.58
P99 ITL (ms): 839.30
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 69.30
Total input tokens: 102300
Total generated tokens: 12800
Request throughput (req/s): 1.44
Output token throughput (tok/s): 184.69
Peak output token throughput (tok/s): 600.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 1660.78
---------------Time to First Token----------------
Mean TTFT (ms): 23961.51
Median TTFT (ms): 23834.79
P99 TTFT (ms): 47879.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 331.80
Median TPOT (ms): 333.44
P99 TPOT (ms): 481.50
---------------Inter-token Latency----------------
Mean ITL (ms): 331.80
Median ITL (ms): 182.62
P99 ITL (ms): 1731.80
==================================================
Summary of Results
- GPT-OSS-20B is the throughput champion. It delivers the highest raw performance across all concurrency levels, hitting ~9k tok/s total at 100 concurrent requests with remarkably low TPOT (64 ms). Single-request latency is excellent (TTFT 41 ms), and it scales gracefully under load.
- GPT-OSS-120B offers the best balance for a large model. It has the snappiest single-request behavior (TTFT 37 ms) and maintains reasonable latency even at scale, reaching ~4.7k tok/s total at 100 concurrent. TPOT stays controlled (130 ms at 100 reqs) compared to the other big models.
- Qwen3-VL-32B-FP8 is solid for moderate workloads. Single-request latency is acceptable (TTFT 84 ms), and it reaches ~3k tok/s total at 100 concurrent. However, TTFT climbs significantly under load (~11.6 s at 100 reqs), making it feel sluggish for interactive use at high concurrency.
- Llama-4-Scout-17B-16E-NVFP4 performs similarly to Qwen3-VL-32B under load, with comparable scaling behavior (TTFT ~10 s at 100 reqs, ~2.8k tok/s total), though single-request TTFT is higher (383 ms), likely due to MoE routing overhead.
- Qwen3-VL-235B-A22B-AWQ improves significantly over the FP8 variant at low concurrency. Single-request TPOT drops from 44 ms to 30 ms, and TTFT from 128 ms to 88 ms. At 10 concurrent it's still faster (TPOT 105 ms vs 168 ms), making AWQ worthwhile for latency-sensitive deployments of this model.
- Qwen3-VL-235B-A22B-FP8 is strongly latency-bound. Acceptable at single requests, but TTFT explodes with concurrency (~17.8 s at 100 reqs) and TPOT becomes very high (401 ms). Throughput caps around ~1.5k tok/s total.
- Llama-3.3-70B-NVFP4 struggles with the FP4 quantization overhead. Despite being smaller than GPT-OSS-120B, it's slower across the board: higher TTFT, worse TPOT, and lower throughput (~1.7k tok/s at 100 concurrent).
- GLM-4.6-FP8 degrades the hardest under load. TTFT becomes extreme (~38 s at 100 reqs) and TPOT balloons to 749 ms. Not suitable for interactive or high-concurrency serving.
Now we need NVIDIA to release a DGX desktop 200 GbE switch for handling 4 Sparks.
Nice! So it still scales well with 4 nodes. Pretty much doubles the performance compared to a 2-node cluster for dense models, and still scales nicely for "faster" ones.
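One caveat worth quantifying from the GPT-OSS-120B numbers above: going from 4 to 8 nodes at 100 concurrent requests buys only a modest gain in total token throughput, so the sweet spot for this MoE model may be smaller clusters:

```python
# Scaling check from the GPT-OSS-120B benchmarks above:
# total token throughput at 100 concurrent requests.
tp4 = 4669.03   # tok/s, 4-node tp
tp8 = 5341.94   # tok/s, 8-node tp
speedup = tp8 / tp4
efficiency = speedup / (8 / 4)  # fraction of ideal linear scaling
print(f"speedup {speedup:.2f}x, scaling efficiency {efficiency:.0%}")
```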


