Kimi K2.6 and Qwen 3.5-397B-FP8 on an 8x GB10 cluster

I tried running the moonshotai/Kimi-K2.6 and Qwen 3.5-397B-FP8 models on a cluster of 8 nodes using eugr's vllm with tp 8.

Qwen 3.5-397B-FP8 on 4 nodes, tp 4: 31 t/s (~400 GB)
Qwen 3.5-397B-FP8 on 8 nodes, tp 8: 35 t/s (~400 GB)
Kimi-K2.6 on 8 nodes, tp 8: 12 t/s (~600 GB)

It works, but Kimi-K2.6 is slower than expected: token generation is 12-13 tok/s.
I would also expect a bigger bump from 4 to 8 nodes for Qwen 397B FP8.

I am using a MikroTik switch with 4x 400G-to-2x200G breakout cables, one 200G leg into each node.

What I observed is that adding --no-ray increases the speed by 4-5% on 4 nodes, but on 8 nodes it decreases the speed by roughly 10%.
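A more controlled way to compare the ray and --no-ray runs is vLLM's bundled serving benchmark against the already-running server. A minimal sketch only; flag names can differ between vLLM versions, and the request shape below is an arbitrary assumption:

  # Run from a vLLM source checkout, pointed at the running server.
  # 64 requests of ~1k input / 256 output tokens is just a test shape.
  python benchmarks/benchmark_serving.py \
    --backend vllm \
    --host 127.0.0.1 --port 8000 \
    --model /root/models/Kimi-K2.6 \
    --dataset-name random \
    --random-input-len 1024 --random-output-len 256 \
    --num-prompts 64

(--model is used for both the tokenizer and the request's model field, so it may need to match the server's --served-model-name instead of the path.)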

Any improvement ideas for me to check?

recipe_version: "1"
name: Kimi-K2.6-8xCluster
description: vLLM serving Kimi-K2.6 on 8-node cluster

# Local model path (via NFS mount)

model: /home/ciprian/models/Kimi-K2.6

# Container image to use

container: vllm-node-tf5

build_args:
  - --tf5

# Mods to apply

mods:

# Default settings (can be overridden via CLI)

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 8
  gpu_memory_utilization: 0.79
  max_model_len: auto
  max_num_seqs: 4
  max_num_batched_tokens: 8192

# Environment variables

env:
  OMP_NUM_THREADS: 4
  HF_HUB_OFFLINE: 1
  TRANSFORMERS_OFFLINE: 1
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template

command: |
  vllm serve /root/models/Kimi-K2.6
    --tensor-parallel-size {tensor_parallel}
    --distributed-executor-backend ray
    --gpu-memory-utilization {gpu_memory_utilization}
    --max-model-len {max_model_len}
    --max-num-seqs {max_num_seqs}
    --max-num-batched-tokens {max_num_batched_tokens}
    --enable-prefix-caching
    --enable-chunked-prefill
    --trust-remote-code
    --tool-call-parser kimi_k2
    --enable-auto-tool-choice
    --reasoning-parser kimi_k2
    --served-model-name 'Kimi-K2.6'
    --enable-sleep-mode
    --no-enable-expert-parallel
    --default-chat-template-kwargs '{{"enable_thinking": true}}'
    --limit-mm-per-prompt '{{"video": 0}}'
    --override-generation-config '{{"temperature": 0.95, "top_p": 0.95}}'
    --mm-encoder-tp-mode data
    --host {host}
    --port {port}

Any improvement ideas for me to check?
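One networking-side thing worth checking is which interface and RDMA device NCCL actually picks for the inter-node all-reduce. A sketch of environment variables that could go into the recipe's env: block; the interface name, device name, and GID index below are placeholders to verify on your own nodes (e.g. with ip link, ibv_devinfo, and show_gids):

  export NCCL_DEBUG=INFO               # logs which transport/NIC each rank selects
  export NCCL_SOCKET_IFNAME=enp1s0f0   # placeholder: the 200G interface carrying RoCE traffic
  export NCCL_IB_HCA=mlx5_0            # placeholder: the RDMA device backing that port
  export NCCL_IB_GID_INDEX=3           # commonly the RoCEv2 GID index; confirm with show_gids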

Did you configure flow control and such on the switch as per the recommendations?

I haven't found recommended settings for the MikroTik CRS804 other than setting 200 Gbit speed, MTU 9216, and fec91; the rest is default. Did I miss anything important?
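For reference, those per-port settings look roughly like this in the RouterOS console. A sketch only: the interface name is a placeholder, and the exact property names (especially for forcing the 200G rate) can differ between RouterOS versions, so check them against your switch:

  # Placeholder interface name; repeat for each 200G breakout leg.
  /interface ethernet set qsfp28-1-1 mtu=9216 fec-mode=fec91
  # Plain link-level pause frames are the only flow control this switch offers (no PFC/DCB):
  /interface ethernet set qsfp28-1-1 rx-flow-control=on tx-flow-control=on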

So 2 nodes is around 25 t/s, 4 nodes about 30, 8 nodes about 35 t/s.

Looks like 397B is not hugely improved with more than 2 nodes?

Wondering about DS4 and whether that's more suitable.

Configure Priority-based Flow Control (PFC) or ECN (Explicit Congestion Notification); that should reduce, but not remove, the latency and help with your t/s.

If you only use the switch for IB-over-Ethernet, it should not matter, should it? That would only apply if it is attached to your other network, which it should not be.

As for the poster: the gains from adding more nodes with an MoE are often very small. TP helps more with dense models, but even that has a ceiling.

From 2 to 4 nodes it helped; from 4 to 8 much less. And the --no-ray option, which brings some gains in the 2- and 4-node setups, is slower than ray on 8 nodes. Maybe different settings are needed.
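If the limit really is the per-token all-reduce of pure TP over RoCE, two experiments might be worth a try with this recipe. Both use standard vLLM flags, but whether they actually help on GB10 clusters is an assumption to test, not a recommendation:

  # Variant A: keep TP 8 but shard the MoE experts instead of --no-enable-expert-parallel
  vllm serve /root/models/Kimi-K2.6 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --distributed-executor-backend ray
    # plus the remaining flags from the recipe above

  # Variant B: pipeline across the 8 single-GPU nodes to cut cross-node all-reduce traffic
  vllm serve /root/models/Kimi-K2.6 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 8 \
    --distributed-executor-backend ray
    # plus the remaining flags from the recipe above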

The 2-node run is 4-bit, the 4-node run is FP8.

PFC and ECN would require Data Center Bridging (DCB), and the CRS804-DD, while a nice little switch that does 400G, does not support any of the DCB features in its networking stack.

We're doing RoCE only when clustering Sparks, and DCB wouldn't help, it would just complicate the setup.

I'm curious about using both ports at 200 Gb/s, because everything I've read says the Spark/GX10 backplane can't push enough bandwidth to use both ports at 200 Gb.

Are you sure?

It aggregates to 200G just fine on each port.
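One way to settle it is a point-to-point RDMA bandwidth test between two nodes, first per port and then both ports at once. A sketch using perftest's ib_write_bw; the device names and <nodeA_ip> are placeholders:

  # On node A (server side), one listener per RDMA device:
  ib_write_bw -d mlx5_0 --report_gbits -D 10 -p 18515 &
  ib_write_bw -d mlx5_1 --report_gbits -D 10 -p 18516 &

  # On node B (client side), run both concurrently; if the backplane is the limit,
  # the combined rate will fall well short of 2x 200G:
  ib_write_bw -d mlx5_0 --report_gbits -D 10 -p 18515 <nodeA_ip> &
  ib_write_bw -d mlx5_1 --report_gbits -D 10 -p 18516 <nodeA_ip> &
  wait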

I've also just switched over to tp=4 for 397B, as I was getting some issues running it on tp=2.
I'm getting about 40 tokens a second, but I have some stability issues. I'm uncertain whether it's my recipe or the fact that I have a few other things running on 2 of the Sparks, which also consume memory and push those nodes close to max memory, killing the cluster.
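Since the Spark's memory is unified, one quick pre-launch check is how much memory those other workloads already hold on the two busy nodes; free -h is the main signal, as nvidia-smi's used-memory view can be less meaningful on unified-memory machines. A trivial sketch with placeholder hostnames:

  # Hostnames are placeholders for your own nodes.
  for host in spark1 spark2 spark3 spark4 spark5 spark6 spark7 spark8; do
    echo "== $host =="
    ssh "$host" 'free -h; nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader'
  done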

I would be happy to share my recipe in a day or two, once I can confirm that all is well.