Kimi K2.6 and Qwen 3.5-397B-FP8 on an 8x GB10 cluster

I tried running the moonshotai/Kimi-K2.6 and Qwen 3.5-397B-FP8 models on a cluster of 8 nodes using eugr's vllm with tp 8.

Qwen 3.5-397B-FP8 on 4 nodes, tp 4: 31 t/s (~400 GB)
Qwen 3.5-397B-FP8 on 8 nodes, tp 8: 35 t/s (~400 GB)
Kimi-K2.6 on 8 nodes, tp 8: 12 t/s (~600 GB)

It works, but Kimi-K2.6 is slower than expected: token generation is 12-13 tok/s.
I would also expect a bigger bump from 4 to 8 nodes for Qwen 397B FP8.

I am using a MikroTik switch with 4x 400G-to-2x200G breakout cables, one 200G leg into each node.

What I observed is that adding --no-ray increases the speed by 4-5% on 4 nodes, but on 8 nodes it decreases the speed by roughly 10%.
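A more controlled way to compare the ray and --no-ray runs is vLLM's bundled serving benchmark against the already-running server. A minimal sketch only; flag names can differ between vLLM versions, and the request shape below is an arbitrary assumption:

  # Run from a vLLM source checkout, pointed at the running server.
  # 64 requests of ~1k input / 256 output tokens is just a test shape.
  python benchmarks/benchmark_serving.py \
    --backend vllm \
    --host 127.0.0.1 --port 8000 \
    --model /root/models/Kimi-K2.6 \
    --dataset-name random \
    --random-input-len 1024 --random-output-len 256 \
    --num-prompts 64

(--model is used for both the tokenizer and the request's model field, so it may need to match the server's --served-model-name instead of the path.)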

Any improvement ideas for me to check?

recipe_version: "1"
name: Kimi-K2.6-8xCluster
description: vLLM serving Kimi-K2.6 on 8-node cluster

# Local model path (via NFS mount)

model: /home/ciprian/models/Kimi-K2.6

# Container image to use

container: vllm-node-tf5

build_args:
  - --tf5

# Mods to apply

mods:

# Default settings (can be overridden via CLI)

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 8
  gpu_memory_utilization: 0.79
  max_model_len: auto
  max_num_seqs: 4
  max_num_batched_tokens: 8192

# Environment variables

env:
  OMP_NUM_THREADS: 4
  HF_HUB_OFFLINE: 1
  TRANSFORMERS_OFFLINE: 1
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template

command: |
  vllm serve /root/models/Kimi-K2.6
    --tensor-parallel-size {tensor_parallel}
    --distributed-executor-backend ray
    --gpu-memory-utilization {gpu_memory_utilization}
    --max-model-len {max_model_len}
    --max-num-seqs {max_num_seqs}
    --max-num-batched-tokens {max_num_batched_tokens}
    --enable-prefix-caching
    --enable-chunked-prefill
    --trust-remote-code
    --tool-call-parser kimi_k2
    --enable-auto-tool-choice
    --reasoning-parser kimi_k2
    --served-model-name 'Kimi-K2.6'
    --enable-sleep-mode
    --no-enable-expert-parallel
    --default-chat-template-kwargs '{{"enable_thinking": true}}'
    --limit-mm-per-prompt '{{"video": 0}}'
    --override-generation-config '{{"temperature": 0.95, "top_p": 0.95}}'
    --mm-encoder-tp-mode data
    --host {host}
    --port {port}

Any improvement ideas for me to check?
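One networking-side thing worth checking is which interface and RDMA device NCCL actually picks for the inter-node all-reduce. A sketch of environment variables that could go into the recipe's env: block; the interface name, device name, and GID index below are placeholders to verify on your own nodes (e.g. with ip link, ibv_devinfo, and show_gids):

  export NCCL_DEBUG=INFO               # logs which transport/NIC each rank selects
  export NCCL_SOCKET_IFNAME=enp1s0f0   # placeholder: the 200G interface carrying RoCE traffic
  export NCCL_IB_HCA=mlx5_0            # placeholder: the RDMA device backing that port
  export NCCL_IB_GID_INDEX=3           # commonly the RoCEv2 GID index; confirm with show_gids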

Did you configure flow control and such on the switch as per the recommendations?

I haven't found recommended settings for the MikroTik CRS804 other than setting 200 Gbit speed, MTU 9216, and fec91; the rest is default. Did I miss anything important?
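For reference, those per-port settings look roughly like this in the RouterOS console. A sketch only: the interface name is a placeholder, and the exact property names (especially for forcing the 200G rate) can differ between RouterOS versions, so check them against your switch:

  # Placeholder interface name; repeat for each 200G breakout leg.
  /interface ethernet set qsfp28-1-1 mtu=9216 fec-mode=fec91
  # Plain link-level pause frames are the only flow control this switch offers (no PFC/DCB):
  /interface ethernet set qsfp28-1-1 rx-flow-control=on tx-flow-control=on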

So 2 nodes is around 25 t/s, 4 nodes about 30, 8 nodes about 35 t/s.

Looks like 397B is not hugely improved with more than 2 nodes?

Wondering about DS4 and whether that's more suitable.

Configure Priority-based Flow Control (PFC) or ECN (Explicit Congestion Notification); that should reduce, but not remove, the latency and help with your t/s.

If you only use the switch for IB-over-Ethernet, it should not matter, should it? That would only apply if it is attached to your other network, which it should not be.

As for the poster: the gains from adding more nodes with an MoE are often very small. TP helps more with dense models, but even that has a ceiling.

From 2 to 4 nodes it helped; from 4 to 8 much less. And the --no-ray option, which brings some gains in the 2- and 4-node setups, is slower than ray on 8 nodes. Maybe different settings are needed.
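If the limit really is the per-token all-reduce of pure TP over RoCE, two experiments might be worth a try with this recipe. Both use standard vLLM flags, but whether they actually help on GB10 clusters is an assumption to test, not a recommendation:

  # Variant A: keep TP 8 but shard the MoE experts instead of --no-enable-expert-parallel
  vllm serve /root/models/Kimi-K2.6 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --distributed-executor-backend ray
    # plus the remaining flags from the recipe above

  # Variant B: pipeline across the 8 single-GPU nodes to cut cross-node all-reduce traffic
  vllm serve /root/models/Kimi-K2.6 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 8 \
    --distributed-executor-backend ray
    # plus the remaining flags from the recipe above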

The 2-node run is 4-bit, the 4-node run is FP8.

PFC and ECN would require Data Center Bridging (DCB), and the CRS804-DD, while a nice little switch that does 400G, does not support any of the DCB features in its networking stack.

We're doing RoCE only when clustering Sparks, and DCB wouldn't help, it would just complicate the setup.

I'm curious about using both ports at 200 Gb/s, because everything I've read says the Spark/GX10 backplane can't push enough bandwidth to use both ports at 200 Gb.

Are you sure?

It aggregates to 200G just fine on each port.
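One way to settle it is a point-to-point RDMA bandwidth test between two nodes, first per port and then both ports at once. A sketch using perftest's ib_write_bw; the device names and <nodeA_ip> are placeholders:

  # On node A (server side), one listener per RDMA device:
  ib_write_bw -d mlx5_0 --report_gbits -D 10 -p 18515 &
  ib_write_bw -d mlx5_1 --report_gbits -D 10 -p 18516 &

  # On node B (client side), run both concurrently; if the backplane is the limit,
  # the combined rate will fall well short of 2x 200G:
  ib_write_bw -d mlx5_0 --report_gbits -D 10 -p 18515 <nodeA_ip> &
  ib_write_bw -d mlx5_1 --report_gbits -D 10 -p 18516 <nodeA_ip> &
  wait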

I've also just switched over to tp=4 for 397B, as I was getting some issues running it on tp=2.
I'm getting about 40 tokens a second, but I have some stability issues. I'm uncertain whether it's my recipe or the fact that I have a few other things running on 2 of the Sparks, which also consume memory and push those nodes close to max memory, killing the cluster.
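Since the Spark's memory is unified, one quick pre-launch check is how much memory those other workloads already hold on the two busy nodes; free -h is the main signal, as nvidia-smi's used-memory view can be less meaningful on unified-memory machines. A trivial sketch with placeholder hostnames:

  # Hostnames are placeholders for your own nodes.
  for host in spark1 spark2 spark3 spark4 spark5 spark6 spark7 spark8; do
    echo "== $host =="
    ssh "$host" 'free -h; nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader'
  done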

I would be happy to share my recipe in a day or two, once I can confirm that all is well.