I tried running the moonshotai/Kimi-K2.6 and Qwen 3.5-397B-FP8 models on a cluster of 8 nodes using eugr's vLLM with tp 8.
Qwen 3.5-397B-FP8 on 4 nodes, tp 4: 31 t/s (~400 GB)
Qwen 3.5-397B-FP8 on 8 nodes, tp 8: 35 t/s (~400 GB)
Kimi-K2.6 on 8 nodes, tp 8: 12 t/s (~600 GB)
It works, but Kimi-K2.6 is slower than expected: token generation is 12-13 tok/s.
I would also expect a bigger bump from 4 to 8 nodes for Qwen 397B FP8; a rough estimate of why TP may not scale across nodes is sketched below.
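Back-of-the-envelope check (a sketch with assumed numbers: I am taking DeepSeek-V3-class dims for Kimi-K2.6, hidden_size ≈ 7168 and ~61 layers, which may be off):

  all-reduces per decoded token: ~2 per layer x 61 layers ≈ 122
  payload per all-reduce at batch 1: 7168 x 2 bytes (bf16) ≈ 14 KB
  ring traffic per token per rank: 122 x 14 KB x 2(p-1)/p ≈ 3 MB at p = 8

At 12 tok/s that is only ~36 MB/s on the wire, so the 200G links are nearly idle during decode; the cost is per-operation latency. If each cross-node all-reduce takes ~0.5 ms, the 122 of them alone cap generation at ~16 tok/s, and that latency does not shrink going from 4 to 8 nodes (if anything it grows with more participants). If that is the bottleneck, pipeline parallelism across nodes (--pipeline-parallel-size) trades the per-layer all-reduces for one point-to-point transfer per stage boundary and might be worth testing.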
I am using a MikroTik switch with 4 x 400G ports split via 2x200G breakout cables, one 200G link to each node.
What I observed is that adding --no-ray increases speed by 4-5% on 4 nodes, but on 8 nodes it decreases speed by ~10%. A sketch of how to measure the raw link and all-reduce performance is below.
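To separate fabric problems from vLLM problems, one check is to benchmark the links directly with iperf3 and NVIDIA's nccl-tests (hostnames and the build path are placeholders for my setup):

  # raw TCP throughput between two nodes
  iperf3 -s                      # on node1
  iperf3 -c node1 -P 4 -t 10     # on node2

  # the collective that dominates TP decode: all-reduce over all 8 nodes, 1 GPU each,
  # message sizes swept from 8 B to 128 MB (github.com/NVIDIA/nccl-tests)
  NCCL_DEBUG=INFO mpirun -np 8 \
    -H node1:1,node2:1,node3:1,node4:1,node5:1,node6:1,node7:1,node8:1 \
    ./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

The small-message end of the all_reduce_perf table (8 B - 64 KB) is what matters for decode latency, and NCCL_DEBUG=INFO shows whether NCCL actually picked the 200G interfaces or silently fell back to another NIC.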
Any improvement ideas for me to check?
recipe_version: "1"
name: Kimi-K2.6-8xCluster
description: vLLM serving Kimi-K2.6 on 8-node cluster
# Local model path (via NFS mount)
model: /home/ciprian/models/Kimi-K2.6
# Container image to use
container: vllm-node-tf5
build_args:
  - --tf5
# Mods to apply
mods:
# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 8
  gpu_memory_utilization: 0.79
  max_model_len: auto
  max_num_seqs: 4
  max_num_batched_tokens: 8192
# Environment variables
env:
  OMP_NUM_THREADS: 4
  HF_HUB_OFFLINE: 1
  TRANSFORMERS_OFFLINE: 1
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
# The vLLM serve command template
command: |
  vllm serve /root/models/Kimi-K2.6 \
    --tensor-parallel-size {tensor_parallel} \
    --distributed-executor-backend ray \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --tool-call-parser kimi_k2 \
    --enable-auto-tool-choice \
    --reasoning-parser kimi_k2 \
    --served-model-name "Kimi-K2.6" \
    --enable-sleep-mode \
    --no-enable-expert-parallel \
    --default-chat-template-kwargs '{{"enable_thinking": true}}' \
    --limit-mm-per-prompt '{{"video": 0}}' \
    --override-generation-config '{{"temperature": 0.95, "top_p": 0.95}}' \
    --mm-encoder-tp-mode data \
    --host {host} \
    --port {port}
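One thing I have not set yet is pinning NCCL/Gloo to the 200G interfaces in the env section; a silent fallback to a management NIC could contribute to exactly these symptoms. A sketch of the extra entries (the interface and HCA names are guesses for my hardware; the real names come from ip link / ibv_devices):

  env:
    NCCL_SOCKET_IFNAME: enp1s0f0   # placeholder name for the 200G NIC
    GLOO_SOCKET_IFNAME: enp1s0f0
    NCCL_IB_HCA: mlx5_0            # only if the NICs support RoCE
    NCCL_IB_GID_INDEX: 3           # common RoCEv2 GID index; verify with show_gids
    NCCL_DEBUG: INFO               # temporary, to confirm the chosen transport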