I was finally able to connect two Sparks, thanks.
I can run Qwen3.5-35b on the cluster with no issue, but I am having problems with dense models on the cluster. On a single node the model responds well, but on the cluster it generates tokens that seem to be stuck in a loop, and the output is corrupted. This happens with Qwen3.5-2b on the cluster. I also don't understand why such a small model uses so much VRAM (see the nvidia-smi output from both nodes below).
`Fri Apr 3 15:25:55 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142 Driver Version: 580.142 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 On | N/A |
| N/A 61C P0 29W / N/A | Not Supported | 88% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1212 C python3 223MiB |
| 0 N/A N/A 1276 C sglang::scheduler_TP0 93513MiB |
+-----------------------------------------------------------------------------------------+`
Fri Apr 3 15:26:29 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142 Driver Version: 580.142 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
| N/A 59C P0 23W / N/A | Not Supported | 90% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 131 C python3 170MiB |
| 0 N/A N/A 230 C sglang::scheduler_TP1 93537MiB |
+-----------------------------------------------------------------------------------------+
Below are the logs from the master node:
[2026-04-03 15:32:30 TP0] Decode batch, #running-req: 1, #full token: 26427, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 42.25, #queue-req: 0
[2026-04-03 15:32:31 TP0] Decode batch, #running-req: 1, #full token: 26467, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 43.38, #queue-req: 0
[2026-04-03 15:32:32 TP0] Decode batch, #running-req: 1, #full token: 26507, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 40.46, #queue-req: 0
[2026-04-03 15:32:33 TP0] Decode batch, #running-req: 1, #full token: 26547, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 41.14, #queue-req: 0
[2026-04-03 15:32:34 TP0] Decode batch, #running-req: 1, #full token: 26587, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 40.92, #queue-req: 0
[2026-04-03 15:32:35 TP0] Decode batch, #running-req: 1, #full token: 26627, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 41.96, #queue-req: 0
[2026-04-03 15:32:36 TP0] Decode batch, #running-req: 1, #full token: 26667, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 41.97, #queue-req: 0
[2026-04-03 15:32:36 TP0] Decode batch, #running-req: 1, #full token: 26707, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 41.47, #queue-req: 0
[2026-04-03 15:32:37 TP0] Decode batch, #running-req: 1, #full token: 26747, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 40.66, #queue-req: 0
[2026-04-03 15:32:38 TP0] Decode batch, #running-req: 1, #full token: 26787, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 41.91, #queue-req: 0
[2026-04-03 15:32:39 TP0] Decode batch, #running-req: 1, #full token: 26827, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 41.15, #queue-req: 0
[2026-04-03 15:32:40] INFO: 100.109.56.33:59816 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-03 15:32:40 TP0] Decode batch, #running-req: 1, #full token: 26867, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 42.71, #queue-req: 0
[2026-04-03 15:32:41 TP0] Decode batch, #running-req: 1, #full token: 26907, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 41.62, #queue-req: 0
[2026-04-03 15:32:42 TP0] Decode batch, #running-req: 1, #full token: 26947, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 41.73, #queue-req: 0
[2026-04-03 15:32:43 TP0] Decode batch, #running-req: 1, #full token: 26987, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 41.84, #queue-req: 0
[2026-04-03 15:32:44 TP0] Decode batch, #running-req: 1, #full token: 27027, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 40.28, #queue-req: 0
[2026-04-03 15:32:45 TP0] Decode batch, #running-req: 1, #full token: 27067, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 39.29, #queue-req: 0
[2026-04-03 15:32:46 TP0] Decode batch, #running-req: 1, #full token: 27107, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 35.12, #queue-req: 0
[2026-04-03 15:32:48 TP0] Decode batch, #running-req: 1, #full token: 27147, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 34.52, #queue-req: 0
[2026-04-03 15:32:49 TP0] Decode batch, #running-req: 1, #full token: 27187, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 34.70, #queue-req: 0
[2026-04-03 15:32:50] INFO: 100.109.56.33:38878 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-03 15:32:50 TP0] Decode batch, #running-req: 1, #full token: 27227, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 35.32, #queue-req: 0
[2026-04-03 15:32:51 TP0] Decode batch, #running-req: 1, #full token: 27267, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 36.75, #queue-req: 0
[2026-04-03 15:32:52 TP0] Decode batch, #running-req: 1, #full token: 27307, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 37.04, #queue-req: 0
[2026-04-03 15:32:53 TP0] Decode batch, #running-req: 1, #full token: 27347, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 35.37, #queue-req: 0
[2026-04-03 15:32:54 TP0] Decode batch, #running-req: 1, #full token: 27387, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 36.12, #queue-req: 0
[2026-04-03 15:32:55 TP0] Decode batch, #running-req: 1, #full token: 27427, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 36.27, #queue-req: 0
[2026-04-03 15:32:56 TP0] Decode batch, #running-req: 1, #full token: 27467, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 38.29, #queue-req: 0
[2026-04-03 15:32:57 TP0] Decode batch, #running-req: 1, #full token: 27507, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 37.97, #queue-req: 0
[2026-04-03 15:32:59 TP0] Decode batch, #running-req: 1, #full token: 27547, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 36.89, #queue-req: 0
[2026-04-03 15:33:00 TP0] Decode batch, #running-req: 1, #full token: 27587, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 37.83, #queue-req: 0
[2026-04-03 15:33:00] INFO: 100.109.56.33:41102 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-03 15:33:01 TP0] Decode batch, #running-req: 1, #full token: 27627, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 38.17, #queue-req: 0
[2026-04-03 15:33:02 TP0] Decode batch, #running-req: 1, #full token: 27667, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 37.50, #queue-req: 0
[2026-04-03 15:33:03 TP0] Decode batch, #running-req: 1, #full token: 27707, full token usage: 0.00, mamba num:
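
In case it helps anyone compare the single-node and cluster runs, here is a minimal Python sketch that pulls the token count and throughput out of "Decode batch" lines like the ones above. The regex is based only on the log text in this post, and the helper names `parse_decode_lines` and `summarize` are made up for illustration; they are not part of SGLang.

```python
import re

# Matches the "Decode batch" entries as they appear in the log above.
# [^\[]*? keeps a match from running into the next "[timestamp ...]" entry,
# so a truncated entry is skipped instead of borrowing a later throughput value.
DECODE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) TP(?P<rank>\d+)\] Decode batch, "
    r"[^\[]*?#full token: (?P<tokens>\d+),"
    r"[^\[]*?gen throughput \(token/s\): (?P<tps>\d+(?:\.\d+)?)"
)


def parse_decode_lines(text):
    """Return (timestamp, full_token_count, throughput) for each complete entry."""
    return [
        (m.group("ts"), int(m.group("tokens")), float(m.group("tps")))
        for m in DECODE_RE.finditer(text)
    ]


def summarize(rows):
    """Summarize parsed rows: entry count, throughput stats, tokens generated."""
    if not rows:
        return {"count": 0}
    tps = [r[2] for r in rows]
    return {
        "count": len(rows),
        "mean_tps": sum(tps) / len(tps),
        "min_tps": min(tps),
        "max_tps": max(tps),
        # Growth of "#full token" between the first and last parsed entry.
        "tokens_generated": rows[-1][1] - rows[0][1],
    }
```

Feeding it the pasted log text, for example `summarize(parse_decode_lines(open("master.log").read()))` with a hypothetical `master.log` file, gives a quick view of how long a single request has been decoding, which is one way to spot a generation that never stops.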