Sharing two findings from our 4-node DGX Spark (GB10, SM121) cluster that may save others some time. Both relate to
running large NVFP4 MoE models in a multi-node SGLang setup over 200 GbE QSFP (ConnectX-7 NICs, MikroTik CRS812
switch, K3s Kubernetes, SR-IOV enabled).
Finding 1: RoCE vs. TCP socket — NCCL transport matters a lot
We spent time assuming our cluster was GPU-compute-bound because nvidia-smi showed 100% GPU utilization in both
configurations. That turned out to be misleading — the GPU was busy-waiting inside NCCL’s socket kernel while the
Grace ARM64 CPU processed the TCP stack. The real bottleneck was the CPU, which tops out around 2 GB/s on 200 GbE
over TCP.
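To put the ~2 GB/s figure in perspective, here is a quick back-of-the-envelope calculation. It ignores protocol overhead, so the line-rate number is an upper bound rather than something we measured.

```python
# 200 GbE line rate converted to bytes per second (decimal units, no protocol overhead).
LINK_GBIT_PER_S = 200
line_rate_gb_per_s = LINK_GBIT_PER_S / 8

# Approximate ceiling we saw with NCCL over TCP sockets.
tcp_observed_gb_per_s = 2.0

utilization = tcp_observed_gb_per_s / line_rate_gb_per_s
print(f"line rate: {line_rate_gb_per_s:.1f} GB/s, TCP path uses {utilization:.0%} of it")
```

In other words, the socket path leaves more than 90% of the link idle.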
Switching NCCL from TCP socket transport to RoCE (RDMA over Converged Ethernet) on the ConnectX-7 NICs produced
dramatic results: at n=8, internal decode throughput reaches ~105 tok/s.
This applies to any multi-node DGX Spark setup using NCCL, not just SGLang.
Three things were required to get RoCE working with SR-IOV VFs inside Kubernetes pods:
1. VF interfaces must have host-side IPv4 addresses configured (we use netplan, same IP that Multus IPAM assigns
   inside the pod). Without this, the RoCE GID table only contains link-local fe80:: entries and ibv_modify_qp fails
   with “Network is unreachable”.
2. Pods must run privileged. The host-device CNI moves the network interface into the pod namespace but does NOT
   move /dev/infiniband/* character devices. Without privileged mode, NCCL cannot open the RDMA verbs devices and
   silently falls back to socket transport: no error, just 2 GB/s.
3. NetworkAttachmentDefinitions must exist for all VF indices in use. Missing NADs cause pod scheduling failures
   that are easy to misdiagnose.
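For the first requirement, the GID table can be read from sysfs under /sys/class/infiniband/<device>/ports/<port>/gids/. The sketch below classifies GID strings; IPv4-based RoCE GIDs appear there as IPv4-mapped IPv6 addresses (::ffff:a.b.c.d). The function name and the sample values are illustrative only, not taken from our cluster.

```python
import ipaddress


def classify_gid(gid: str) -> str:
    """Classify one GID string as read from a sysfs gids/<index> file."""
    addr = ipaddress.IPv6Address(gid.strip())
    if int(addr) == 0:
        return "empty"
    if addr.ipv4_mapped is not None:
        return f"ipv4 ({addr.ipv4_mapped})"
    if addr.is_link_local:
        return "link-local"
    return "other-ipv6"


samples = [
    "fe80:0000:0000:0000:0a1b:2cff:fe3d:4e5f",  # made-up link-local entry
    "0000:0000:0000:0000:0000:ffff:c0a8:0a01",  # IPv4-mapped 192.168.10.1
    "0000:0000:0000:0000:0000:0000:0000:0000",  # unused table slot
]
for gid in samples:
    print(gid, "->", classify_gid(gid))
```

If no entry classifies as ipv4 for the VF's device, the host-side address is probably missing.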
The silent fallback in point 2 is particularly painful to debug — add NCCL_DEBUG=INFO to your pod env and look for
“IB” vs “Socket” in the transport selection log lines.
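A small sketch of that log check, in case you want to automate it. The exact wording of NCCL's INFO lines varies between versions, so the two patterns and the sample lines below are assumptions modeled on typical output, not a guaranteed format. Note that a bare "NET/IB" match is not enough, because NCCL may mention NET/IB while reporting that it found no device.

```python
import re

# Patterns for lines where NCCL reports the transport it actually selected.
USING_NETWORK = re.compile(r"NCCL INFO Using network (\w+)")
CHANNEL_VIA = re.compile(r"via NET/(\w+)")


def detect_nccl_transports(log_text: str) -> set[str]:
    """Return the inter-node transports NCCL reports having selected."""
    found = set(USING_NETWORK.findall(log_text))
    found.update(CHANNEL_VIA.findall(log_text))
    return found


def fell_back_to_socket(log_text: str) -> bool:
    return "Socket" in detect_nccl_transports(log_text)


# Illustrative log excerpt (hand-written, not captured from our pods).
sample = """\
pod-0:42:42 [0] NCCL INFO NET/IB : No device found.
pod-0:42:42 [0] NCCL INFO Using network Socket
pod-0:42:97 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0
"""
print(fell_back_to_socket(sample))
```

Running this against the pod stdout at startup turns the silent fallback into something a readiness check can catch.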
Finding 2: Qwen3.5-397B-A17B-NVFP4 running on 4× DGX Spark (TP=4, EP=1)
We have nvidia/Qwen3.5-397B-A17B-NVFP4 (397B parameters, 17B active per token, 512 experts, ~234 GB) running stably
across all four nodes. This is the largest NVFP4 MoE model we have successfully served on DGX Spark. Memory
footprint is approximately 59 GB per GPU, well within the 128 GB GB10 limit.
Best performing configuration from our tuning runs: triton MoE kernel + triton attention + fi_cutlass FP4 backend +
CUDA graphs enabled.
n=1: 22.2 tok/s
n=4: 67.2 tok/s
n=8: 101.3 tok/s
SM121 CUTLASS patch required. Out of the box, the CUTLASS FP4 JIT kernel compilation fails on SM121 because
cute/mma.py does not list sm_120a or sm_121a in BlockScaledMmaOp.admissible_archs. We apply a monkey-patch at
container startup that adds those two architecture strings. Without the patch, the fi_cutlass FP4 backend falls
back or errors out entirely.
The patch is a two-line addition to admissible_archs in the BlockScaledMmaOp class — happy to share the exact
snippet if useful.
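Until then, here is the general shape of such a patch. This is a sketch, not our exact snippet: it assumes admissible_archs is a class-level list of strings, and it uses a stand-in class so it runs on its own. In the container, the real BlockScaledMmaOp would be imported from the CUTLASS Python package instead.

```python
def extend_admissible_archs(op_cls, extra_archs=("sm_120a", "sm_121a")):
    """Append missing arch strings to op_cls.admissible_archs (safe to call twice)."""
    current = list(getattr(op_cls, "admissible_archs", []))
    for arch in extra_archs:
        if arch not in current:
            current.append(arch)
    op_cls.admissible_archs = current
    return current


# Stand-in for the real class; the initial list contents are placeholders.
class BlockScaledMmaOp:
    admissible_archs = ["sm_100a"]


extend_admissible_archs(BlockScaledMmaOp)
print(BlockScaledMmaOp.admissible_archs)
```

The patch has to run before the first FP4 kernel is JIT-compiled, which is why we apply it at container startup.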
Hope this helps other DGX Spark cluster operators. Both issues took us a while to track down and the answers were
not obvious from the error messages (or lack thereof). Happy to answer questions.
Follow-up to the original post. Same setup (4× DGX Spark, RoCE NCCL via SR-IOV VFs, scitrera/dgx-spark-sglang:0.5.10, nvidia/Qwen3.5-397B-A17B-NVFP4), this time EP=4 instead of EP=1. The goal was to see whether Expert Parallelism works on
GB10 and whether the MoE backends that weren’t the winner in the EP=1 matrix fare
better at EP=4. Results are surprisingly differentiated.
Same 36 permutations as the main post (3 MoE runners × 2 attention backends × 2 fp4
GEMM backends × 3 CUDA graph modes), all with tp=4, pp=1, ep=4, nnodes=4, quantization=modelopt_fp4, mem_fraction_static=0.80. All previously documented
runtime patches in sglang_launch.sh active (cute/mma sm_120a + sm_121a
admissible_archs, cutlass_moe.py a_map/c_map zero-init + topk_weights mask,
modelopt_quant EP-aware input_scale slicing, moe_wna16 qzeros EP remapping).
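For reference, the 36-row matrix can be enumerated as below. The axis ordering is inferred from the test numbers quoted in this post (for example, Test 3 and Test 33), so treat it as an assumption rather than the harness source; the dictionary keys are illustrative names.

```python
from itertools import product

MOE_RUNNERS = ["triton", "flashinfer_cutlass", "cutlass"]
FP4_GEMM = ["flashinfer_cutlass", "flashinfer_cudnn"]
ATTENTION = ["flashinfer", "triton"]
GRAPH_MODES = ["graphs on", "eager", "graphs + piecewise"]

# product() varies the last axis fastest: the graph mode changes on every test,
# the MoE runner only every 12 tests.
matrix = {
    test_id: {"moe": moe, "fp4_gemm": fp4, "attn": attn, "graph_mode": mode}
    for test_id, (moe, fp4, attn, mode) in enumerate(
        product(MOE_RUNNERS, FP4_GEMM, ATTENTION, GRAPH_MODES), start=1
    )
}

print(len(matrix))
print(matrix[3])
print(matrix[33])
```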
Results overview (36 / 36 complete)
| MoE backend block | Tests | STABLE | FAIL |
|---|---|---|---|
| triton MoE | 1–12 | 8 | 4 (eager-mode output collapse) |
| flashinfer_cutlass MoE | 13–24 | 0 | 12 (all SM121 illegal instruction) |
| cutlass direct MoE | 25–36 | 8 | 4 (eager-mode output collapse) |
| Total | 36 | 16 | 20 |
Top 5 n=8 peak tok/s (stable rows only)
| Rank | Test | MoE | attn | fp4 GEMM | CUDA graph mode | n=8 peak (tok/s) | Δ vs EP=1 winner |
|---|---|---|---|---|---|---|---|
| 1 | 3 | triton | fi | fi_cutlass | graphs + piecewise | 98.5 | −3.4% |
| 2 | 1 | triton | fi | fi_cutlass | graphs on | 96.1 | −5.8% |
| 3 | 33 | cutlass | fi | fi_cudnn | graphs + piecewise | 95.2 | −6.7% |
| 4 | 34 | cutlass | triton | fi_cudnn | graphs on | 95.1 | −6.8% |
| 5 | 36 | cutlass | triton | fi_cudnn | graphs + piecewise | 94.6 | −7.3% |
EP=1 winner from the main post: Test 28 in the EP=1 matrix
(cutlass / triton / fi_cutlass / graphs on) at 102.0 tok/s n=8.
EP=4 consistently loses 3–7% against EP=1 on this cluster. The overhead comes
from the extra inter-node all-to-alls per MoE layer: RoCE keeps it viable, but the
per-layer collective sync cost isn’t zero. That is one less reason to wrestle with
EP>1 if EP=1 fits, but the three findings below are worth knowing about either
way.
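The Δ column is simply each row's n=8 peak relative to the 102.0 tok/s EP=1 winner. A few lines reproduce it from the numbers in the table:

```python
EP1_WINNER_TPS = 102.0  # Test 28 in the EP=1 matrix, n=8

# test id -> n=8 peak tok/s, taken from the top-5 table
ep4_top5 = {3: 98.5, 1: 96.1, 33: 95.2, 34: 95.1, 36: 94.6}


def delta_pct(tps: float, baseline: float = EP1_WINNER_TPS) -> float:
    """Relative change versus the baseline in percent, rounded to one decimal."""
    return round((tps / baseline - 1.0) * 100.0, 1)


deltas = {test: delta_pct(tps) for test, tps in ep4_top5.items()}
for test, delta in deltas.items():
    print(f"Test {test}: {delta:+.1f}%")
```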
Finding 1 — disable_cuda_graph=true is completely broken on the cutlass_moe_fp4 path (8 of 8 rows)
Every eager-mode row in this matrix (Tests 2, 5, 8, 11 in the triton-MoE
block and Tests 26, 29, 32, 35 in the cutlass-direct block) collapses after a short
warmup phase onto a single token, usually “!”. The bench harness initially
reports these as “STABLE” at a bogus 142–156 tok/s because the model races through
max_tokens=3072 in record time (all requests show identical tokens/sec, identical
thinking-token counts, and content_tokens_est=None). This is exactly the signature
that we had to manually fish out of the per-pod Loki logs for individual tests in
the EP=1 run.
The mechanism is straightforward: both MoE runners (triton and cutlass direct) route
through the same cutlass_moe_fp4 combine path under NVFP4. In graph mode, CUDA
graph capture freezes a working kernel variant and keeps replaying it unchanged, so
the output stays stable. In eager mode the dispatch re-runs on every step, and the
unpatched numerics inside the combine kernel (apply_shuffle_mul_sum) reach the output.
The monkey-patches in sglang_launch.sh (a_map/c_map zero-init,
topk_weights.masked_fill) suppress the crash but do not fix the math.
Takeaway for anyone deploying this stack: disable_cuda_graph: false
(= graphs on) is not optional for any MoE runner that falls through cutlass_moe_fp4. The only way to use eager mode safely would be a proper upstream
fix.
Finding 2 — flashinfer_cutlass MoE crashes uniformly on SM121 / Blackwell GB10 (12 of 12 rows)
The entire fi_cutlass MoE region (Tests 13–24: all 3 graph modes × 2 attention
backends × 2 fp4 GEMM backends) is unusable at EP=4 on this cluster. Every row
shows the same failure pattern:
n=1 runs cleanly and delivers a full coherent response at ~19 tok/s (verified
via Loki pod-stdout — real TCP-vs-UDP / CAP theorem / cryptography briefs).
n=4 or n=8 then kills a random rank (sometimes head, sometimes a worker — no
node affinity pattern).
Error at the first sync point after the crash: torch.AcceleratorError: CUDA error: an illegal instruction was encountered
(cudaErrorIllegalInstruction).
The stack trace lands in one of three places depending on timing: process_batch_result_decode → next_token_ids.tolist(), the NCCL watchdog
(ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal), or directly at the
CUDA graph replay boundary (torch/cuda/graphs.py:139 super().replay()). All three
are just different points-of-observation for the same async-propagated fault; the
actual offending instruction sits inside the fi_cutlass MoE forward kernel
itself.
Ruled out by elimination: the fault is not in the attention backend
(fi vs triton: identical result), not in the fp4 dense GEMM
(fi_cutlass → fi_cudnn, Test 19 still crashes), and not in the fi_cutlass
allgather (--disable-flashinfer-cutlass-moe-fp4-allgather enabled, crash persists,
only moves the sync point). The culprit must be one of the routed-expert FFN
kernels inside the flashinfer_cutlass MoE forward (gemm_swiglu, scale_and_combine,
or the EP dispatch/combine ops), compiled without an SM121 tile fix.
Concurrency-dependent: n=1 sometimes doesn’t hit the fault even after minutes of
sustained decode (Test 13 ran ~2 minutes at 62 tok/s before worker-2 died); n≥4
with thinking-heavy prompts kills a rank within seconds, before the first content
token.
What this means for anyone trying flashinfer_cutlass MoE on GB10: unusable at
EP>1 on scitrera/dgx-spark-sglang:0.5.10. A separate sglang GitHub issue with the
stack trace + repro steps + exhausted-diagnostic-switch list is in preparation.
Finding 3 — The bench-harness repetition guard is essential, but has an n=8 blind spot
The bench harness in this matrix ran a streaming repetition guard for the first
time (ngram-flood / suffix-loop / stagnation detection, thresholds: 4-gram,
max-count 8 + ratio 14%, suffix window 600 chars). At n=4 it reliably fired on
every eager-mode row and marked those runs as status=repetition instead of
letting them pass as “STABLE 144 tok/s”.
At n=8 the collapse slipped past the guard in 3 of 4 cases (Tests 29, 32, 35). Those runs
get recorded as bogus “8/8 done” at 143–147 tok/s. The post-hoc garbage fingerprint
is trivial to detect (all 8 requests with tokens_per_sec identical to two decimal
places, think_tokens_est=768, content_tokens_est=None, output_tokens=3072,
finish_reason=length), but nothing flags it during the live run.
Suspected cause: the guard runs per-request inside an async generator wrapper;
with n=8 concurrent streams a race pattern may emerge where all 8 streams hit max_tokens simultaneously before the suffix-loop detector has accumulated enough
output to trigger. Not yet verified — being tested separately.
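For readers who want to build something similar, here is a minimal sketch of the n-gram flood part of such a guard. It is not the harness code: the function name is made up, and the way the thresholds combine (the top 4-gram must exceed both the count and the ratio limit) is an assumption for illustration.

```python
from collections import Counter


def ngram_flood(tokens, n=4, max_count=8, max_ratio=0.14):
    """Return True if a single n-gram dominates the token stream."""
    total = len(tokens) - n + 1
    if total <= 0:
        return False  # not enough output yet to judge
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    _, top = counts.most_common(1)[0]
    return top > max_count and top / total > max_ratio


collapsed = ["!"] * 100                       # output stuck on one token
healthy = [f"tok{i}" for i in range(100)]     # no repeated 4-grams
print(ngram_flood(collapsed), ngram_flood(healthy))
```

A detector of this kind needs a minimum amount of output before it can fire, which is consistent with the suspected timing problem at n=8.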
Recommendation for any bench infrastructure on this cluster: keep the guard,
and also run a post-hoc garbage-fingerprint check that looks for uniform stats
sets (len(set(all tps values)) == 1 and all content_tokens == None → garbage
regardless of status).
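A minimal version of that post-hoc check could look like this. The record layout (one dict per request with tokens_per_sec and content_tokens_est keys) and the sample values are assumptions for illustration.

```python
def is_garbage_run(requests: list[dict]) -> bool:
    """Flag a run whose per-request stats are uniform and carry no content tokens."""
    if not requests:
        return False
    tps_values = {round(r["tokens_per_sec"], 2) for r in requests}
    no_content = all(r.get("content_tokens_est") is None for r in requests)
    return len(tps_values) == 1 and no_content


# Illustrative records, not real measurements.
collapsed_run = [
    {"tokens_per_sec": 144.37, "content_tokens_est": None, "status": "done"}
    for _ in range(8)
]
healthy_run = [
    {"tokens_per_sec": 12.0 + i * 0.13, "content_tokens_est": 900 + i, "status": "done"}
    for i in range(8)
]
print(is_garbage_run(collapsed_run), is_garbage_run(healthy_run))
```

With a single request the uniformity test is trivially true, so at n=1 the check reduces to "no content tokens at all".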
Recommended EP=4 production config
If you need EP=4 on this cluster (e.g. because you want less VRAM per GPU to
co-locate a second model on the same 4 nodes):
# roles/k8s_dgx/model_profiles/nvidia-qwen3.5-397b-a17b-nvfp4.yml
moe_runner_backend: "triton" # NVFP4 → falls through cutlass_moe_fp4
attention_backend: "flashinfer"
fp4_gemm_backend: "flashinfer_cutlass"
disable_cuda_graph: false # MUST be false
disable_piecewise_cuda_graph: false # piecewise ON adds ~2% at n=8
cuda_graph_max_bs: 8 # bump higher if you expect bs > 8!
ep_size: 4
tp_size: 4
That’s Test 3 (98.5 tok/s n=8 peak), the winner of the EP=4 matrix.
Warning about cuda_graph_max_bs=8: with that value, every decode step with
batch > 8 falls back to eager mode, and that’s the garbage regime. If your
max_running_requests > 8 (default: 32), either raise cuda_graph_max_bs to
your maximum expected batch size or cap max_running_requests at ≤8.
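This condition is easy to check before deployment. The sketch below is a hypothetical helper (not part of SGLang or our playbooks) that takes the profile as a plain dict and uses the default of 32 mentioned above when max_running_requests is not set.

```python
def check_graph_coverage(cfg: dict, default_max_running_requests: int = 32) -> list[str]:
    """Return warnings if some decode batches could bypass CUDA graphs."""
    warnings = []
    if cfg.get("disable_cuda_graph", False):
        warnings.append("disable_cuda_graph is true: every step runs in eager mode")
    max_bs = cfg.get("cuda_graph_max_bs")
    max_running = cfg.get("max_running_requests", default_max_running_requests)
    if max_bs is not None and max_running > max_bs:
        warnings.append(
            f"max_running_requests={max_running} > cuda_graph_max_bs={max_bs}: "
            "larger batches fall back to eager mode"
        )
    return warnings


profile = {"disable_cuda_graph": False, "cuda_graph_max_bs": 8}
for message in check_graph_coverage(profile):
    print("WARNING:", message)
```

Adding max_running_requests: 8 to the profile, or raising cuda_graph_max_bs, makes the warning go away.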
The best cutlass-direct row is Test 33 (cutlass-direct MoE / fi attn / fi_cudnn fp4 /
graphs + piecewise on) at 95.2 tok/s, rank 3 overall, in case the cutlass-direct path
is preferred for operational reasons.
For the main post: EP=1 remains the performance winner
On this hardware setup EP=1 is still faster than EP=4 (102.0 vs 98.5 tok/s @ n=8).
EP=4 is viable on Qwen3.5-397B-NVFP4 but not necessary if the model runs alone
on 4 nodes and VRAM fits. The main value of this EP=4 matrix is:
1. Documented proof that EP=4 works on this 397B on GB10 with scitrera/dgx-spark-sglang:0.5.10 (16 stable rows across two different MoE
   backends).
2. Clear SM121 fault in flashinfer_cutlass MoE; upstream GitHub issue in
   preparation with full stack trace + repro steps.
3. Mechanistic explanation for the eager-mode garbage cluster: cutlass_moe_fp4 combine path + eager mode = broken output.
Full 36-row TESTLOG with per-test Loki-verified stream content and per-crash stack
traces:
Following up on the RoCE and EP-tuning work above, we ran a full configuration matrix on GLM-4.7-NVFP4 (358B total / ~58B active, 160 experts, NVFP4 quantized) on the same 4× DGX Spark cluster. It is somewhat smaller than Qwen3.5-397B-A17B in total parameters but activates far more per token (~58B vs. 17B).
Setup
Same cluster as before: 4× DGX Spark (GB10/SM121, 128 GB unified memory each), K3s, RoCE over SR-IOV VFs on the QSFP mesh, SGLang v0.5.10. The model runs at TP=4 EP=1 with modelopt_fp4 quantization and fp8_e4m3 KV cache (~54 GB/GPU).
We tested 38 configurations across three MoE runners (triton, flashinfer_cutlass, cutlass-direct), two attention backends (flashinfer, triton), two FP4 GEMM backends (flashinfer_cutlass, flashinfer_cudnn), CUDA graph modes, and — new in this round — MTP (Multi-Token Prediction) via SGLang’s built-in NEXTN speculative decoding.
Key result: MTP delivers a massive throughput boost
GLM-4.7-NVFP4 ships with a single-layer MTP head, making it a natural candidate for SGLang’s NEXTN speculative algorithm. Here’s the before/after:
| Config | n=1 tok/s | n=4 tok/s | n=8 tok/s | n=1 TTFT |
|---|---|---|---|---|
| Best without MTP (cutlass MoE + fi attn + CG on) | 14.5 | 40.6 | 60.0 | 0.63 s |
| With MTP, triton MoE (Test 37) | 24.4 | 54.6 | 77.6 | 0.70 s |
| With MTP, cutlass MoE (Test 38) | 22.8 | 53.2 | 81.1 | 0.71 s |
| MTP speedup | +68% | +35% | +35% | ~same |
At n=1 concurrency, MTP boosts single-request throughput from 14.5 to 24.4 tok/s — a 68% improvement with essentially no TTFT penalty (0.70 s vs 0.63 s). At n=8, aggregate throughput jumps from 60.0 to 81.1 tok/s (+35%).
The MTP configuration: --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-num-draft-tokens 4. No separate draft model needed — the MTP head is part of the model itself.
Full matrix findings (38 tests)
Out of 38 configurations, 17 passed all concurrency levels (n=1, n=4, n=8 with 0 failed requests). The three failure modes were fully diagnosed:
1. Piecewise CUDA graphs are broken for NVFP4 models (0/12 stable). SGLang’s piecewise graph capture runs torch.compile over the forward, which traces into FlashInfer’s fp4_quantize. This function has no fake-tensor (meta) implementation, so Dynamo fails with "Cannot access data pointer of FakeTensor". Workaround: --disable-piecewise-cuda-graph (mandatory for all NVFP4 models on SM121).
2. SM121 JIT architecture mismatch in SGLang’s kvcache kernel. SGLang’s TVM-FFI JIT compiles with -gencode=arch=compute_121,code=sm_121 instead of the correct sm_120f family target. The resulting SASS contains instructions GB10 cannot decode → cudaErrorIllegalInstruction during CUDA graph capture. This affects any kv_cache_dtype=fp8_e4m3 config with CUDA graphs enabled. We traced it to sglang/jit_kernel/utils.py:_init_jit_cuda_arch_once() using torch.cuda.get_device_capability() without the family f-suffix.
3. flashinfer_cutlass MoE + flashinfer attention interaction causes cudaErrorIllegalInstruction at decode time on SM121. Switching to triton attention with the same MoE runner resolves it.
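To make the second failure mode concrete, the sketch below shows the difference between the two flag variants. It is a hypothetical helper, not SGLang's code; it takes the (major, minor) tuple that torch.cuda.get_device_capability() returns, and the family naming rule (major, then "0f") is our reading of the sm_120f target named above.

```python
def gencode_flag(major: int, minor: int, use_family: bool = True) -> str:
    """Build an nvcc -gencode flag from a (major, minor) compute capability."""
    if use_family:
        arch = f"{major}0f"          # family target, e.g. 120f for capability 12.1
    else:
        arch = f"{major}{minor}"     # exact target, e.g. 121
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"


capability = (12, 1)  # what GB10 reports
print(gencode_flag(*capability, use_family=False))  # the flag that broke for us
print(gencode_flag(*capability))                    # the family target
```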
Addendum: MTP speculative decoding — up to +86% single-request throughput
Enabled Multi-Token Prediction (MTP) via SGLang’s built-in NEXTN speculative algorithm on both models. Both nvidia/Qwen3.5-397B-A17B-NVFP4 and nvidia/GLM-4.7-NVFP4 ship with a single-layer MTP head, so no separate draft model is needed — the model speculates from its own weights.
Qwen3.5-397B-A17B-NVFP4 — new winner: 40 tok/s at n=1
| Config | n=1 tok/s | n=4 tok/s | n=8 tok/s |
|---|---|---|---|
| Previous best (Test 28, no MTP) | 21.5 | 67.8 | 102.0 |
| MTP, cutlass MoE (Test 38) | 40.0 | 84.3 | 110.9 |
| MTP, triton MoE (Test 37) | 35.3 | 80.6 | 106.1 |
| Speedup (Test 38 vs 28) | +86% | +24% | +9% |
40 tok/s on a single request to a 397B-parameter model across 4× DGX Spark — nearly double the non-MTP baseline. The speedup is largest at low concurrency where the GPU has spare compute for draft verification; at n=8 the gain narrows to +9% as batching already saturates the pipeline.
Interesting: with MTP enabled, cutlass-direct MoE pulls ahead of triton MoE by ~13% at n=1 (40.0 vs 35.3). Without MTP they were within noise. The speculative path’s tighter decode loop seems to favor the lower-overhead dispatch.
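The percentages follow directly from the table; this short snippet recomputes them from the numbers above.

```python
baseline = {1: 21.5, 4: 67.8, 8: 102.0}      # Test 28, no MTP (tok/s per concurrency)
mtp_cutlass = {1: 40.0, 4: 84.3, 8: 110.9}   # Test 38, MTP + cutlass MoE
mtp_triton_n1 = 35.3                         # Test 37 at n=1

speedup_pct = {n: round((mtp_cutlass[n] / baseline[n] - 1) * 100) for n in baseline}
cutlass_vs_triton_pct = round((mtp_cutlass[1] / mtp_triton_n1 - 1) * 100)

print(speedup_pct)
print(cutlass_vs_triton_pct)
```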
GLM-4.7-NVFP4 (358B/58B-active) — same pattern
We also ran the full matrix on nvidia/GLM-4.7-NVFP4 (358B total, ~58B active, 160 experts, sigmoid routing). Same cluster, same RoCE setup.
| Config | n=1 tok/s | n=4 tok/s | n=8 tok/s |
|---|---|---|---|
| Best without MTP (Test 31) | 14.5 | 40.6 | 60.0 |
| MTP, cutlass MoE (Test 38) | 22.8 | 53.2 | 81.1 |
| MTP, triton MoE (Test 37) | 24.4 | 54.6 | 77.6 |
| Speedup (best MTP vs non-MTP) | +68% | +35% | +35% |
GLM-4.7 sees an even bigger relative gain at n=8 (+35% vs Qwen3.5’s +9%) — likely because GLM-4.7’s ~58B active params leave more GPU headroom for the draft tokens to be useful under batching.
What it took to enable MTP
For Qwen3.5 specifically, SGLang v0.5.10 requires two additional settings because the model uses hybrid attention (15 full GQA layers + 45 linear attention layers), which SGLang classifies under its “mamba” scheduler path:
Without these, SGLang refuses to start with: ValueError: Speculative decoding for Qwen3_5MoeForConditionalGeneration is not compatible with radix cache when using --mamba-scheduler-strategy no_buffer. GLM-4.7 does not need these (standard GQA throughout).
mem_fraction_static was reduced from 0.80 to 0.75 (Qwen3.5) / 0.60 (GLM-4.7) to leave KV cache headroom for the draft tokens.
Combined gains: EP=4 socket baseline → EP=1 RoCE + MTP
| | Qwen3.5-397B | GLM-4.7 |
|---|---|---|
| EP=4 + socket (old) | 20.8 tok/s (n=4) | 20.8 tok/s (n=4) |
| EP=1 + RoCE (no MTP) | 67.8 tok/s (n=4) | 40.6 tok/s (n=4) |
| EP=1 + RoCE + MTP | 84.3 tok/s (n=4) | 54.6 tok/s (n=4) |
| Total speedup | 4.1× | 2.6× |
Three orthogonal wins stacking: RoCE transport (~2×), EP topology (EP=1 avoids dispatch overhead), and MTP speculative decoding (+24–35%). No hardware changes, no model changes — pure software/config.