I have 1x NVIDIA DGX Spark 4TB (“spark-00d2”) and 3x MSI EdgeXpert 4TB, all connected via a MikroTik CRS804-4DDQ switch with 400G→2×200G breakout cables.
spark-00d2 consistently shows ~16 GB/s busbw in NCCL all_gather, while all other units reach ~23 GB/s. The 3 healthy nodes score 22.84 GB/s together, but adding spark-00d2 drags the 4-node result down to 17.07 GB/s. (I ran the NCCL communication test following “NCCL for Two Sparks | DGX Spark”.)
| Configuration | busbw (GB/s) |
| --- | --- |
| Healthy ↔ Healthy | 22.83 ~ 22.87 |
| spark-00d2 ↔ Any node | 15.81 ~ 15.88 |
| 3-node (without spark-00d2) | 22.84 |
| 4-node (with spark-00d2) | 17.07 |
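To put these numbers next to the 200 Gbps link rate, here is a rough arithmetic sketch. It assumes busbw is reported in decimal GB/s (so 1 GB/s = 8 Gbps) and uses the midpoints of the measured ranges for the pair tests; the function names are made up for illustration, and comparing busbw directly against link speed is only an approximation.

```python
LINK_GBPS = 200.0  # link speed reported on all interfaces


def busbw_to_gbps(busbw_gb_per_s: float) -> float:
    """Convert GB/s to Gbps, assuming decimal units (1 byte = 8 bits)."""
    return busbw_gb_per_s * 8.0


def link_utilization(busbw_gb_per_s: float, link_gbps: float = LINK_GBPS) -> float:
    """Fraction of the nominal link rate that the measured busbw corresponds to."""
    return busbw_to_gbps(busbw_gb_per_s) / link_gbps


# Measured values from the results above; pair tests use range midpoints.
results = {
    "healthy <-> healthy": (22.83 + 22.87) / 2,
    "spark-00d2 <-> any node": (15.81 + 15.88) / 2,
    "3-node (without spark-00d2)": 22.84,
    "4-node (with spark-00d2)": 17.07,
}

for name, busbw in results.items():
    print(f"{name:30s} {busbw:6.2f} GB/s = {busbw_to_gbps(busbw):6.1f} Gbps "
          f"({link_utilization(busbw):.0%} of link)")

# Relative shortfall of a spark-00d2 pair compared with a healthy pair.
drop = 1.0 - results["spark-00d2 <-> any node"] / results["healthy <-> healthy"]
print(f"spark-00d2 pair shortfall vs healthy pair: {drop:.0%}")
```

Under those assumptions, the healthy pairs sit fairly close to the link rate while the spark-00d2 pairs fall well short of it.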
What I’ve Ruled Out
Swapped cables between ports — same result
Moved spark-00d2 to a different switch port — same result
Factory reset all 4 units — same result
PCIe: all nodes Speed 32GT/s, Width x4
Link speed: 200Gbps on all interfaces
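For anyone who wants to repeat the PCIe and link-speed checks on each node, here is a minimal sketch that reads the values from sysfs. It assumes a Linux system where the NIC is a PCIe device; the helper names (`read_sysfs`, `nic_report`) are hypothetical. The `speed` file reports Mbit/s, so a 200 Gbps link should show `200000`.

```python
from pathlib import Path
from typing import Dict, Optional


def read_sysfs(path: Path) -> Optional[str]:
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        # Missing file, or link down (reading 'speed' can then fail).
        return None


def nic_report(sys_root: Path = Path("/sys")) -> Dict[str, Dict[str, Optional[str]]]:
    """Collect link speed and PCIe link status for interfaces backed by a device."""
    report: Dict[str, Dict[str, Optional[str]]] = {}
    net_dir = sys_root / "class" / "net"
    if not net_dir.is_dir():
        return report
    for iface in sorted(net_dir.iterdir()):
        dev = iface / "device"
        if not dev.exists():
            # Virtual interfaces such as 'lo' have no backing device.
            continue
        report[iface.name] = {
            "speed_mbps": read_sysfs(iface / "speed"),
            "pcie_speed": read_sysfs(dev / "current_link_speed"),
            "pcie_width": read_sysfs(dev / "current_link_width"),
        }
    return report


for name, info in nic_report().items():
    print(name, info)
```

Running this on every node and comparing the output side by side makes it easy to spot an interface that negotiated a lower speed or a narrower PCIe link than its peers.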
Questions
Is there a diagnostic tool to characterize internal memory/interconnect bandwidth on DGX Spark? Has anyone seen similar per-unit variation? Or am I missing something in my configuration?
Are they on the same kernel version?
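One way to check this across all nodes is sketched below. It assumes passwordless SSH to each node and that `uname -r` is available there; the helper names are hypothetical, and any host names other than spark-00d2 are placeholders to replace with your own.

```python
import subprocess
from collections import defaultdict
from typing import Dict, List


def kernel_release(host: str, timeout: float = 10.0) -> str:
    """Return the output of `uname -r` on a remote host via ssh."""
    result = subprocess.run(
        ["ssh", host, "uname", "-r"],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout.strip()


def group_by_kernel(releases: Dict[str, str]) -> Dict[str, List[str]]:
    """Group host names by kernel release string."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for host, release in releases.items():
        groups[release].append(host)
    return {release: sorted(hosts) for release, hosts in groups.items()}


# Example usage (placeholder host names, not run automatically):
#   hosts = ["spark-00d2", "node-a", "node-b", "node-c"]
#   releases = {h: kernel_release(h) for h in hosts}
#   for release, members in group_by_kernel(releases).items():
#       print(release, members)
```

If the result has more than one group, the nodes are not all on the same kernel.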
I’ve noticed a performance regression too: I’m getting 16 GB/s busbw now and was getting 24 GB/s before. I wonder if it’s related to the 6.17 kernel upgrade.
If the NVIDIA people here don’t respond, I’m considering contacting NVIDIA technical support directly. On the previous kernel, 6.14, I was able to get 200 Gbps when I directly connected the NVIDIA FE and an MSI unit, but after the update to 6.17 the two directly connected units also show poor performance. I’m not sure the 6.17 kernel update is the cause, though, because the issue only occurs on the NVIDIA FE model and not on the other three MSI models. As far as I know, every variant uses the NVIDIA GB10, so I don’t understand why this only occurs on the NVIDIA model. For reference, all MSI models have been factory reset using NVIDIA Recovery media.
Oh, really? I thought it was just my device, but if other people are experiencing this issue, maybe this can be fixed. As far as I remember, this issue didn’t exist with kernel 6.14. I hope the NVIDIA folks can fix it.
I thought there was something wrong with my device, so I spent almost two days here trying to fix it. But it didn’t work.
I also ran the original MiniMax M2.5 version with vllm on the 4-node cluster and monitored the bandwidth consumption on the CRS804 switch in real time. Fortunately, it only consumed a few tens of Gbps during inference and did not reach the maximum link speed. Even so, this issue still needs to be addressed.
I can report the same regression on my side as well. NCCL went down from 22 to 17 GB/s, but what’s even more concerning is that TCP/scp file transfers went from 700+ MB/s over any of the ConnectX-7 interfaces to erratic rates anywhere between 15 and 240 MB/s.
I spent some serious time trying to get back to normal, including a kernel downgrade to 6.14.X, with no luck. This applies to both NCCL and TCP.
I think Nvidia should definitely spend some time investigating this asap.
I agree. The measured throughput is far below the maximum speed the device is capable of according to its specifications, so I think NVIDIA needs to address it quickly.
@Balaxxe if you suspect the SoC firmware is the issue, downgrade to the previous version using fwupdmgr downgrade <DEVICEID>
This is the output showing the latest SoC firmware and the previous one:
elsaco@spark1:~$ fwupdmgr get-releases
0. Cancel
1. 8c948e1db381648c8893897e4d09b7b153309991 (Embedded Controller)
2. 0681fd3882fb4fdca996e412ec249365f6e85838 (UEFI Device Firmware)
3. a6c6b7f79c96a1cc84d9612d804675e0c3d879c4 (UEFI Device Firmware)
Choose device [0-3]: 2
NVIDIA NVIDIA_DGX_Spark
│
├─DGX Spark SoC FW System Update:
│ New version: 0x02009418
│ Remote ID: lvfs
│ Release ID: 135015
│ Summary: DGX Spark SoC Firmware Update
│ License: Proprietary
│ Size: 30.4 MB
│ Created: 2026-01-13
│ Urgency: High
│ Vendor: NVIDIA
│ Duration: 30 seconds
│ Release Flags: • Trusted metadata
│ Description:
│ This update improves the performance and stability of the System-on-Chip Firmware including UEFI and GPU in DGX Spark
│ Checksum: 3313ea36efb7fead10e0429b530b8599cc5ec7f35c2945ac6be7a4ce21242313
│
├─DGX Spark SoC FW System Update:
│ New version: 0x02009009
│ Remote ID: lvfs
│ Release ID: 130971
│ Summary: DGX Spark SoC Firmware Update
│ License: Proprietary
│ Size: 30.4 MB
│ Created: 2025-10-24
│ Urgency: High
│ Tested by H3C:
│ Tested: 2025-11-25
│ Distribution: ubuntu 24.04
│ Old version: 0x02008433
│ Version[fwupd]: 1.9.31
│ Vendor: NVIDIA
│ Duration: 5 minutes
│ Release Flags: • Trusted metadata
│ • Is downgrade
│ Description:
│ This update improves the performance and stability of the System-on-Chip Firmware including UEFI and GPU in DGX Spark
│ Checksum: 71a880b7565f6ab4e55874acf0fc962caa2368c843bb5943376ca46d733ee9c8
On the FE units, the lvfs-testing remote is not enabled:
elsaco@spark1:~$ cat /etc/fwupd/remotes.d/lvfs-testing.conf
[fwupd Remote]
# this remote provides metadata and firmware marked as 'testing' from the LVFS
Enabled=false
Title=Linux Vendor Firmware Service (testing)
MetadataURI=https://cdn.fwupd.org/downloads/firmware-testing.xml.zst
ReportURI=https://fwupd.org/lvfs/firmware/report
OrderBefore=lvfs
AutomaticReports=false
ApprovalRequired=false
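To compare this setting across several units without reading each file by eye, a small sketch like the following could help. It assumes the remote file keeps the INI-style layout shown above, with a `[fwupd Remote]` section and an `Enabled` key; the function name `remote_enabled` is hypothetical.

```python
import configparser
from pathlib import Path


def remote_enabled(conf_text: str) -> bool:
    """Return True if the fwupd remote config text has Enabled=true."""
    # Interpolation is disabled so '%' characters in URLs cannot cause errors.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # A missing section or key is treated as "not enabled".
    return parser.getboolean("fwupd Remote", "Enabled", fallback=False)


conf_path = Path("/etc/fwupd/remotes.d/lvfs-testing.conf")
try:
    state = "enabled" if remote_enabled(conf_path.read_text()) else "disabled"
    print(f"{conf_path}: {state}")
except OSError:
    print(f"{conf_path}: not readable on this machine")
```

Running it on each FE and MSI unit shows at a glance whether they pull firmware metadata from the same set of remotes.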