One of Four DGX Sparks Shows ~35% Lower NCCL Bandwidth — Can't Figure Out Why

I have 1x NVIDIA DGX Spark 4TB (“spark-00d2”) and 3x MSI EdgeXpert 4TB, all connected via a MikroTik CRS804-4DDQ switch with 400G→2×200G breakout cables.

spark-00d2 consistently shows ~16 GB/s busbw in NCCL all_gather, while all the other units reach ~23 GB/s. The 3 healthy nodes score 22.84 GB/s together, but adding spark-00d2 drags the 4-node result down to 17.07 GB/s. (I ran the NCCL comm test from NCCL for Two Sparks | DGX Spark.)

Pair                            busbw (GB/s)
Healthy ↔ Healthy               22.83 – 22.87
spark-00d2 ↔ any node           15.81 – 15.88
3-node (without spark-00d2)     22.84
4-node (with spark-00d2)        17.07
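For context on how these numbers are derived: nccl-tests reports bus bandwidth, which for all_gather is the algorithm bandwidth scaled by (n−1)/n, where n is the number of ranks. A minimal sketch of that arithmetic (the buffer size and timing below are illustrative, not from my runs):

```python
def all_gather_busbw(bytes_total, time_s, n_ranks):
    """Compute algbw and busbw the way nccl-tests does for all_gather.

    bytes_total: total output buffer size in bytes
    time_s:      measured operation time in seconds
    n_ranks:     number of ranks in the communicator
    """
    algbw = bytes_total / time_s / 1e9        # GB/s
    busbw = algbw * (n_ranks - 1) / n_ranks   # all_gather correction factor
    return algbw, busbw

# Illustrative: a 1 GiB all_gather across 4 ranks taking 50 ms
algbw, busbw = all_gather_busbw(1 << 30, 0.050, 4)
```

So a 4-node busbw of ~17 GB/s implies the slow node is bottlenecking every ring step, not just its own traffic.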

What I’ve Ruled Out

  • Swapped cables between ports — same result

  • Moved spark-00d2 to a different switch port — same result

  • Factory reset all 4 units — same result

  • PCIe: all nodes Speed 32GT/s, Width x4

  • Link speed: 200Gbps on all interfaces
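To compare the PCIe numbers across all four nodes without eyeballing terminal output, one option is to script a check over the LnkSta line that sudo lspci -vv prints for the ConnectX device. A hypothetical parsing sketch (the sample line is typical lspci output, not captured from my units):

```python
import re

# Sample LnkSta line in the format lspci -vv prints (illustrative)
sample = "LnkSta: Speed 32GT/s (ok), Width x4 (ok)"

def parse_lnksta(line):
    """Extract (speed in GT/s, lane width) from an lspci LnkSta line."""
    m = re.search(r"Speed\s+([\d.]+)GT/s.*Width\s+x(\d+)", line)
    if not m:
        raise ValueError("no LnkSta match")
    return float(m.group(1)), int(m.group(2))

speed, width = parse_lnksta(sample)
```

Running this against each node's lspci output would confirm none of them silently trained down to a lower speed or width.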

Questions

Is there a diagnostic tool to characterize internal memory/interconnect bandwidth on DGX Spark? Has anyone seen similar per-unit variation? Or am I missing something in my configuration?

Any help would be greatly appreciated.
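To show the kind of probe I have in mind: a crude host-side sketch that times a large memory copy. This only measures CPU memory bandwidth, not the GPU or C2C paths, so it is a rough sanity check at best:

```python
import time
import numpy as np

def host_copy_bandwidth(n_bytes=64 * 1024 * 1024, repeats=5):
    """Rough host memory bandwidth estimate (GB/s) from timed numpy copies."""
    src = np.ones(n_bytes, dtype=np.uint8)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = src.copy()          # one read + one write of n_bytes
        best = min(best, time.perf_counter() - t0)
    return n_bytes / best / 1e9

bw = host_copy_bandwidth()
```

If something similar existed for the GPU-side and NIC-side paths on DGX Spark, it would make it much easier to localize where spark-00d2 loses its ~35%.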

Are they on the same kernel version?
I’ve noticed a performance regression too: I’m getting 16 GB/s busbw where I was getting 24 GB/s before… I wonder if it’s related to the 6.17 kernel upgrade.

They are all on the same version, 6.17.

If the NVIDIA people here don’t respond, I’m considering contacting NVIDIA technical support directly. On the previous kernel, 6.14, I could get 200 Gbps when I directly connected the NVIDIA FE and an MSI unit, but after the update to 6.17 even two directly connected units show poor performance. I’m not sure this is caused by the 6.17 kernel update, though, because the issue only occurs on the NVIDIA FE model and not on the other three MSI models. As far as I know, every variant uses the NVIDIA GB10… I don’t understand why this issue only occurs on the NVIDIA model. For reference, all MSI units have been factory reset using NVIDIA recovery media.

I have two FE models and am seeing the same performance degradation as you.
@johnny_nv - FYI.

I wonder if it’s related to the latest power-saving feature…

Oh, really? I thought it was just my device, but if other people are experiencing this issue too, maybe it can be fixed. As far as I remember, this issue didn’t exist with kernel 6.14. I hope the NVIDIA folks can fix it.

I thought there was something wrong with my device, so I spent almost two days here trying to fix it. But it didn’t work.

No, definitely not just you. Good thing is that it doesn’t seem to affect vLLM cluster performance much, but still…

I also ran the original MiniMax M2.5 with vLLM across the 4-node cluster and monitored bandwidth consumption on the CRS804 switch in real time. Fortunately, it only consumed a few tens of Gbps during inference and never approached the maximum link speed. Still, this issue appears to need addressing.

Thank you for flagging this. Team is investigating.


Yeah, for vLLM latency is more important than bandwidth, although latency seems to be affected as well, just not as much.


Please let me know when this issue has been resolved internally or an update has been released.

Same regression here after updating my MSI variants with new firmware.

I don’t know why you’re having trouble with the MSI variant while I’m having trouble with the NVIDIA FE. Strange.

I can report the same regression on my side as well. NCCL went down from 22 to 17 GB/s, but what’s even more concerning is that TCP/scp file transfers went down from 700+ MB/s over any of the ConnectX-7 interfaces to erratic behavior, anywhere from 15 to 240 MB/s.

I spent some serious time trying to get back to normal, including a kernel downgrade to 6.14.x, with no luck. That goes for both NCCL and TCP.

I think NVIDIA should definitely spend some time investigating this ASAP.

I have 2xFEs in my lab


I agree. The observed performance is far below the maximum speed the device is capable of according to its specifications, so I think NVIDIA needs to address this quickly.

Because the SoC firmware was updated (10500 → 10600).

It was fine before I applied it.

The Embedded Controller update and the USB power delivery firmware update most likely did not contribute.

I had to manually install pre-release firmware for my variant (but it’s what the FE got automatically).

If you want your MSI variants to match, do the following:

sudo fwupdmgr enable-remote lvfs-testing
sudo fwupdmgr refresh
sudo fwupdmgr update

@Balaxxe if you suspect the SoC firmware is the issue, downgrade to the previous version using fwupdmgr downgrade <DEVICEID>

This is the output showing the latest SoC firmware and the previous one:

elsaco@spark1:~$ fwupdmgr get-releases
0.      Cancel
1.      8c948e1db381648c8893897e4d09b7b153309991 (Embedded Controller)
2.      0681fd3882fb4fdca996e412ec249365f6e85838 (UEFI Device Firmware)
3.      a6c6b7f79c96a1cc84d9612d804675e0c3d879c4 (UEFI Device Firmware)
Choose device [0-3]: 2
NVIDIA NVIDIA_DGX_Spark
│
├─DGX Spark SoC FW System Update:
│     New version:        0x02009418
│     Remote ID:          lvfs
│     Release ID:         135015
│     Summary:            DGX Spark SoC Firmware Update
│     License:            Proprietary
│     Size:               30.4 MB
│     Created:            2026-01-13
│     Urgency:            High
│     Vendor:             NVIDIA
│     Duration:           30 seconds
│     Release Flags:      • Trusted metadata
│     Description:
│     This update improves the performance and stability of the System-on-Chip Firmware including UEFI and GPU in DGX Spark
│     Checksum:           3313ea36efb7fead10e0429b530b8599cc5ec7f35c2945ac6be7a4ce21242313
│
├─DGX Spark SoC FW System Update:
│     New version:        0x02009009
│     Remote ID:          lvfs
│     Release ID:         130971
│     Summary:            DGX Spark SoC Firmware Update
│     License:            Proprietary
│     Size:               30.4 MB
│     Created:            2025-10-24
│     Urgency:            High
│     Tested by H3C:
│       Tested:           2025-11-25
│       Distribution:     ubuntu 24.04
│       Old version:      0x02008433
│       Version[fwupd]:   1.9.31
│     Vendor:             NVIDIA
│     Duration:           5 minutes
│     Release Flags:      • Trusted metadata
│                         • Is downgrade
│     Description:
│     This update improves the performance and stability of the System-on-Chip Firmware including UEFI and GPU in DGX Spark
│     Checksum:           71a880b7565f6ab4e55874acf0fc962caa2368c843bb5943376ca46d733ee9c8

On the FE units the lvfs-testing is not enabled:

elsaco@spark1:~$ cat /etc/fwupd/remotes.d/lvfs-testing.conf
[fwupd Remote]

# this remote provides metadata and firmware marked as 'testing' from the LVFS
Enabled=false
Title=Linux Vendor Firmware Service (testing)
MetadataURI=https://cdn.fwupd.org/downloads/firmware-testing.xml.zst
ReportURI=https://fwupd.org/lvfs/firmware/report
OrderBefore=lvfs
AutomaticReports=false
ApprovalRequired=false

Enabling it might be too risky!

It’s not enabled by default on the variants either - and I agree, if you don’t know what it’s doing - don’t do it.

No need for me to downgrade as my inferencing has not been affected.

My MSI variants are all on 10500, but the NVIDIA FE has a different firmware version, 0x507.

It would be good if NVIDIA worked with OEM manufacturers to unify their firmware release schedules.

I think they do, as this was released 5 days ago (for MSI), which is around the time it was first noted by an FE user.

The naming conventions are different.