One of Four DGX Sparks Shows ~35% Lower NCCL Bandwidth — Can't Figure Out Why

I have 1x NVIDIA DGX Spark 4TB (“spark-00d2”) and 3x MSI EdgeXpert 4TB, all connected via a MikroTik CRS804-4DDQ switch with 400G→2×200G breakout cables.

spark-00d2 consistently shows ~16 GB/s busbw in NCCL all_gather, while all the other units reach ~23 GB/s. The 3 healthy nodes score 22.84 GB/s together, but adding spark-00d2 drags the 4-node result down to 17.07 GB/s. (I ran the NCCL comm test from NCCL for Two Sparks | DGX Spark.)

Pair                            busbw (GB/s)
Healthy ↔ Healthy               22.83 – 22.87
spark-00d2 ↔ any node           15.81 – 15.88
3-node (without spark-00d2)     22.84
4-node (with spark-00d2)        17.07
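For context on how these numbers are derived: nccl-tests reports bus bandwidth, which for all_gather is the algorithm bandwidth scaled by (n−1)/n, where n is the number of ranks. A minimal sketch of that arithmetic (the buffer size and timing below are illustrative, not from my runs):

```python
def all_gather_busbw(bytes_total, time_s, n_ranks):
    """Compute algbw and busbw the way nccl-tests does for all_gather.

    bytes_total: total output buffer size in bytes
    time_s:      measured operation time in seconds
    n_ranks:     number of ranks in the communicator
    """
    algbw = bytes_total / time_s / 1e9        # GB/s
    busbw = algbw * (n_ranks - 1) / n_ranks   # all_gather correction factor
    return algbw, busbw

# Illustrative: a 1 GiB all_gather across 4 ranks taking 50 ms
algbw, busbw = all_gather_busbw(1 << 30, 0.050, 4)
```

So a 4-node busbw of ~17 GB/s implies the slow node is bottlenecking every ring step, not just its own traffic.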

What I’ve Ruled Out

  • Swapped cables between ports — same result

  • Moved spark-00d2 to a different switch port — same result

  • Factory reset all 4 units — same result

  • PCIe: all nodes Speed 32GT/s, Width x4

  • Link speed: 200Gbps on all interfaces
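To compare the PCIe numbers across all four nodes without eyeballing terminal output, one option is to script a check over the LnkSta line that sudo lspci -vv prints for the ConnectX device. A hypothetical parsing sketch (the sample line is typical lspci output, not captured from my units):

```python
import re

# Sample LnkSta line in the format lspci -vv prints (illustrative)
sample = "LnkSta: Speed 32GT/s (ok), Width x4 (ok)"

def parse_lnksta(line):
    """Extract (speed in GT/s, lane width) from an lspci LnkSta line."""
    m = re.search(r"Speed\s+([\d.]+)GT/s.*Width\s+x(\d+)", line)
    if not m:
        raise ValueError("no LnkSta match")
    return float(m.group(1)), int(m.group(2))

speed, width = parse_lnksta(sample)
```

Running this against each node's lspci output would confirm none of them silently trained down to a lower speed or width.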

Questions

Is there a diagnostic tool to characterize internal memory/interconnect bandwidth on DGX Spark? Has anyone seen similar per-unit variation? Or am I missing something in my configuration?

Any help would be greatly appreciated.
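To show the kind of probe I have in mind: a crude host-side sketch that times a large memory copy. This only measures CPU memory bandwidth, not the GPU or C2C paths, so it is a rough sanity check at best:

```python
import time
import numpy as np

def host_copy_bandwidth(n_bytes=64 * 1024 * 1024, repeats=5):
    """Rough host memory bandwidth estimate (GB/s) from timed numpy copies."""
    src = np.ones(n_bytes, dtype=np.uint8)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = src.copy()          # one read + one write of n_bytes
        best = min(best, time.perf_counter() - t0)
    return n_bytes / best / 1e9

bw = host_copy_bandwidth()
```

If something similar existed for the GPU-side and NIC-side paths on DGX Spark, it would make it much easier to localize where spark-00d2 loses its ~35%.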

Are they on the same kernel version?
I’ve noticed a performance regression too: I’m getting 16 GB/s busbw where I was getting 24 GB/s before… I wonder if it’s related to the 6.17 kernel upgrade.

They are all on the same version, 6.17.

If the NVIDIA people here don’t respond, I’m considering contacting NVIDIA technical support directly. On the previous kernel, 6.14, I could get 200 Gbps when I directly connected the NVIDIA FE and an MSI unit, but after the update to 6.17 even two directly connected units show poor performance. I’m not sure this is caused by the 6.17 kernel update, though, because the issue only occurs on the NVIDIA FE model and not on the other three MSI models. As far as I know, every variant uses the NVIDIA GB10… I don’t understand why this issue only occurs on the NVIDIA model. For reference, all MSI units have been factory reset using NVIDIA recovery media.

I have two FE models and am seeing the same performance degradation as you.
@johnny_nv - FYI.

I wonder if it’s related to the latest power-saving feature…

Oh, really? I thought it was just my device, but if other people are experiencing this issue too, maybe it can be fixed. As far as I remember, this issue didn’t exist with kernel 6.14. I hope the NVIDIA folks can fix it.

I thought there was something wrong with my device, so I spent almost two days here trying to fix it. But it didn’t work.

No, definitely not just you. Good thing is that it doesn’t seem to affect vLLM cluster performance much, but still…

I also ran the original MiniMax M2.5 with vLLM across the 4-node cluster and monitored bandwidth consumption on the CRS804 switch in real time. Fortunately, it only consumed a few tens of Gbps during inference and never approached the maximum link speed. Still, this issue appears to need addressing.

Thank you for flagging this. Team is investigating.


Yeah, for vLLM latency is more important than bandwidth, although latency seems to be affected as well, just not as much.


Please let me know when this issue has been resolved internally or an update has been released.

Same regression here after updating my MSI variants with new firmware.

I don’t know why you’re having trouble with the MSI variant while I’m having trouble with the NVIDIA FE. Strange.

I can report the same regression on my side as well. NCCL went down from 22 to 17 GB/s, but what’s even more concerning is that TCP/scp file transfers went down from 700+ MB/s over any of the ConnectX-7 interfaces to erratic behavior, anywhere from 15 to 240 MB/s.

I spent some serious time trying to get back to normal, including a kernel downgrade to 6.14.x, with no luck. That goes for both NCCL and TCP.

I think NVIDIA should definitely spend some time investigating this ASAP.

I have 2xFEs in my lab


I agree. The observed performance is far below the maximum speed the device is capable of according to its specifications, so I think NVIDIA needs to address this quickly.

Because the SoC firmware was updated (10500 → 10600).

It was fine before I applied it.

The Embedded Controller update and the USB power delivery firmware update most likely did not contribute.

I had to manually install pre-release firmware for my variant (but it’s what the FE got automatically).

If you want your MSI variants to match, do the following:

sudo fwupdmgr enable-remote lvfs-testing
sudo fwupdmgr refresh
sudo fwupdmgr update

@Balaxxe if you suspect the SoC firmware is the issue, downgrade to the previous version using fwupdmgr downgrade <DEVICEID>

This is the output showing the latest SoC firmware and the previous one:

elsaco@spark1:~$ fwupdmgr get-releases
0.      Cancel
1.      8c948e1db381648c8893897e4d09b7b153309991 (Embedded Controller)
2.      0681fd3882fb4fdca996e412ec249365f6e85838 (UEFI Device Firmware)
3.      a6c6b7f79c96a1cc84d9612d804675e0c3d879c4 (UEFI Device Firmware)
Choose device [0-3]: 2
NVIDIA NVIDIA_DGX_Spark
│
├─DGX Spark SoC FW System Update:
│     New version:        0x02009418
│     Remote ID:          lvfs
│     Release ID:         135015
│     Summary:            DGX Spark SoC Firmware Update
│     License:            Proprietary
│     Size:               30.4 MB
│     Created:            2026-01-13
│     Urgency:            High
│     Vendor:             NVIDIA
│     Duration:           30 seconds
│     Release Flags:      • Trusted metadata
│     Description:
│     This update improves the performance and stability of the System-on-Chip Firmware including UEFI and GPU in DGX Spark
│     Checksum:           3313ea36efb7fead10e0429b530b8599cc5ec7f35c2945ac6be7a4ce21242313
│
├─DGX Spark SoC FW System Update:
│     New version:        0x02009009
│     Remote ID:          lvfs
│     Release ID:         130971
│     Summary:            DGX Spark SoC Firmware Update
│     License:            Proprietary
│     Size:               30.4 MB
│     Created:            2025-10-24
│     Urgency:            High
│     Tested by H3C:
│       Tested:           2025-11-25
│       Distribution:     ubuntu 24.04
│       Old version:      0x02008433
│       Version[fwupd]:   1.9.31
│     Vendor:             NVIDIA
│     Duration:           5 minutes
│     Release Flags:      • Trusted metadata
│                         • Is downgrade
│     Description:
│     This update improves the performance and stability of the System-on-Chip Firmware including UEFI and GPU in DGX Spark
│     Checksum:           71a880b7565f6ab4e55874acf0fc962caa2368c843bb5943376ca46d733ee9c8

On the FE units the lvfs-testing is not enabled:

elsaco@spark1:~$ cat /etc/fwupd/remotes.d/lvfs-testing.conf
[fwupd Remote]

# this remote provides metadata and firmware marked as 'testing' from the LVFS
Enabled=false
Title=Linux Vendor Firmware Service (testing)
MetadataURI=https://cdn.fwupd.org/downloads/firmware-testing.xml.zst
ReportURI=https://fwupd.org/lvfs/firmware/report
OrderBefore=lvfs
AutomaticReports=false
ApprovalRequired=false

Enabling it might be too risky!

It’s not enabled by default on the variants either - and I agree, if you don’t know what it’s doing - don’t do it.

No need for me to downgrade as my inferencing has not been affected.

My MSI variants are all on 10500, but the NVIDIA FE has a different firmware version, 0x507.

It would be good if NVIDIA worked with OEM manufacturers to unify their firmware release schedules.

I think they do, as this was released 5 days ago (for MSI), which is around the time it was first noted by an FE user.

The naming conventions are different.