One of Four DGX Sparks Shows ~35% Lower NCCL Bandwidth — Can't Figure Out Why

alan.dang · February 17, 2026, 10:54pm

I definitely see a drop. I used to hit 45 GB/s bidirectional and now it’s more like 30-35 (bidirectional). That’s consistent with comments of going from 24 to 16 GB/s unidirectional.

Interestingly, for really small transfers like 2 or 4MB, the new firmware is occasionally faster. But for 128MB+ it stalls and stops getting faster as the batch size increases.

eugr · February 18, 2026, 5:30pm

Yep, seeing significant slowdowns in TCP/IP on my system too. Transferring the same-size Docker container from one node to another now takes almost twice as long.

alan.dang · February 18, 2026, 5:46pm

@NVES have you been able to reproduce the problem in your labs? It seems that if I use ib_write_bw, assign IP addresses to both interfaces in software, and make sure the MTU is set to a large value, I can get back to near wire speed. I don’t know if the new firmware accidentally dropped the MTU down to 1500 or if something about the internal PCI Express bridging got disrupted by the new power management feature. I have not been able to get NCCL to work at the same speed as before. I’m still getting that 25 to 30% drop in performance when running the same code and hoping that NCCL will detect the multiple interfaces and use them.

Is it possible to have the software driver power the interface down when the interface is disabled and power it back up, ready to go, when the interface is enabled? I can definitely see how people would like to power down their ConnectX-7 interface when it is unused to reduce heat and power consumption.

I did not have time to play with it much so I don’t know if the MTU alone might explain some of the problems that have been reported. I never really checked on the old firmware what my MTU was set to. The current release seems to have it at 1500.
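For anyone wanting to rule the MTU in or out, here is a minimal sketch that reads it from sysfs. It assumes a Linux host that exposes the value at /sys/class/net/<iface>/mtu. The names `check_mtu`, `SYSFS_NET`, and `WANT_MTU` are hypothetical and used only for illustration, and the 9000 threshold is just a common jumbo-frame value, not an NVIDIA recommendation.

```shell
#!/usr/bin/env bash
# Report the MTU of each interface given on the command line and flag
# any that are below a jumbo-frame threshold.
# Assumption: Linux sysfs layout (/sys/class/net/<iface>/mtu).
# SYSFS_NET and WANT_MTU are illustrative knobs, not standard variables.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"
WANT_MTU="${WANT_MTU:-9000}"

check_mtu() {
    local iface="$1" mtu
    if [ ! -r "$SYSFS_NET/$iface/mtu" ]; then
        echo "$iface: not found"
        return 2
    fi
    mtu="$(cat "$SYSFS_NET/$iface/mtu")"
    if [ "$mtu" -lt "$WANT_MTU" ]; then
        echo "$iface: mtu $mtu (below $WANT_MTU)"
        return 1
    fi
    echo "$iface: mtu $mtu"
}

for iface in "$@"; do
    check_mtu "$iface"
done
```

Saved as a script, it could be run with the interface name from the thread, for example `./check_mtu.sh enp1s0f0np0`. As later replies show, a 9000 MTU alone did not restore the old bandwidth, so this only eliminates one variable.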

Balaxxe · February 18, 2026, 5:56pm

same, I regret updating my OEM to be on parity with FE.

think in the future I’m just going to let you guys be the guinea pigs lol.

eugr · February 18, 2026, 5:58pm

I have MTU set to 9000 and still seeing slowdowns in both NCCL and TCP/IP file transfers.

alan.dang · February 18, 2026, 6:26pm

NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0
NCCL_IB_GID_INDEX=2
mpirun -np 2 -H 10.0.0.11:1,10.0.0.10:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  --mca btl_tcp_if_include enp1s0f0np0 \
  --mca oob_tcp_if_include enp1s0f0np0 \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  -x NCCL_SOCKET_IFNAME=enp1s0f0np0 \
  -x NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 \
  -x NCCL_IB_GID_INDEX=2 \
  $HOME/nccl-tests/build/all_gather_perf -b 128M -e 2G -f 2

What happens if you run this? I was able to get this to hit faster speeds, but in my own code, which I have to debug more carefully, I got the 25-30% drop in performance compared to when I tested it with the old DGX OS.

For what it’s worth, I booted to the old 6.14 kernel and it didn’t change anything, so it’s likely due to the new driver and power management feature (?).

I am also running NCCL 2.28.3 in case that matters. Haven’t rebuilt from the latest.

s0ne · February 18, 2026, 7:47pm

When I first posted on the forum, I had already tried NCCL version 2.29.3-1, but nothing changed.

eugr · February 18, 2026, 8:47pm

Same here.

NVES · February 21, 2026, 12:46am

Thank you for being so patient. Our engineering team has identified the root cause, and a firmware update to address the issue will be released soon.

s0ne · February 21, 2026, 2:01am

Thank you, I hope the firmware will be released soon!

s0ne · February 27, 2026, 1:11am

Do you have any plans for when the improved firmware will be released? Of course, there are other issues to address, but this bandwidth issue seems to be causing a lot of pain. For example, copying files to other nodes takes too much time.

aniculescu · March 12, 2026, 3:31pm

We just released a software update today to address the CX7 bandwidth issue. This update will only apply to Founders Edition Sparks.

eugr · March 12, 2026, 9:20pm

I ran tests, and the speed is back to normal! Thanks!

s0ne · March 13, 2026, 2:03am

Since I only have one NVIDIA FE and three MSI variants, I guess I’ll have to wait until MSI provides new firmware.

Ank-Chy · March 14, 2026, 9:29am

I have two Founders Edition units, with the latest software updates on both. Still getting ~13 Gbps. How do I get the specific update with the fix (the March 12 FE update)? apt dist-upgrade and fwupdmgr upgrade both show nothing available.

cosinus · March 14, 2026, 4:38pm

What firmware version does ethtool -i show for the mlx NICs?

Just updated my Asus. There were also new firmware versions in that update (one for the power supply, it seems), but the Mellanox firmware still seems to be 28.45.4028.

But maybe the Mellanox firmware wasn’t the problem…
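To compare NIC firmware across several nodes, the relevant field can be pulled out of the `ethtool -i` output. This is a minimal sketch; `fw_version` is a hypothetical helper name, and the sample input is the output posted further down in this thread.

```shell
# Print the firmware-version field from `ethtool -i` output read on stdin.
# fw_version is a hypothetical helper name, not part of ethtool.
fw_version() {
    awk -F': ' '$1 == "firmware-version" { print $2 }'
}

# Sample input copied from output posted in this thread.
fw_version <<'EOF'
driver: mlx5_core
version: 6.17.0-1008-nvidia
firmware-version: 28.45.4028 (NVD0000000087)
EOF
# prints: 28.45.4028 (NVD0000000087)
```

On a live node this would be something like `ethtool -i enp1s0f0np0 | fw_version`, using the interface name from the mpirun command earlier in the thread.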

cosinus · March 14, 2026, 4:44pm

Did you cut the power after the update? As in “disconnect the power supply” after the update.

These beasts also have some kind of “smart” controllers of their own in the power supplies, and updates seem to require disconnecting from the power outlet for a few minutes to restart with the new firmware.

ivaldez · March 14, 2026, 5:57pm

I can confirm the same issue on three different Asus GX10 units. After yesterday’s firmware updates, performance is at busbw ~15.6 GB/s / algbw ~31 GB/s. I unplugged power and devices for ~5 minutes, with the same result.

ivaldez · March 14, 2026, 6:36pm

I reverted the firmware on the GX10 back to 0x03000004

busbw is now 24.3 GB/s and algbw is 48.6 GB/s. So there is still a performance regression in the current firmware (0x03000005) from Asus for the GX10.

I plan to remain on the prior version until it is resolved.
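The busbw and algbw figures in this post differ by a factor of two, which is what nccl-tests reports for all_gather on two ranks, where busbw = algbw * (n - 1) / n. Below is a minimal sketch of that conversion. It assumes a two-rank all_gather_perf run, which the post does not state explicitly, and `allgather_busbw` is a hypothetical helper name.

```shell
# Convert nccl-tests all_gather algbw (GB/s) to busbw (GB/s).
# Formula for all_gather: busbw = algbw * (n - 1) / n, n = number of ranks.
# allgather_busbw is a hypothetical helper name used for illustration.
allgather_busbw() {
    awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.1f\n", algbw * (n - 1) / n }'
}

allgather_busbw 48.6 2   # prints 24.3 (firmware 0x03000004)
allgather_busbw 31 2     # prints 15.5 (firmware 0x03000005)
```

The second result is close to the ~15.6 GB/s busbw reported on the newer firmware, so the two metrics tell the same story: roughly a one-third drop between firmware versions.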

eugr · March 14, 2026, 11:19pm

driver: mlx5_core
version: 6.17.0-1008-nvidia
firmware-version: 28.45.4028 (NVD0000000087)

AFAIK, the firmware update was only released for FE units.
