One of Four DGX Sparks Shows ~35% Lower NCCL Bandwidth — Can't Figure Out Why

I definitely see a drop. I used to hit 45 GB/s bidirectional and now it’s more like 30-35 GB/s. That’s consistent with other reports of a drop from 24 to 16 GB/s unidirectional.

Interestingly, for really small transfers like 2 or 4 MB, the new firmware is occasionally faster. But at 128 MB and above it stalls: bandwidth stops improving as the transfer size increases.
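
For reference, the relative drop implied by the figures above can be worked out with a few lines of shell. This is only a sketch of the arithmetic; `pct_drop` is a hypothetical helper name, not part of any NVIDIA tooling.

```shell
# pct_drop OLD NEW: print the percentage decrease from OLD to NEW (one decimal place).
# Hypothetical helper for illustration only.
pct_drop() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

pct_drop 24 16   # unidirectional: 24 -> 16 GB/s
pct_drop 45 35   # bidirectional, upper end of the reported range
pct_drop 45 30   # bidirectional, lower end of the reported range
```

The unidirectional figure and the lower bidirectional figure both come out at roughly a one-third drop, in line with the ~35% in the thread title.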

Yep, seeing significant slowdowns in TCP/IP on my system too. Transferring the same size Docker container from one node to another now takes almost twice as long.

@NVES have you been able to reproduce the problem in your labs? With ib_write_bw, if I assign IP addresses to both interfaces in software and make sure the MTU is set to a large value, I can get back to near wire speed. I don’t know whether the new firmware accidentally dropped the MTU down to 1500 or whether something about the internal PCI Express bridging got disrupted by the new power management feature. I have not been able to get NCCL to run at the same speed as before. I’m still getting that 25 to 30% drop in performance when running the same code and hoping that NCCL will detect the multiple interfaces and use them.

Is it possible to have the software driver power the interface down when the interface is disabled and power it back up, ready to go, when the interface is enabled? I can definitely see how people would like to power down their ConnectX-7 interface when it is unused, to reduce heat and power consumption.

I did not have time to play with it much so I don’t know if the MTU alone might explain some of the problems that have been reported. I never really checked on the old firmware what my MTU was set to. The current release seems to have it at 1500.
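
A quick way to see what the MTU is currently set to on each interface is to read it from sysfs. The sketch below assumes a Linux system that exposes `/sys/class/net/<interface>/mtu`; `show_mtu` is a hypothetical helper, and the interface name in the commented `ip link` line is the one used elsewhere in this thread, so substitute your own.

```shell
# show_mtu [NET_DIR]: print "<interface> <mtu>" for every interface found
# under NET_DIR (defaults to /sys/class/net). Hypothetical helper.
show_mtu() {
  net_dir="${1:-/sys/class/net}"
  for dev in "$net_dir"/*; do
    # Skip entries without a readable mtu file (also covers an unmatched glob).
    [ -r "$dev/mtu" ] || continue
    printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/mtu")"
  done
}

show_mtu

# To raise the MTU on one interface until the next reboot
# (interface name is an assumption taken from this thread):
# sudo ip link set dev enp1s0f0np0 mtu 9000
```

Both ends of the link need the same MTU, so it is worth running this on every node before comparing numbers.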

same, I regret updating my OEM to be on parity with FE.

think in the future I’m just going to let you guys be the guinea pigs lol.

I have MTU set to 9000 and still seeing slowdowns in both NCCL and TCP/IP file transfers.

NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0
NCCL_IB_GID_INDEX=2
mpirun -np 2 -H 10.0.0.11:1,10.0.0.10:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  --mca btl_tcp_if_include enp1s0f0np0 \
  --mca oob_tcp_if_include enp1s0f0np0 \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  -x NCCL_SOCKET_IFNAME=enp1s0f0np0 \
  -x NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 \
  -x NCCL_IB_GID_INDEX=2 \
  $HOME/nccl-tests/build/all_gather_perf -b 128M -e 2G -f 2

What happens if you run this? I was able to get this to hit faster speeds, but in my own code, which I have to debug more carefully, I got the 25-30% drop in performance compared to when I tested it with the old DGX OS.

For what it’s worth, I booted to the old 6.14 kernel and it didn’t change anything, so it’s likely due to the new driver and power management feature (?).

I am also running NCCL 2.28.3 in case that matters. Haven’t rebuilt from the latest.

When I first posted on the forum, I had already tried NCCL version 2.29.3-1, but nothing changed.

Same here.

Thank you for being so patient. Our engineering team has identified the root cause, and a firmware update to address the issue will be released soon.

Thank you, I hope the firmware will be released soon!

Do you have any plans for when the improved firmware will be released? Of course, there are other issues to address, but this bandwidth issue seems to be causing a lot of pain. For example, copying files to other nodes takes too much time.

We just released a software update today to address the CX7 bandwidth issue. This update applies only to Founders Edition Sparks.

I ran tests, and the speed is back to normal! Thanks!

Since I only have one NVIDIA FE and three MSI Variants, I guess I’ll have to wait until MSI provides new firmware.

I have two Founders Edition units, with the latest software updates on both. Still getting ~13 Gbps. How do I get the specific update with the fix (the March 12 FE update)? apt dist-upgrade and fwupdmgr upgrade both show nothing available.

What firmware version does ethtool -i for the mlx nics show you?
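
If it helps when comparing nodes, the firmware field can be pulled out of the `ethtool -i` output with a one-line filter. This is a minimal sketch: `fw_version` is a hypothetical helper, the interface name in the commented line is an assumption taken from earlier in the thread, and the sample input only mirrors the `key: value` layout that `ethtool -i` prints.

```shell
# fw_version: read `ethtool -i` output on stdin and print only the value of
# the firmware-version field. Hypothetical helper for illustration.
fw_version() {
  sed -n 's/^firmware-version: //p'
}

# On a real node (interface name is an assumption; substitute your own):
# ethtool -i enp1s0f0np0 | fw_version

# Offline demonstration with sample lines in the ethtool -i layout:
printf 'driver: mlx5_core\nversion: 6.17.0-1008-nvidia\nfirmware-version: 28.45.4028 (NVD0000000087)\n' | fw_version
```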

Just updated my Asus. There were also new firmware versions in that update (one for the power supply, it seems), but for the Mellanox NIC it still seems to be 28.45.4028.

But maybe the Mellanox firmware wasn’t the problem…

Did you cut the power after the update? As in “disconnect the power supply” after the update.

These beasts also have some kind of “smart” controller of their own in the power supply, and updates seem to require disconnecting from the power outlet for a few minutes so that it restarts with the new firmware.

I can confirm the same issue on 3 different Asus GX10 units. After yesterday’s firmware updates, performance is at busbw ~15.6 GB/s / algbw ~31 GB/s. I unplugged power and devices for ~5 mins, with the same result.

I reverted the firmware on the GX10 back to 0x03000004.

busbw is now 24.3 GB/s and algbw is 48.6 GB/s. So a performance regression is still present in the current firmware (0x03000005) from Asus for the GX10.

I plan to remain on the prior version until it is resolved.
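
The factor of two between busbw and algbw in these numbers is expected for a two-node all-gather. According to the nccl-tests performance notes, all-gather bus bandwidth is algbw × (n − 1) / n for n ranks, which is 0.5 for n = 2. The sketch below assumes the figures came from a two-rank all_gather_perf run like the command earlier in the thread; `allgather_busbw` is a hypothetical helper name.

```shell
# allgather_busbw ALGBW NRANKS: bus bandwidth for an all-gather under the
# nccl-tests convention busbw = algbw * (n - 1) / n. Hypothetical helper.
allgather_busbw() {
  awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.1f\n", algbw * (n - 1) / n }'
}

allgather_busbw 48.6 2   # algbw reported after reverting to 0x03000004
allgather_busbw 31 2     # approximate algbw reported after the update
```

That reproduces the 24.3 GB/s above and gives 15.5 GB/s for the slower run, close to the ~15.6 GB/s reported, so the regression shows up equally in both metrics.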

driver: mlx5_core
version: 6.17.0-1008-nvidia
firmware-version: 28.45.4028 (NVD0000000087)

AFAIK, the firmware update was only released for FE units.