I definitely see a drop. I used to hit 45 GB/s bidirectional and now it’s more like 30-35 GB/s. That’s consistent with the comments about going from 24 to 16 GB/s unidirectional.
Interestingly, for really small transfers like 2 or 4 MB, the new firmware is occasionally faster. But at 128 MB and above it plateaus and stops getting faster as the batch size increases.
Yep, seeing significant slowdowns in TCP/IP on my system too. Transferring the same size Docker container from one node to another now takes almost twice as long.
@NVES have you been able to reproduce the problem in your labs? It seems that if I use ib_write_bw, assign IP addresses to both interfaces in software, and make sure the MTU is set to a large value, I can get back to near wire speed. I don’t know if the new firmware accidentally dropped the MTU down to 1500 or if something about the internal PCI Express bridging got disrupted by the new power management feature. I have not been able to get NCCL to run at the same speed as before. I’m getting that 25 to 30% drop in performance when running the same code, hoping that NCCL will magically detect the multiple interfaces and try to use them.
Is it possible to have the software driver power the interface down when the interface is disabled and power it back up, ready to go, when the interface is enabled? I can definitely see how people would like to power down their ConnectX-7 interface if it is unused, to reduce heat and power consumption.
I did not have time to play with it much so I don’t know if the MTU alone might explain some of the problems that have been reported. I never really checked on the old firmware what my MTU was set to. The current release seems to have it at 1500.
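A quick way to check the MTU on every interface is to read it from sysfs. Here is a minimal Python sketch, assuming the usual Linux /sys/class/net layout. The 9000-byte threshold is my own assumption for "a large number" (a common jumbo-frame size), not a value confirmed in this thread, and the helper names are just for illustration.

```python
from pathlib import Path


def read_mtus(sysfs_root="/sys/class/net"):
    """Return {interface_name: mtu} for each interface found under sysfs_root."""
    root = Path(sysfs_root)
    if not root.is_dir():
        # Not Linux, or sysfs is not mounted: nothing to report.
        return {}
    mtus = {}
    for iface in sorted(root.iterdir()):
        try:
            mtus[iface.name] = int((iface / "mtu").read_text().strip())
        except (OSError, ValueError):
            # Skip entries that are not interfaces or have no readable MTU.
            continue
    return mtus


def below_threshold(mtus, threshold=9000):
    """Return the interfaces whose MTU is below threshold (9000 is an assumption)."""
    return {name: mtu for name, mtu in mtus.items() if mtu < threshold}


if __name__ == "__main__":
    all_mtus = read_mtus()
    small = below_threshold(all_mtus)
    for name, mtu in all_mtus.items():
        note = "  <-- below 9000" if name in small else ""
        print(f"{name}: {mtu}{note}")
```

If the ConnectX-7 ports show up at 1500 here, that would at least confirm what the current release is defaulting to on your box.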
What happens if you run this? I was able to get this to hit faster speeds, but in my own code, which I have to debug more carefully, I got the 25-30% drop in performance compared to when I tested it with the old DGX OS.
For what it’s worth, I booted to the old 6.14 kernel and it didn’t change anything, so it’s likely due to the new driver and power management feature (?).
I am also running NCCL 2.28.3 in case that matters. Haven’t rebuilt from the latest.
Do you have any plans for when the improved firmware will be released? Of course, there are other issues to address, but this bandwidth issue seems to be causing a lot of pain. For example, copying files to other nodes takes too much time.
I have two Founders Edition units, with the latest software updates on both. Still getting ~13 Gbps. How do I get this specific update with the fix (March 12 FE update)? apt dist-upgrade and fwupdmgr upgrade both show nothing available.
What firmware version does ethtool -i show for the mlx NICs?
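If you want to collect that across several nodes, here is a rough Python sketch that pulls the firmware-version field out of ethtool -i output. It assumes ethtool is installed and that the output is the usual "key: value" lines. The function names are hypothetical, and you would need to pass in your own interface name.

```python
import subprocess


def parse_ethtool_info(text):
    """Parse `ethtool -i` style "key: value" lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI
        # addresses (which contain colons) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def firmware_version(iface):
    """Run `ethtool -i <iface>` and return its firmware-version field, or None."""
    result = subprocess.run(
        ["ethtool", "-i", iface],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ethtool_info(result.stdout).get("firmware-version")
```

Calling firmware_version() with the name of one of your ConnectX-7 ports (check ip link for the names on your system) should return the version string to compare against what others report here.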
Just updated my Asus. There were also new firmware versions in that update (one for the power supply, it seems), but for the Mellanox NIC it still seems to be 28.45.4028.
Did you cut the power after the update? As in “disconnect the power supply”.
These beasts also have some kind of “smart” controllers of their own in the power supplies, and updates seem to require a disconnect from the power outlet for a few minutes so they restart with the new firmware.
I can confirm the same issue on three different Asus GX10 units. After yesterday’s firmware updates, performance is at busbw ~15.6 GB/s / algbw ~31 GB/s. I unplugged power and devices for ~5 mins, with the same result.
I reverted the firmware on the GX10 back to 0x03000004.
busbw is now 24.3 GB/s and algbw is 48.6 GB/s. So there is still a performance regression in the current firmware (0x03000005) from Asus for the GX10.
I plan to remain on the prior version until it is resolved.
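To put the numbers side by side, here is a small Python sketch that computes the relative drop between the two firmware versions. The GX10 readings are the ones above, and the unidirectional pair is the 24 to 16 GB/s figure mentioned earlier in the thread. The helper name is just for illustration.

```python
def percent_drop(before, after):
    """Relative drop from before to after, as a percentage of before."""
    return (before - after) / before * 100.0


# (old firmware, new firmware) in GB/s, as reported in this thread.
readings = {
    "GX10 busbw": (24.3, 15.6),
    "GX10 algbw": (48.6, 31.0),
    "unidirectional": (24.0, 16.0),
}

for name, (old, new) in readings.items():
    drop = percent_drop(old, new)
    print(f"{name}: {old} -> {new} GB/s ({drop:.1f}% drop)")
```

All three work out to roughly a third of the bandwidth lost, which is a bit worse than the 25 to 30% drop reported for NCCL workloads above.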