cudalog.rtf (14.7 KB)
OK, here is the error report as text.
From what I have learned, Intel CPUs do not have this problem. The solution given in this issue is to turn off the motherboard’s IOMMU, but I need the IOMMU enabled, so I would like to know whether there are other solutions.
I use two 1600 W power supplies. One is connected to two of the graphics cards and the motherboard.
The other is connected to the remaining two graphics cards and the other devices.
As far as I understand the linked thread, the working hypothesis there is that the issue is due to an incompatibility between NVIDIA’s NCCL and AMD’s IOMMU. That is outside my area of expertise, and frankly, appears to have nothing to do with CUDA.
Off-hand I do not know of a sub-forum for NCCL (in fact I did not know of NCCL’s existence until just now). If there is more corroborating evidence strengthening the hypothesis, that might lead to a classical “it’s the other vendor’s fault” finger-pointing exercise.
The motherboard I’m using is Supermicro’s M12SWA-TF. I’m sure the BIOS has been updated to the latest version from the official website, and ACS has been turned off, but my situation is still the same. I am also not sure whether the 4090 supports P2P: the link you shared runs simpleP2P, and I have run simpleP2P and p2pBandwidthLatencyTest. The results are in the attachment.
I have communicated with Supermicro technical staff. They acknowledged the ACS problem, but after checking my settings they confirmed that ACS has been turned off. They asked me to repeat the experiment with 4 A100s, and with those simpleP2P does run, but when I change back to the 4090s the problem remains.
Please help me.
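For anyone comparing the A100 and 4090 cases, here is a minimal sketch that prints which GPU pairs the driver reports as peer-accessible. It assumes PyTorch is installed and uses `torch.cuda.can_device_access_peer`; the helper name `peer_access_matrix` is made up for illustration. It only shows what the driver reports, not whether P2P transfers actually succeed, so it complements simpleP2P rather than replacing it.

```python
def peer_access_matrix(num_devices, can_access):
    """Return an N x N matrix of booleans.

    matrix[i][j] is True when device i is reported as able to access
    device j. The diagonal is set to True without querying the driver.
    `can_access` is any callable taking (device, peer_device).
    """
    return [
        [i == j or bool(can_access(i, j)) for j in range(num_devices)]
        for i in range(num_devices)
    ]


if __name__ == "__main__":
    try:
        import torch  # assumption: PyTorch with CUDA support is available
    except ImportError:
        print("PyTorch is not installed; cannot query the GPUs.")
    else:
        n = torch.cuda.device_count()
        matrix = peer_access_matrix(n, torch.cuda.can_device_access_peer)
        for i, row in enumerate(matrix):
            print(f"GPU {i}: {row}")
```

If every off-diagonal entry is False on the 4090s but True on the A100s, that would at least show the difference is visible at the driver level.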
FWIW, I have 2 4090s and I’m having the same problems training deep learning models on a system that worked fine with 2 3090s.
I’ve tried all the BIOS settings (which I didn’t need with the 3090s).
I’ve tried NCCL_P2P_DISABLE=1, which sometimes works.
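A minimal sketch of how that workaround can be applied from inside a Python training script, assuming the variable has to be in the environment before the framework initializes NCCL. The helper name `disable_nccl_p2p` is hypothetical; `NCCL_P2P_DISABLE` is the only real setting here.

```python
import os


def disable_nccl_p2p(env=None):
    """Set NCCL_P2P_DISABLE=1 in the given mapping (default: os.environ).

    With this set, NCCL avoids direct GPU-to-GPU (P2P) transfers and
    falls back to other transports, which is usually slower.
    """
    env = os.environ if env is None else env
    env["NCCL_P2P_DISABLE"] = "1"
    return env


# Call this before importing or initializing the deep learning framework,
# so the setting is already present when NCCL starts up.
disable_nccl_p2p()
```

Setting the variable in the shell before launching the script has the same effect.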
I also get a very hard lockup when training with TensorFlow 2.5; the system has to be unplugged before it will reboot.
I believe this is an NVIDIA driver/hardware issue with the 4090.
Threadripper 3970X, ASUS motherboard.
It’s a bummer. I’m about to get my WRX80 WS next week with 2 4090s and a 5975WX. Why is this issue not widespread? Aren’t workstation builders seeing this issue when assembling and selling AMD workstations with multiple 4090s?