Parallel training with four 4090 cards fails on an AMD 5975WX; it gets stuck at the very beginning

I use an AMD 5975WX CPU and four 4090 graphics cards. The CUDA version is 12 and the PyTorch version is 2.0.


I noticed that the last frames of the call stack are in CUDA.

See the attachments for the bug report; my code is also attached.
ex002_DataParallel.py (6.4 KB)

nvidia-bug-report.log.gz (1.2 MB)

(1) Post text as text, not as images.
(2) Cut and paste relevant error messages instead of attaching some giant log file.

Out of interest: What kind of power supply is used in this system? 3200 Watts?

cudalog.rtf (14.7 KB)
OK, here is the error report as text.

From what I have learned, this problem does not occur with Intel CPUs. The solution given in that issue is to turn off the motherboard’s IOMMU, but I need that option enabled, so I want to know whether there are other solutions.
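As a quick way to confirm whether the IOMMU is actually active on a given boot, one can count the IOMMU groups Linux exposes in sysfs. This is a sketch; the helper `iommu_group_count` is hypothetical and not from the attached script, though the sysfs path is the standard one on modern kernels:

```python
from pathlib import Path

def iommu_group_count(sysfs_root="/sys/kernel/iommu_groups"):
    """Count IOMMU groups in sysfs; 0 usually means the IOMMU is disabled."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return 0
    return sum(1 for entry in root.iterdir() if entry.is_dir())

if __name__ == "__main__":
    n = iommu_group_count()
    state = "enabled" if n else "disabled or absent"
    print(f"{n} IOMMU groups found -> IOMMU appears {state}")
```

Running this before and after changing the BIOS setting makes it easy to verify that the toggle actually took effect.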

I use two 1600 W power supplies. One powers two of the graphics cards and the motherboard; the other powers the remaining two graphics cards and the other devices.

As far as I understand the linked thread, the working hypothesis there is that the issue is due to an incompatibility between NVIDIA’s NCCL and AMD’s IOMMU. That is outside my area of expertise, and frankly, appears to have nothing to do with CUDA.

Off-hand I do not know of a sub-forum for NCCL (in fact I did not know of NCCL’s existence until just now). If there is more corroborating evidence strengthening the hypothesis, that might lead to a classical “it’s the other vendor’s fault” finger-pointing exercise.

Consider filing a bug with NVIDIA.

The motherboard I’m using is Supermicro’s M12SWA-TF. I’m sure the BIOS has been updated to the latest version from the official website, and ACS has been turned off, but my situation is still the same. I am also not sure whether the 4090 supports P2P; since the link you shared runs simpleP2P, I have run simpleP2P, deviceQuery, and p2pBandwidthLatencyTest. The results are in the attachments.


IOMMU_enable_ACS_disable.txt (16.0 KB)
IOMMU_disable_ACS_enable.txt (16.1 KB)
IOMMU_disable_ACS_disable.txt (16.1 KB)
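As a cross-check on the CUDA samples, peer access can also be queried programmatically. The matrix-building helper below is a sketch of my own (the probe function is injected so the logic can be demonstrated without GPUs; on a real system you would pass PyTorch’s `torch.cuda.can_device_access_peer` as the probe):

```python
def p2p_matrix(n_devices, can_access):
    """Build an n x n peer-access matrix from a probe can_access(i, j) -> bool.

    On a CUDA machine, pass torch.cuda.can_device_access_peer as the probe.
    """
    return [[1 if i != j and can_access(i, j) else 0 for j in range(n_devices)]
            for i in range(n_devices)]

# Demo with a stub probe that mimics two P2P-capable pairs (0<->1 and 2<->3):
pairs = {(0, 1), (1, 0), (2, 3), (3, 2)}
for row in p2p_matrix(4, lambda i, j: (i, j) in pairs):
    print(row)
# -> [0, 1, 0, 0]
#    [1, 0, 0, 0]
#    [0, 0, 0, 1]
#    [0, 0, 1, 0]
```

A matrix of zeros off the diagonal would mean no peer access at all, matching what the attached p2pBandwidthLatencyTest logs show.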

The failures in your attachments indicate a problem with the motherboard. You should address this with Supermicro.

I don’t have this problem with 4 A100s; only the 4090s have it.

The platform is unchanged; I only swapped the graphics cards.

Please help me.

I have communicated with Supermicro technical staff. They acknowledged the ACS problem, but after checking my settings they confirmed that ACS has been turned off. They asked me to test with 4 A100s, and with those simpleP2P does run; but when I switch back to the 4090s, the problem remains.
Please help me.

FWIW, I have two 4090s and I’m having the same problems training deep-learning models on a system that worked fine with two 3090s.
I’ve tried all the BIOS settings (which I didn’t need with the 3090s).
I’ve tried NCCL_P2P_DISABLE=1, which sometimes works.
I also get a very hard lockup training with TensorFlow 2.5 that requires unplugging the machine to reboot it.
I believe this is an NVIDIA driver/hardware issue with the 4090.
Threadripper 3970X, ASUS motherboard.
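For reference, the NCCL_P2P_DISABLE workaround mentioned above can also be set from inside the training script, as long as it happens before NCCL initializes any communicators. A minimal sketch (as noted, it only sometimes helps on 4090s):

```python
import os

# Disable NCCL's CUDA peer-to-peer transport so inter-GPU traffic is staged
# through host memory instead of direct P2P copies. This must run before
# torch/NCCL sets up its communicators.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Optional: have NCCL log its transport decisions so the setting can be verified.
os.environ["NCCL_DEBUG"] = "INFO"

# ... import torch and build the DataParallel/DistributedDataParallel model here ...
```

Setting the variables in the environment of the launching shell (`NCCL_P2P_DISABLE=1 python train.py`) is equivalent.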

It’s a bummer. I’m about to get my WRX80 WS next week with two 4090s and a 5975WX. Why isn’t this issue widespread? Aren’t workstation builders seeing it when assembling and selling AMD workstations with multiple 4090s?

Is turning amd_iommu off the solution in some configurations?

IOMMU was one of the BIOS settings I tried, and it didn’t help.
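For completeness: where the BIOS toggle is ineffective or absent, the AMD IOMMU can also be disabled via a kernel boot parameter. This is a GRUB sketch only, and, as discussed above, the original poster needs the IOMMU enabled, so it is at best a diagnostic step:

```shell
# /etc/default/grub -- add amd_iommu=off to the kernel command line,
# then regenerate the GRUB config (e.g. update-grub on Debian/Ubuntu) and reboot.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"
```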