My CPU is an AMD 5975WX and I have four 4090 graphics cards. The CUDA version is CUDA 12 and the PyTorch version is 2.0.


I noticed that the last frame of the call stack is in CUDA.
The error report and my code are attached:
ex002_DataParallel.py (6.4 KB)
nvidia-bug-report.log.gz (1.2 MB)
cudalog.rtf (14.7 KB)

From what I have learned, this problem does not occur with Intel CPUs. The solution given in this issue is to turn off the motherboard's IOMMU, but I need that option enabled, so I would like to know whether there are other solutions.

I’m not sure whether there is a compatibility issue between AMD CPUs and the 4090.

I would really appreciate any help.

This is not a CPU issue but a mainboard/BIOS one. Having browsed the manual for your board, I'd suggest checking the ACS setting in the BIOS.

ACS was already disabled when I posted this question.

Then you should likely check with Supermicro if this is a supported setup. Manually fiddling with the ACS bit:
https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748/15?u=generix
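For anyone who wants to see what the OS reports before touching the ACS bit by hand, here is a minimal sketch that scans `lspci -vvv` output for bridges whose `ACSCtl` line shows `SrcValid+`, which usually indicates ACS is active on that device. The function name `devices_with_acs_enabled` is made up for this example, and the parsing assumes the usual `lspci` layout (device header at column 0, capability lines indented). `lspci` generally needs root to print the capability lines, so an empty result without root does not prove ACS is off.

```python
from typing import List


def devices_with_acs_enabled(lspci_text: str) -> List[str]:
    """Return the header lines of PCI devices whose ACSCtl line shows SrcValid+.

    Hypothetical helper: assumes `lspci -vvv` style text, where each device
    starts with an unindented header line and its capabilities are indented.
    """
    flagged = []
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented line: start of a new device entry.
            current = line.strip()
        elif "ACSCtl:" in line and "SrcValid+" in line and current is not None:
            # ACSCtl holds the active control bits (ACSCap only lists capabilities).
            flagged.append(current)
    return flagged


if __name__ == "__main__":
    import subprocess

    try:
        # Run as root to see the capability lines.
        out = subprocess.run(
            ["lspci", "-vvv"], capture_output=True, text=True, check=False
        ).stdout
    except FileNotFoundError:
        out = ""  # lspci is not installed on this machine
    for dev in devices_with_acs_enabled(out):
        print("ACS appears enabled on:", dev)
```

If this prints nothing when run as root, ACS is probably not what is blocking peer-to-peer traffic on that system.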

IOMMU_enable_ACS_disable.txt (16.0 KB)
IOMMU_disable_ACS_enable.txt (16.1 KB)
IOMMU_disable_ACS_disable.txt (16.1 KB)
The motherboard I’m using is Supermicro’s M12SWA-TF. I’m sure the BIOS has been updated to the latest version from the official website, and ACS has been turned off. But my situation is still the same, and I am not sure whether the 4090 supports P2P. I see that the link you shared runs simpleP2P, so I have run simpleP2P, deviceQuery, and p2pBandwidthLatencyTest. The results are in the attachments.
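Since the failing script uses PyTorch, the same peer-access question can also be asked from Python. This is a minimal sketch, not a replacement for the CUDA samples: it builds a pairwise matrix from `torch.cuda.can_device_access_peer`, which only reports what the driver claims and does not exercise an actual transfer. The helper names `p2p_matrix` and `format_matrix` are made up for this example.

```python
from typing import Callable, List


def p2p_matrix(
    num_devices: int, can_access: Callable[[int, int], bool]
) -> List[List[bool]]:
    """matrix[i][j] is True when device i reports direct access to device j."""
    return [
        [i != j and bool(can_access(i, j)) for j in range(num_devices)]
        for i in range(num_devices)
    ]


def format_matrix(matrix: List[List[bool]]) -> str:
    """Render the matrix as text: Y = peer access, N = none, - = same device."""
    lines = []
    for i, row in enumerate(matrix):
        cells = " ".join(
            "-" if i == j else ("Y" if ok else "N") for j, ok in enumerate(row)
        )
        lines.append(f"GPU{i}: {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        n = torch.cuda.device_count()
        # Ask the driver about every ordered pair of GPUs.
        print(format_matrix(p2p_matrix(n, torch.cuda.can_device_access_peer)))
    else:
        print("PyTorch with CUDA is not available; nothing to query.")
```

A row of all `N` on the 4090s, compared with `Y` entries on the A100s in the same machine, would point at the cards or driver rather than the board settings.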

I don’t have this problem with 4 A100s; only the 4090s currently show it.

The platform is otherwise unchanged; I only swapped the graphics cards.

I have communicated with Supermicro technical staff. They acknowledged the ACS issue, but after checking my settings they confirmed that ACS has been turned off. They asked me to run the experiment with 4 A100s, and with those simpleP2P does run, but when I change back to the 4090s the problem remains.
Please help me.