My CPU is an AMD 5975WX, with four RTX 4090 GPUs. The CUDA version is 12 and the PyTorch version is 2.0.
I noticed that the last frame of the call stack is on CUDA.
The error report is attached, along with my code.
ex002_DataParallel.py (6.4 KB)
nvidia-bug-report.log.gz (1.2 MB)
cudalog.rtf (14.7 KB)
opened 03:35PM - 11 Feb 21 UTC
closed 12:47PM - 26 Dec 22 UTC
oncall: distributed
module: multi-gpu
module: cuda
triaged
module: deadlock
module: data parallel
module: ddp
## 🐛 Bug
Training a CNN (including torchvision resnet18 and timm efficientnet) on a single machine with multiple GPUs using DataParallel causes a deadlock on machines with AMD CPUs, while the same code works well on machines with Intel CPUs.
The code runs until the forward pass, i.e., `output = model(images)`, inside the training for loop. It remains in `model(images)` forever: GPU utilization drops to 0% (memory stays occupied, not 0), three CPU cores go to 100%, and the other CPU cores go to 0%. The process PIDs and GPU memory usage remain after stopping with `ctrl+c` and `ctrl+z`. The `kill`, `pkill`, and `fuser -k /dev/nvidia*` commands leave zombie processes (also known as defunct, or Z state). The zombie processes have a parent PID of 1, so they cannot be killed. The only solution is to reboot the system.
The code works well on 3 machines with Intel CPUs and shows this issue on 4 machines with AMD CPUs.
We tested on GTX 1080, Titan V, Titan RTX, Quadro RTX 8000, and RTX 3090, so the issue is independent of the GPU model.
**Note**: There is a similar issue with Distributed Data Parallel (DDP).
## To Reproduce
Steps to reproduce the behavior:
1. Use a machine with an AMD CPU and multiple NVIDIA GPUs
2. Linux, Python 3.8, CUDA 11.0, PyTorch 1.7.1, torchvision 0.8.2
3. Write code to train a resnet18 model from torchvision
4. Please test both Data Parallel (DP) and Distributed Data Parallel (DDP)
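For reference, a minimal sketch of the kind of DataParallel training loop the report describes (the tiny CNN here is a stand-in so the snippet is self-contained; substitute `torchvision.models.resnet18()` to match the original). On an affected AMD machine the hang would occur at the `model(images)` call; the sketch itself is just standard `nn.DataParallel` usage:

```python
import torch
import torch.nn as nn

# Stand-in model; the report uses torchvision resnet18 / timm efficientnet.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.DataParallel(model).to(device)  # splits each batch across visible GPUs
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One synthetic batch in place of a real DataLoader.
images = torch.randn(16, 3, 32, 32, device=device)
labels = torch.randint(0, 10, (16,), device=device)

output = model(images)          # the reported deadlock happens at this call
loss = criterion(output, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(output.shape)             # torch.Size([16, 10])
```

Without CUDA devices, `nn.DataParallel` simply forwards to the wrapped module, so the snippet also runs on a CPU-only machine for comparison.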
## Expected behavior
1. The code deadlocks at the forward pass in the first epoch and first iteration of training when using an AMD CPU.
2. The same code works well when using an Intel CPU.
## Environment
#### Intel cpu environment (system 1)
Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce RTX 3090
GPU 1: GeForce RTX 3090
GPU 2: GeForce RTX 3090
GPU 3: GeForce RTX 3090
Nvidia driver version: 455.45.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
#### Intel cpu environment (system 2)
Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: TITAN RTX
GPU 1: TITAN RTX
GPU 2: TITAN RTX
GPU 3: Quadro RTX 8000
Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
#### Intel cpu environment (system 3)
Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80
GPU 4: Tesla K80
GPU 5: Tesla K80
GPU 6: Tesla K80
GPU 7: Tesla K80
GPU 8: Tesla K80
GPU 9: Tesla K80
GPU 10: Tesla K80
GPU 11: Tesla K80
GPU 12: Tesla K80
GPU 13: Tesla K80
GPU 14: Tesla K80
GPU 15: Tesla K80
Nvidia driver version: 450.102.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] numpy==1.14.3
[conda] Could not collect
```
#### AMD cpu environment (system 4)
AMD Eng Sample: 100-000000053-04_32/20_N
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
GPU 4: GeForce GTX 1080 Ti
Nvidia driver version: 450.102.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
#### AMD cpu environment (system 5)
AMD Eng Sample: 100-000000053-04_32/20_N
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
Nvidia driver version: 455.45.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
#### AMD cpu environment (system 6)
AMD Opteron(tm) Processor 6380
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
Nvidia driver version: 450.102.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
#### AMD cpu environment (system 7)
AMD Opteron(tm) Processor 6380
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
Nvidia driver version: 450.102.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
cc @ngimel @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu
Some information I have learned is that there is no such problem with Intel CPUs. The solution given in this issue is to turn off the motherboard's IOMMU, but I need that option enabled, so I would like to know whether there are other solutions.
I'm not sure whether there is a compatibility issue between AMD's CPU and the 4090.
I am very eager to get help.
generix
December 23, 2022, 12:40pm
2
This is not a CPU issue but a mainboard/BIOS one. Browsing the manual of your board, please check the ACS setting in the BIOS.
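For reference, one way to check the effective ACS state from a running Linux system, independent of what the BIOS menu claims (a hedged sketch; full capability details may require root):

```shell
# Show PCIe ACS control bits for all devices that expose the capability.
# Lines with flags like "SrcValid+" mean ACS is actively enabled on that port.
lspci -vvv 2>/dev/null | grep -i "acsctl" || echo "no ACSCtl entries reported"
```

If every `ACSCtl` line shows only `-` flags (or no entries are reported at all), ACS is effectively off from the OS's point of view.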
ACS was already disabled when I asked this question.
generix
December 23, 2022, 12:51pm
4
Then you should likely check with Supermicro if this is a supported setup. Manually fiddling with the ACS bit:
https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748/15?u=generix
IOMMU_enable_ACS_disable.txt (16.0 KB)
IOMMU_disable_ACS_enable.txt (16.1 KB)
IOMMU_disable_ACS_disable.txt (16.1 KB)
The motherboard I'm using is Supermicro's M12SWA-TF. I'm sure the BIOS has been updated to the latest version from the official website, and ACS has been turned off, but my situation is still the same. I'm also not sure whether the 4090 supports P2P; the link you shared runs simpleP2P, and I have run simpleP2P, deviceQuery, and p2pBandwidthLatencyTest. The results are in the attachment.
I don't have this problem with 4 A100s; only the 4090s currently have this problem.
The platform is still the original platform; I only changed the graphics cards.
I have communicated with Supermicro technical staff. They acknowledged the ACS problem, but after checking my settings they confirmed that ACS has been turned off. They asked me to test with 4 A100s, and simpleP2P does indeed run, but when I switch back to the 4090s the problem remains.
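As a sanity check alongside the CUDA samples, PyTorch itself can report whether it believes P2P access is possible between each GPU pair (a sketch using `torch.cuda.can_device_access_peer`; on a machine without two or more visible GPUs it only prints a notice):

```python
import torch

# Query peer-to-peer accessibility for every ordered GPU pair.
# On setups where 4090 P2P is unavailable, these are expected to report False.
n = torch.cuda.device_count() if torch.cuda.is_available() else 0
pairs = []
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            pairs.append((i, j, ok))
            print(f"GPU {i} -> GPU {j}: peer access {'OK' if ok else 'NOT available'}")
if n < 2:
    print("fewer than two CUDA GPUs visible; nothing to check")
```

If this reports no peer access while p2pBandwidthLatencyTest also shows degraded P2P numbers, the two tools agree that the hang is happening on a topology without working P2P.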
Please help me.