Parallel training with 4 cards 4090 cannot be performed on AMD 5975WX， stuck at the beginning

346283191 · December 23, 2022, 7:09am

I use a CPU of AMD 5975WX, and four 4090 graphics cards. cuda version is cuda12, pytoch version is 2.0

I noticed that the last call stack is on cuda

See attachment for bug report，My code is also in the attachment
ex002_DataParallel.py (6.4 KB)

njuffa · December 23, 2022, 7:26am

(1) Post text as text, not as images.
(2) Cut and paste relevant error messages instead of attaching some giant log file.

Out of interest: What kind of power supply is used in this system? 3200 Watts?

346283191 · December 23, 2022, 7:42am

cudalog.rtf (14.7 KB)
ok, here is the wrong text report。

github.com/pytorch/pytorch

Deadlock in a single machine multi-gpu using dataparlel when cpu is AMD

opened 03:35PM - 11 Feb 21 UTC

ROBOTICSENGINEER

oncall: distributed module: multi-gpu module: cuda triaged module: deadlock module: data parallel module: ddp

## 🐛 Bug Training CNN (include torchvision resnet18 and timm efficientnet) wi…th a single machine and multi-gpu using dataparallel cause deadlock in machines with AMD cpu, while the same code works well in the machines with Intel cpu. The code run until forward pass, i.e., `output = model(images)` , inside the for loop in the training. It remains in the `model(images)` forever with gpu utilization go to 0% (memory is occupied, not 0), three cpu cores go to 100%, and other cpu cores go to 0%. The processes PID and GPU mempry usage remains after stopping with `ctrl+c` and `ctrll+z`. The `kill` , `pkill` , and `fuser -k /dev/nvidia*` commands cause zombie processes, also known as defunct and z state. The zombie processes have the parent pid of 1, so it cannot be killed. The only solution is to reboot the system. The code works well in 3 machines with Intel cpu and has this issue in 4 machines with AMD cpu. We tested on GTX 1080, Titan V, Titan RTX, Quadro RTX 8000, and RTX 3090. So, it is independent of gpu model. **Note**: There is similar issue with Distributed Data Parallel (DDP). ## To Reproduce Steps to reproduce the behavior: 1. use a machine with AMD cpu and multiple NVIDIA gpu 2. Linux, Python3.8, cuda 11.0, pytorch 1.7.1, torchvision 0.8.2 3. write a code to train a resnet18 model in torchvisaion 4. please test both Data Parallel (DP) and Distributed Data Parallel (DP) ## Expected behavior 1. code go to deadlock at forward pass of in the first epoch and the first iteration of training when using AMD cpu. 2. same code work well when using intel cpu ## Environment #### Intel cpu environment (system 1) Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz ``` PyTorch version: 1.7.1+cu110 Is debug build: False CUDA used to build PyTorch: 11.0 ROCM used to build PyTorch: N/A OS: Ubuntu 18.04.5 LTS (x86_64) GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 Clang version: Could not collect CMake version: version 3.10.2 Python version: 3.8 (64-bit runtime) Is CUDA available: True CUDA runtime version: 9.1.85 GPU models and configuration: GPU 0: GeForce RTX 3090 GPU 1: GeForce RTX 3090 GPU 2: GeForce RTX 3090 GPU 3: GeForce RTX 3090 Nvidia driver version: 455.45.01 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Versions of relevant libraries: [pip] Could not collect [conda] Could not collect ``` #### Intel cpu environment (system 2) Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz ``` PyTorch version: 1.7.1+cu110 Is debug build: False CUDA used to build PyTorch: 11.0 ROCM used to build PyTorch: N/A OS: Ubuntu 18.04.5 LTS (x86_64) GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 Clang version: Could not collect CMake version: Could not collect Python version: 3.8 (64-bit runtime) Is CUDA available: True CUDA runtime version: Could not collect GPU models and configuration: GPU 0: TITAN RTX GPU 1: TITAN RTX GPU 2: TITAN RTX GPU 3: Quadro RTX 8000 Nvidia driver version: 450.80.02 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Versions of relevant libraries: [pip] Could not collect [conda] Could not collect ``` #### Intel cpu environment (system 3) Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz ``` PyTorch version: 1.7.1+cu110 Is debug build: False CUDA used to build PyTorch: 11.0 ROCM used to build PyTorch: N/A OS: Ubuntu 18.04.5 LTS (x86_64) GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 Clang version: Could not collect CMake version: Could not collect Python version: 3.8 (64-bit runtime) Is CUDA available: True CUDA runtime version: Could not collect GPU models and configuration: GPU 0: Tesla K80 GPU 1: Tesla K80 GPU 2: Tesla K80 GPU 3: Tesla K80 GPU 4: Tesla K80 GPU 5: Tesla K80 GPU 6: Tesla K80 GPU 7: Tesla K80 GPU 8: Tesla K80 GPU 9: Tesla K80 GPU 10: Tesla K80 GPU 11: Tesla K80 GPU 12: Tesla K80 GPU 13: Tesla K80 GPU 14: Tesla K80 GPU 15: Tesla K80 Nvidia driver version: 450.102.04 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Versions of relevant libraries: [pip] numpy==1.14.3 [conda] Could not collect ``` #### AMD cpu environment (system 4) AMD Eng Sample: 100-000000053-04_32/20_N ``` PyTorch version: 1.7.1+cu110 Is debug build: False CUDA used to build PyTorch: 11.0 ROCM used to build PyTorch: N/A OS: Ubuntu 18.04.5 LTS (x86_64) GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 Clang version: Could not collect CMake version: version 3.10.2 Python version: 3.8 (64-bit runtime) Is CUDA available: True CUDA runtime version: Could not collect GPU models and configuration: GPU 0: GeForce GTX 1080 Ti GPU 1: GeForce GTX 1080 Ti GPU 2: GeForce GTX 1080 Ti GPU 3: GeForce GTX 1080 Ti GPU 4: GeForce GTX 1080 Ti Nvidia driver version: 450.102.04 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Versions of relevant libraries: [pip] Could not collect [conda] Could not collect ``` #### AMD cpu environment (system 5) AMD Eng Sample: 100-000000053-04_32/20_N ``` PyTorch version: 1.7.1+cu110 Is debug build: False CUDA used to build PyTorch: 11.0 ROCM used to build PyTorch: N/A OS: Ubuntu 18.04.5 LTS (x86_64) GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 Clang version: Could not collect CMake version: Could not collect Python version: 3.8 (64-bit runtime) Is CUDA available: True CUDA runtime version: Could not collect GPU models and configuration: GPU 0: TITAN V GPU 1: TITAN V Nvidia driver version: 455.45.01 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Versions of relevant libraries: [pip] Could not collect [conda] Could not collect ``` #### AMD cpu environment (system 6) AMD Opteron(tm) Processor 6380 ``` PyTorch version: 1.7.1+cu110 Is debug build: False CUDA used to build PyTorch: 11.0 ROCM used to build PyTorch: N/A OS: Ubuntu 18.04.5 LTS (x86_64) GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 Clang version: Could not collect CMake version: version 3.10.2 Python version: 3.8 (64-bit runtime) Is CUDA available: True CUDA runtime version: Could not collect GPU models and configuration: GPU 0: TITAN V GPU 1: TITAN V Nvidia driver version: 450.102.04 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Versions of relevant libraries: [pip] Could not collect [conda] Could not collect ``` #### AMD cpu environment (system 7) AMD Opteron(tm) Processor 6380 ``` PyTorch version: 1.7.1+cu110 Is debug build: False CUDA used to build PyTorch: 11.0 ROCM used to build PyTorch: N/A OS: Ubuntu 18.04.5 LTS (x86_64) GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 Clang version: Could not collect CMake version: version 3.10.2 Python version: 3.8 (64-bit runtime) Is CUDA available: True CUDA runtime version: Could not collect GPU models and configuration: GPU 0: GeForce GTX 1080 Ti GPU 1: GeForce GTX 1080 Ti GPU 2: GeForce GTX 1080 Ti Nvidia driver version: 450.102.04 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.1.0 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.0 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.0 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.0 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.0 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.0 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Versions of relevant libraries: [pip] Could not collect [conda] Could not collect ``` cc @ngimel @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu

Some information I have learned is that there is no such problem with intel CPUs. The solution given in this Issue is to turn off the IOMMU of the motherboard, but this option is what I need, so I want to know if there are other solutions

I use two 1600W power supplies. A power supply is connected to 2 graphics cards, and the motherboard.
Another power supply is connected to the remaining 2 graphics cards and other devices

njuffa · December 23, 2022, 8:23am

As far as I understand the linked thread, the working hypothesis there is that the issue is due to an incompatibility between NVIDIA’s NCCL and AMD’s IOMMU. That is outside my area of expertise, and frankly, appears to have nothing to do with CUDA.

Off-hand I do not know of a sub-forum for NCCL (in fact I did not know of NCCL’s existence until just now). If there is more corroborating evidence strengthening the hypothesis, that might lead to a classical “it’s the other vendor’s fault” finger-pointing exercise.

Consider filing a bug with NVIDIA.

Robert_Crovella · December 23, 2022, 4:54pm

346283191 · December 24, 2022, 2:22am

The motherboard I’m using is Supermicro’s M12SWA-TF, I’m sure the BIOS has been updated to the latest from the official website, and the ASC has been turned off. But my situation is still the same, and I am not sure whether the 4090 supports P2P, because I see that the link you shared is running simpleP2P, and I have run simpleP2P
deviceQuery
p2pBandwidthLatencyTest. The result is in the attachment

IOMMU_enable_ACS_disable.txt (16.0 KB)
IOMMU_disable_ACS_enable.txt (16.1 KB)
IOMMU_disable_ACS_disable.txt (16.1 KB)

Robert_Crovella · December 25, 2022, 3:16pm

The failures in your attachments indicate a problem with the motherboard. You should address this with Supermicro.

346283191 · December 26, 2022, 9:25am

I don’t have this problem with 4 A100s, only 4090s currently have this problem

The platform is still the original platform, I only changed the graphics card

please help me

346283191 · December 29, 2022, 5:58pm

I have communicated with Supermicro technical staff, they recognized the problem of ACS, but after checking my settings, they confirmed that ASC has been turned off, and they asked me to take 4 A100s for experiment, it is indeed possible to run p2psimple, but when I change Back to 4090, the problem remains
please help me

jaybob20 · January 29, 2023, 7:01pm

FWIW, I have 2 4090s and I’m having same problems training deep learning models on a system that worked fine with 2 3090s.
I’ve tried all the bios settings (which I didn’t need with 3090s)
I’ve tried NCCL_P2P_DISABLE=1 that sometimes works.
I also get a very hard lockup training with Tensorflow 2.5 that requires an unplug to reboot the system.
I believe this is a Nvidia driver/hardware issue for 4090
Threadripper 3970X, ASUS motherboard.

akshu2023 · January 29, 2023, 10:43pm

Its a bummer. About to get my WRX80 WS next week with 2 4090s and 5975WX. Why is this issue not widespread? Aren’t workstation builders not seeing this issue when assembling/selling AMD WS with multiple 4090s?

akshu2023 · January 29, 2023, 11:05pm

Is turning amd_iommu off the solution in some configurations?

jaybob20 · January 30, 2023, 6:56am

iommu was one of the bios settings I tried and it didn’t help.

akshu2023 · February 6, 2023, 9:17pm

@jaybob20 , you might be aware from the other ticket. An issue is being worked on by nVidia.

system · February 20, 2023, 9:17pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
我用的CPU是AMD 5975WX，显卡是4块4090。cuda版本为cuda12，pytoch版本为2.0 Linux chinese	6	2027	December 29, 2022
Standard nVidia CUDA tests fail with dual RTX 4090 Linux box Linux	54	21445	April 29, 2024
P2P Communication Fails 1080ti->1080ti. IOMMU & ACS disabled Linux	2	1581	October 12, 2021
TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck - EXTRA GPU TAO Toolkit	14	1007	November 7, 2023
Strange problem with CUDA on 3 GPUS (5090, 6000 Ada, RTX8000) CUDA Setup and Installation cuda , pytorch	2	360	June 23, 2025
simpleP2P example and multi-GPU network training causes system freeze and ERR in nvidia-smi Linux	7	3919	October 14, 2021
CUDA P2P crash with threadripper CUDA Programming and Performance	5	1117	November 17, 2017
RTX4090, torch, kernel tried to execute NX-protected page Linux	7	1235	May 13, 2024
System hangs after executing p2p Bandwidth test on Tesla k40 NVIDIA gpus CUDA Programming and Performance	4	1422	February 3, 2018
Low P2P GPU bandwidth performance between GeForce GPUs CUDA Programming and Performance	20	1100	October 9, 2024

Parallel training with 4 cards 4090 cannot be performed on AMD 5975WX， stuck at the beginning

Related topics