My CPU is an AMD 5975WX, with four RTX 4090 GPUs. The CUDA version is 12 and the PyTorch version is 2.0.
I noticed that the last frame of the call stack is on CUDA.
The error report is attached, along with my code.
ex002_DataParallel.py (6.4 KB)
nvidia-bug-report.log.gz (1.2 MB)
cudalog.rtf (14.7 KB)
opened 03:35PM - 11 Feb 21 UTC
closed 12:47PM - 26 Dec 22 UTC
oncall: distributed
module: multi-gpu
module: cuda
triaged
module: deadlock
module: data parallel
module: ddp
## 🐛 Bug
Training a CNN (including torchvision resnet18 and timm efficientnet) on a single machine with multiple GPUs using DataParallel causes a deadlock on machines with AMD CPUs, while the same code works well on machines with Intel CPUs.
The code runs until the forward pass, i.e., `output = model(images)`, inside the training for loop. It remains in `model(images)` forever: GPU utilization drops to 0% (memory stays occupied, not 0), three CPU cores go to 100%, and the other CPU cores go to 0%. The process PIDs and GPU memory usage remain after stopping with `ctrl+c` and `ctrl+z`. The `kill`, `pkill`, and `fuser -k /dev/nvidia*` commands leave zombie processes (also known as defunct, or Z state). The zombie processes have a parent PID of 1, so they cannot be killed. The only solution is to reboot the system.
The code works well on 3 machines with Intel CPUs and shows this issue on 4 machines with AMD CPUs.
We tested on GTX 1080, Titan V, Titan RTX, Quadro RTX 8000, and RTX 3090, so the issue is independent of the GPU model.
**Note**: There is a similar issue with Distributed Data Parallel (DDP).
## To Reproduce
Steps to reproduce the behavior:
1. Use a machine with an AMD CPU and multiple NVIDIA GPUs
2. Linux, Python 3.8, CUDA 11.0, PyTorch 1.7.1, torchvision 0.8.2
3. Write code to train a resnet18 model from torchvision
4. Please test both Data Parallel (DP) and Distributed Data Parallel (DDP)
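For reference, a minimal sketch of the kind of DataParallel training loop the report describes (the tiny CNN here is a stand-in so the snippet is self-contained; substitute `torchvision.models.resnet18()` to match the original). On an affected AMD machine the hang would occur at the `model(images)` call; the sketch itself is just standard `nn.DataParallel` usage:

```python
import torch
import torch.nn as nn

# Stand-in model; the report uses torchvision resnet18 / timm efficientnet.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.DataParallel(model).to(device)  # splits each batch across visible GPUs
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One synthetic batch in place of a real DataLoader.
images = torch.randn(16, 3, 32, 32, device=device)
labels = torch.randint(0, 10, (16,), device=device)

output = model(images)          # the reported deadlock happens at this call
loss = criterion(output, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(output.shape)             # torch.Size([16, 10])
```

Without CUDA devices, `nn.DataParallel` simply forwards to the wrapped module, so the snippet also runs on a CPU-only machine for comparison.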
## Expected behavior
1. The code deadlocks at the forward pass in the first epoch and first iteration of training when using an AMD CPU.
2. The same code works well when using an Intel CPU.
## Environment
#### Intel cpu environment (system 1)
Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce RTX 3090
GPU 1: GeForce RTX 3090
GPU 2: GeForce RTX 3090
GPU 3: GeForce RTX 3090
Nvidia driver version: 455.45.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
#### Intel cpu environment (system 2)
Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: TITAN RTX
GPU 1: TITAN RTX
GPU 2: TITAN RTX
GPU 3: Quadro RTX 8000
Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
#### Intel cpu environment (system 3)
Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80
GPU 4: Tesla K80
GPU 5: Tesla K80
GPU 6: Tesla K80
GPU 7: Tesla K80
GPU 8: Tesla K80
GPU 9: Tesla K80
GPU 10: Tesla K80
GPU 11: Tesla K80
GPU 12: Tesla K80
GPU 13: Tesla K80
GPU 14: Tesla K80
GPU 15: Tesla K80
Nvidia driver version: 450.102.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] numpy==1.14.3
[conda] Could not collect
```
#### AMD cpu environment (system 4)
AMD Eng Sample: 100-000000053-04_32/20_N
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
GPU 4: GeForce GTX 1080 Ti
Nvidia driver version: 450.102.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
#### AMD cpu environment (system 5)
AMD Eng Sample: 100-000000053-04_32/20_N
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
Nvidia driver version: 455.45.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
#### AMD cpu environment (system 6)
AMD Opteron(tm) Processor 6380
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
Nvidia driver version: 450.102.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
#### AMD cpu environment (system 7)
AMD Opteron(tm) Processor 6380
```
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
Nvidia driver version: 450.102.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.0
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
```
cc @ngimel @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu
Some information I have learned is that there is no such problem with Intel CPUs. The solution given in this issue is to turn off the motherboard's IOMMU, but I need that option enabled, so I would like to know whether there are other solutions.
I'm not sure whether there is a compatibility issue between AMD's CPU and the 4090.
I am very eager to get help.
generix
December 23, 2022, 12:40pm
2
This is not a CPU issue but a mainboard/BIOS one. Browsing the manual of your board, please check the ACS setting in the BIOS.
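For reference, one way to check the effective ACS state from a running Linux system, independent of what the BIOS menu claims (a hedged sketch; full capability details may require root):

```shell
# Show PCIe ACS control bits for all devices that expose the capability.
# Lines with flags like "SrcValid+" mean ACS is actively enabled on that port.
lspci -vvv 2>/dev/null | grep -i "acsctl" || echo "no ACSCtl entries reported"
```

If every `ACSCtl` line shows only `-` flags (or no entries are reported at all), ACS is effectively off from the OS's point of view.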
ACS was already disabled when I asked this question.
generix
December 23, 2022, 12:51pm
4
Then you should likely check with Supermicro if this is a supported setup. Manually fiddling with the ACS bit:
https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748/15?u=generix
IOMMU_enable_ACS_disable.txt (16.0 KB)
IOMMU_disable_ACS_enable.txt (16.1 KB)
IOMMU_disable_ACS_disable.txt (16.1 KB)
The motherboard I'm using is Supermicro's M12SWA-TF. I'm sure the BIOS has been updated to the latest version from the official website, and ACS has been turned off, but my situation is still the same. I'm also not sure whether the 4090 supports P2P; the link you shared runs simpleP2P, and I have run simpleP2P, deviceQuery, and p2pBandwidthLatencyTest. The results are in the attachment.
I don't have this problem with 4 A100s; only the 4090s currently have this problem.
The platform is still the original platform; I only changed the graphics cards.
I have communicated with Supermicro technical staff. They acknowledged the ACS problem, but after checking my settings they confirmed that ACS has been turned off. They asked me to test with 4 A100s, and simpleP2P does indeed run, but when I switch back to the 4090s the problem remains.
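As a sanity check alongside the CUDA samples, PyTorch itself can report whether it believes P2P access is possible between each GPU pair (a sketch using `torch.cuda.can_device_access_peer`; on a machine without two or more visible GPUs it only prints a notice):

```python
import torch

# Query peer-to-peer accessibility for every ordered GPU pair.
# On setups where 4090 P2P is unavailable, these are expected to report False.
n = torch.cuda.device_count() if torch.cuda.is_available() else 0
pairs = []
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            pairs.append((i, j, ok))
            print(f"GPU {i} -> GPU {j}: peer access {'OK' if ok else 'NOT available'}")
if n < 2:
    print("fewer than two CUDA GPUs visible; nothing to check")
```

If this reports no peer access while p2pBandwidthLatencyTest also shows degraded P2P numbers, the two tools agree that the hang is happening on a topology without working P2P.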
Please help me.