We have two NVIDIA A100s installed in a Dell PowerEdge R940, with CUDA 11.6 and driver version 470.129.06. We are using PyTorch for our development.
1.) When the MIG devices are created as root with sudo nvidia-smi mig -cgi 2g.20gb,2g.20gb -C, PyTorch works under the root account: the CUDA driver initializes and everything works.
2.) The issue is that the MIG devices are not visible to a non-root user, and PyTorch does not detect any GPU under that user.
- When I disable MIG (so just using the bare-metal GPUs), PyTorch detects the GPU.
3.) Below is the error when MIG is enabled and PyTorch is run as a non-root user:
/home/xxxxxx/.conda/envs/rca_rdf_python3.9/lib/python3.9/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
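
To compare the root and non-root cases side by side, here is a minimal diagnostic sketch that can be run under both accounts. It is not a fix. It only prints the current uid, the permissions on the NVIDIA device nodes, and what PyTorch reports. The helper names describe_device_nodes and cuda_status are made up for this example, and the glob patterns /dev/nvidia* and /dev/nvidia-caps/* are an assumption about where the driver places its device nodes on this machine; adjust them if your layout differs.

```python
import glob
import os
import stat


def describe_device_nodes(patterns=("/dev/nvidia*", "/dev/nvidia-caps/*")):
    """Return (path, mode string, readable by current user) for each match.

    The default patterns are an assumption about the device node layout.
    """
    rows = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                mode = stat.filemode(os.stat(path).st_mode)
            except OSError:
                # Node vanished or cannot be stat'ed; record that and move on.
                mode = "??????????"
            rows.append((path, mode, os.access(path, os.R_OK)))
    return rows


def cuda_status():
    """Return (is_available, device_count) from PyTorch, or None if not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_available(), torch.cuda.device_count()


if __name__ == "__main__":
    print("uid:", os.getuid())
    for path, mode, readable in describe_device_nodes():
        print(f"{mode} readable={readable} {path}")
    print("torch (is_available, device_count):", cuda_status())
```

Running this once with sudo and once as the non-root user should show whether any device node is readable by root but not by the regular account, which would be consistent with the behaviour described above.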