PyTorch issue with MIG instances created on A100

We have two NVIDIA A100s installed in a Dell PowerEdge R940, with CUDA 11.6 and driver version 470.129.06. We are using PyTorch for our development.

1.) When MIG devices are created from the root account using sudo nvidia-smi mig -cgi 2g.20gb,2g.20gb -C, PyTorch works under the root account: the CUDA driver initializes and everything works.

The issue is that the MIG devices are not visible to a non-root user, and PyTorch does not detect any GPU under that user.

2.) When I disable MIG (so just using the bare-metal GPUs), PyTorch detects the GPUs.

3.) Below is the error when MIG is enabled, running as a non-root user:

/home/xxxxxx/.conda/envs/rca_rdf_python3.9/lib/python3.9/site-packages/torch/cuda/ UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
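
In case it helps with reproducing this, here is a minimal sketch of the check that can be run once as root and once as the non-root user to compare results. It lists the MIG devices that nvidia-smi -L reports and then asks PyTorch whether CUDA is available. The helper names (parse_mig_uuids, list_devices, main) are made up for this post, and pinning the first MIG UUID through CUDA_VISIBLE_DEVICES is optional, an assumption about how one might select an instance rather than something our setup requires.

```python
import os
import re
import shutil
import subprocess

# Matches MIG identifiers as printed by `nvidia-smi -L`. Intended to cover both
# the older "MIG-GPU-<uuid>/<gi>/<ci>" form and the newer "MIG-<uuid>" form.
MIG_UUID_RE = re.compile(r"UUID:\s*(MIG-[0-9A-Za-z/\-]+)")


def parse_mig_uuids(listing: str) -> list[str]:
    """Return the MIG device identifiers found in `nvidia-smi -L` output."""
    return MIG_UUID_RE.findall(listing)


def list_devices() -> str:
    """Run `nvidia-smi -L`; return an empty string if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return ""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return result.stdout


def main() -> None:
    uuids = parse_mig_uuids(list_devices())
    print("user id:", os.getuid())
    print("MIG devices seen by nvidia-smi:", uuids or "none")

    # Optional (assumption): pin the first MIG instance. This must be set
    # before CUDA is initialized, so it is done before importing torch.
    if uuids:
        os.environ["CUDA_VISIBLE_DEVICES"] = uuids[0]

    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return

    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("torch.cuda.device_count():", torch.cuda.device_count())


if __name__ == "__main__":
    main()
```

If the list of MIG devices is empty for the non-root user but populated for root, the problem would appear to sit below PyTorch, at the driver or device-node level.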

Please advise

We are running Ubuntu 18.04 LTS
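
One further check that may be relevant: as far as we understand, on Linux the driver exposes MIG access through device nodes under /dev/nvidia-caps. Whether the permissions on those nodes explain the non-root failure is only a guess on our part. The sketch below lists each node with its mode string and whether the current user can read it; describe_nodes is a hypothetical helper name, and the default path is an assumption about where the nodes live on this system.

```python
import os
import stat
from pathlib import Path


def describe_nodes(directory: str = "/dev/nvidia-caps") -> list[tuple[str, str, bool]]:
    """List entries in `directory` as (name, mode string, readable by current user).

    Returns an empty list if the directory does not exist.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    rows = []
    for node in sorted(root.iterdir()):
        mode = stat.filemode(node.stat().st_mode)  # e.g. "crw-rw-rw-" for a char device
        readable = os.access(node, os.R_OK)        # evaluated for the calling user
        rows.append((node.name, mode, readable))
    return rows


if __name__ == "__main__":
    rows = describe_nodes()
    if not rows:
        print("no entries found under /dev/nvidia-caps")
    for name, mode, readable in rows:
        print(f"{name}  {mode}  readable_by_current_user={readable}")
```

Running this as root and as the non-root user and comparing the last column would show whether the two accounts see the nodes differently.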