Issue with CUDA 11.7 drivers not allowing multiple A30 GPUs to initialize

brandt33 · June 13, 2022, 10:01pm

On a quad-A30 server, the GPUs fail to initialize after installing CUDA 11.7. They show up in lspci, but nvidia-smi finds no devices and they are not available to the TAO Jupyter Notebook. Reinstalling the 11.6.1 drivers solves the problem and lets the Jupyter Notebook run normally.

```
dell@r750xa-cty1mh3:~$ lspci | grep -i NVIDIA
17:00.0 3D controller: NVIDIA Corporation Device 20b7 (rev a1)
65:00.0 3D controller: NVIDIA Corporation Device 20b7 (rev a1)
ca:00.0 3D controller: NVIDIA Corporation Device 20b7 (rev a1)
e3:00.0 3D controller: NVIDIA Corporation Device 20b7 (rev a1)
dell@r750xa-cty1mh3:~$ nvidia-smi
No devices were found
```
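
A few further checks usually help narrow down a "No devices were found" result. The sketch below is not from the original post. It assumes a Linux host with the standard `lsmod` and `dmesg` tools, and the function name `gpu_driver_report` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any NVIDIA tool): print basic driver state.
gpu_driver_report() {
    echo "== Kernel module =="
    # Is the nvidia kernel module loaded at all?
    lsmod 2>/dev/null | grep -i '^nvidia' \
        || echo "nvidia module not loaded (or lsmod unavailable)"

    echo "== Loaded driver version =="
    # This file exists only while the NVIDIA kernel module is loaded.
    cat /proc/driver/nvidia/version 2>/dev/null \
        || echo "no /proc/driver/nvidia/version"

    echo "== Kernel log =="
    # NVRM lines come from the NVIDIA kernel module; dmesg may need root.
    nvrm_lines="$(dmesg 2>/dev/null | grep -i 'NVRM' | tail -n 20 || true)"
    echo "${nvrm_lines:-no NVRM messages readable}"
}

gpu_driver_report
```

If the module is not loaded, or the version reported under `/proc` differs from the driver that was just installed, that points at the installation rather than at the GPUs themselves.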

Robert_Crovella · June 13, 2022, 10:16pm

You’re installing the driver incorrectly. There is not enough information here to diagnose.

You may want to read the Linux install guide. For example, if your previous drivers were installed with a runfile installer, and then you used the package manager method to install CUDA 11.7, that will break things. I’m not suggesting that is what happened here, just pointing out an example of one way it could happen.
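
One rough way to check for that kind of mix is sketched below. It assumes a Debian or Ubuntu system (the thread does not state the distribution) and relies on the runfile driver installer normally leaving an `nvidia-uninstall` binary behind. The function name `nvidia_install_methods`, its optional root-directory argument, and the package-name patterns are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which install methods left traces on this system.
# The optional first argument is a filesystem root, used only to make the
# file check easy to try out against a scratch directory.
nvidia_install_methods() {
    local root="${1:-}"
    local found=""

    # The runfile driver installer normally places an uninstaller here.
    if [ -e "${root}/usr/bin/nvidia-uninstall" ]; then
        found="runfile"
    fi

    # Assumption: Debian/Ubuntu. Installed packages have status "ii" in dpkg -l.
    # The package-name patterns are a guess at common driver package names.
    if dpkg -l 2>/dev/null | grep -Eq '^ii +(nvidia-driver|cuda-drivers)'; then
        found="${found:+$found }package"
    fi

    echo "${found:-none}"
}

nvidia_install_methods
```

If this prints `runfile package`, both methods have left traces on the machine, which is the conflicting situation described above.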

brandt33 · June 14, 2022, 3:24am

Hi Robert, thank you for the quick feedback. Hopefully that’s the case. Knowing that this hasn’t been seen elsewhere will help when we revisit this in the near future. For now I am using the previous driver to run experiments.

After I installed the CUDA drivers and ran into the GPU initialization failure, my colleague reimaged the server and did a clean install with the same result, so we will need to compare notes and go through it step by step again. I will post more detailed logs if it happens again when we retry.
Thanks,
Brandt