Hi! Looking to get some help on the forum as we’ve read many solutions but couldn’t get any to work :(
We have been attempting to install CUDA on our server with an NVIDIA A100 GPU, but to no avail. Below is our configuration:
CPU: AMD Threadripper PRO 7975
MB: AMD WRX90
RAM: DDR5-4800, 256 GB
GPU: A100 80 GB x1, GT 730 x1
SSD: Samsung 990 Pro 2 TB
OS: Ubuntu 22.04 LTS
CUDA version required: 11.8
NVIDIA drivers attempted:
545
535
520
BIOS settings:
Safe boot → disabled
Fast boot → disabled
Above 4G decoding → enabled
Resize BAR support → enabled
Our problem is that when we verify the installation with the nvidia-smi command:
nvidia-smi
We receive the following error and have not been able to get past it:
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
The nvidia-bug-report.log file contains the following notable lines:
kernel: [ 22.932086] NVRM: The NVIDIA probe routine failed for 1 device(s).
kernel: [ 22.932087] NVRM: None of the NVIDIA devices were initialized.
kernel: [ 22.932241] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
kernel: [ 23.158996] nvidia-nvlink: Nvlink Core is being initialized, major device
kernel: [ 23.159003] NVRM: This PCI I/O region assigned to your NVIDIA device is
kernel: [ 23.159003] NVRM: BAR0 is 0M @ 0x0 (PCI:0000:e1:00.0)
*** ls: ls: cannot access ‘/sys/class/drm/*/device/driver’: No such file or directory
/dev/dri not present
/usr/bin/nvidia-debugdump -D
Error: nvmlInit(): Driver Not Loaded
Skipping vulkaninfo output (vulkaninfo not found)
/usr/bin/nvidia-smi --query
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
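For anyone triaging a similar log, here is a minimal Python sketch that scans kernel log text for the NVRM BAR report and flags any BAR shown with size 0M at address 0x0, as in the BAR0 line above. The function name find_unassigned_bars and the regular expression are assumptions derived only from the one line format visible in this excerpt; this is not an NVIDIA tool and other driver versions may print the line differently.

```python
import re

# Pattern assumed from the single line in the excerpt above:
#   "NVRM: BAR0 is 0M @ 0x0 (PCI:0000:e1:00.0)"
BAR_LINE = re.compile(
    r"NVRM: BAR(?P<index>\d+) is (?P<size>\d+)M @ 0x(?P<addr>[0-9a-fA-F]+) "
    r"\(PCI:(?P<pci>[0-9a-fA-F:.]+)\)"
)


def find_unassigned_bars(log_text):
    """Return (pci_address, bar_index) pairs whose BAR has size 0 and address 0."""
    hits = []
    for line in log_text.splitlines():
        match = BAR_LINE.search(line)
        if match is None:
            continue
        size_mb = int(match.group("size"))
        address = int(match.group("addr"), 16)
        # A zero-sized BAR at address 0 means no PCI memory region was assigned.
        if size_mb == 0 and address == 0:
            hits.append((match.group("pci"), int(match.group("index"))))
    return hits


sample = "kernel: [ 23.159003] NVRM: BAR0 is 0M @ 0x0 (PCI:0000:e1:00.0)"
print(find_unassigned_bars(sample))
```

Run against the full nvidia-bug-report.log text, this would list every device and BAR index that the driver reported without an assigned PCI memory region.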
Interestingly, when we uninstall the NVIDIA driver and install nvidia-driver-470 instead, nvidia-smi works and reports the driver version and CUDA version correctly. However, since we need CUDA 11.8 or above, we have to use nvidia-driver-520 or newer, and the error above occurs with every one of those versions.
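The version requirement described above can be written as a small check. This is a sketch only: the helper name driver_meets_minimum is hypothetical, the 520 threshold is taken from this post rather than from NVIDIA's compatibility tables, and the version strings are illustrative examples of what nvidia-smi --query-gpu=driver_version --format=csv,noheader prints on a system where the driver loads.

```python
def driver_meets_minimum(driver_version, minimum_branch=520):
    """Return True if the driver's major branch number is at least minimum_branch.

    driver_version is a dotted string such as "535.129.03"; only the part
    before the first dot (the driver branch) is compared.
    """
    major = int(driver_version.strip().split(".")[0])
    return major >= minimum_branch


# Illustrative version strings, not output captured from the machine in this post.
for version in ("470.223.02", "535.129.03"):
    print(version, driver_meets_minimum(version))
```

On the machine described here the check is only useful with the 470 driver, since nvidia-smi cannot report a version when the newer drivers fail to load.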
Another thing to note: we originally had a second graphics card (the GT 730) installed alongside the A100. The GT 730 is compatible with nvidia-driver-470, which at some point led us to suspect that having two graphics cards was the culprit. However, we also tried the following, to no avail:
Unplug the GT 730 graphics card
Remove all traces of the NVIDIA driver
Reboot the machine
Install nvidia-driver-520