CUDA Drivers Fail in Multiple Ways After Fresh Install (Linux)


I’ve been trying to fix my CUDA drivers on a CentOS 7 system for the past day with no luck. I followed the uninstall and fresh install instructions from here. Everything installs without errors; however, none of the commands you would expect to work afterwards do. The toolkit itself seems fine: I can run nvcc -V and similar commands. Driver-level commands such as nvidia-smi and nvidia-settings do not work. The output of each command I tried is below:

cat /proc/driver/nvidia/version → No such file or directory
cat /sys/module/nvidia/version → No such file or directory
nvidia-smi → NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
nvidia-settings → ERROR: The control display is undefined; please run nvidia-settings --help for usage information.
dkms status nvidia → nvidia/535.86.10: added
nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

The weirdest part to me is that I can’t grab the driver version with the first two commands. Before I uninstalled the old version these files existed and the clean install did not re-create them.

Any help would be greatly appreciated! Thanks.

First, if you have not already done so, reboot and check again.
Then, if you still get the same results, the driver installation has failed for some reason.
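
One possible reading of the output in the question, offered as a sketch rather than a diagnosis: /proc/driver/nvidia/version only exists while the nvidia kernel module is loaded, and DKMS reports a module as added, built, or installed. A state of "added" would mean the driver source is registered but no module has been built for the running kernel. The helper names below (module_loaded, dkms_state, explain_state) are made up for illustration.

```shell
# True if the named kernel module appears in a /proc/modules-style listing.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

# Extract the state from one line of `dkms status` output,
# e.g. "nvidia/535.86.10: added" -> "added".
dkms_state() {
    printf '%s\n' "${1##*: }"
}

# Rough meaning of each DKMS state for a driver module.
explain_state() {
    case "$1" in
        added)      echo "source registered, module not built" ;;
        built)      echo "module built, not yet installed" ;;
        installed*) echo "module installed for that kernel" ;;
        *)          echo "unrecognized state: $1" ;;
    esac
}

if module_loaded nvidia; then
    echo "nvidia module is loaded"
else
    echo "nvidia module is not loaded"
fi

# The status line reported in the question:
explain_state "$(dkms_state 'nvidia/535.86.10: added')"
```

If the state really is "added", running sudo dkms install -m nvidia -v 535.86.10 should attempt the build and install, and any failure it prints is usually the quickest pointer to the underlying problem.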

With the runfile install method, there is a log file we can inspect for driver install failures. Since you used the package manager install method, it’s important to inspect the output that is printed to the console during the package manager install (yes, it’s voluminous). And you have to know what to look for; casual skimming may miss errors. I don’t have a guide to offer on diagnosing a failed package manager install after the fact, once the install spew is gone. In some cases, studying dmesg output may help. There may also be dkms output to study, but I don’t have a guide to point you to.
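
As a rough sketch of the dmesg and dkms checks mentioned above: the filter below keeps only kernel messages that mention the NVIDIA driver, then looks for error lines in the DKMS build log. The log path assumes the usual DKMS layout (/var/lib/dkms/<module>/<version>/build/make.log) and the version from the dkms status output in the question; the file may not exist if a build was never attempted. The function name nvidia_kernel_msgs is made up for illustration.

```shell
# Keep only kernel-log lines that mention the NVIDIA driver.
# "NVRM" is the prefix the NVIDIA kernel module uses for its messages.
nvidia_kernel_msgs() {
    grep -i -E 'nvidia|nvrm'
}

# Kernel messages from the current boot (may need root on some systems).
dmesg 2>/dev/null | nvidia_kernel_msgs \
    || echo "no NVIDIA-related kernel messages found (or dmesg not readable)"

# DKMS build log for the version shown by `dkms status` in the question.
log=/var/lib/dkms/nvidia/535.86.10/build/make.log
if [ -r "$log" ]; then
    grep -n -i -E 'error|fail' "$log" \
        || echo "no 'error' or 'fail' lines in $log"
else
    echo "no readable DKMS build log at $log"
fi
```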

I have rebooted.

Would you recommend trying the rpm or runfile version instead of the package manager to see if I get different results? Otherwise I’ll go digging through log files.

rpm is a package manager method, of sorts. If the package manager method (yum install cuda …) doesn’t work, I personally wouldn’t bother trying any more detailed rpm methods. Do as you wish, of course. If you are a packaging expert, it may be second nature for you.

The runfile method has some advantages, e.g. the log file I mentioned. So one approach would be to clean up again and try the runfile method. If it works, great. If not, inspect the logs.

Otherwise you could try the package manager method after doing a clean install of CentOS 7. Yes, that isn’t pleasant. Starting with a “fresh” environment often helps with package manager install issues. The package manager method inherently depends on machine history. The runfile method can be slightly less dependent on machine history for a reliable install (but you do need to do the CUDA cleanup already described in the manual you linked). If you choose the package manager method, either study the install spew carefully, or copy and paste it out of your terminal session so you can post it in a forum question if needed.
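
One way to keep the install output without copying it out of the terminal by hand is to pipe it through tee. This is only a sketch; the function name run_logged and the log file name are arbitrary choices, not part of any NVIDIA tooling.

```shell
# Run a command, showing its output live while also saving both
# stdout and stderr to a log file for later inspection.
# Note: the function's exit status is tee's, not the command's.
run_logged() {
    logfile="$1"
    shift
    "$@" 2>&1 | tee "$logfile"
}

# Usage on the real system:
#   run_logged cuda-install.log sudo yum install cuda
#   grep -n -i -E 'error|fail|warning' cuda-install.log
```

The saved file can then be searched at leisure or attached to a forum post.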

I’m also assuming you did yum install cuda and not yum install cuda-toolkit (the cuda-toolkit package installs the toolkit only, without the driver).

You might also indicate what GPU you are using. It’s possible your GPU is old and not supported by the R535 driver.

It’s 8 2800 Supers. I’m pretty sure they were compatible when I checked.

Runfile gives an error at least:

ERROR: Unable to find the kernel source tree for the currently running kernel. Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed. If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.

I’ll do some googling but if you’ve seen that before let me know. It seems to imply that:
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
was not run but I have done that.

Yes, that is what it means. On a non-fresh install it can sometimes be tricky to get the proper kernel headers installed in a known or discoverable place. I’ve done numerous CentOS 7 installs without having that problem, but then I’ve never updated the kernel on those systems, for example.
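
A quick way to check whether a build tree exists for the running kernel is sketched below. It assumes the usual RHEL/CentOS layout, where kernel-devel provides /usr/src/kernels/<release> and /lib/modules/<release>/build is a symlink to it. The function name check_kernel_tree and its optional second argument are made up for illustration.

```shell
# Report whether a kernel build tree exists for a given kernel release.
# A dangling /lib/modules/<release>/build symlink counts as missing.
check_kernel_tree() {
    release="$1"
    root="${2:-}"    # optional path prefix, only useful for testing
    build="$root/lib/modules/$release/build"
    if [ -e "$build/Makefile" ]; then
        echo "ok: $build"
    else
        echo "missing: $build"
        return 1
    fi
}

# Check the kernel that is running right now.
check_kernel_tree "$(uname -r)" || true

# On the real system, compare the installed package versions as well:
#   rpm -q kernel kernel-devel kernel-headers
```

If the tree is reported missing even though kernel-devel is installed, one thing to compare is the kernel-devel version against uname -r: after a kernel update without a reboot, or when the repositories only carry a newer kernel-devel, the two can differ.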