I had a working .deb package-manager (PM) install of CUDA 8.0 on Ubuntu 16.04. Later, I added ppa:graphics-drivers/ppa so the drivers could be updated through apt-get.
I upgraded the graphics driver (probably to version 378) with apt-get upgrade. This broke CUDA, nvidia-smi, etc.
I decided to do a full clean reinstall of the CUDA and NVIDIA driver stack to fix this.
The CUDA and NVIDIA stack was removed with `apt-get autoremove --purge cuda* nvidia-*`, and I manually removed all remaining configuration files and directories.
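For reference, the cleanup was roughly along these lines (a sketch; package globs and leftover paths will vary by system):

```
# Purge every CUDA/NVIDIA package, whatever repo it came from
sudo apt-get autoremove --purge 'cuda*' 'nvidia-*'

# Confirm nothing is left behind
dpkg -l | grep -iE 'cuda|nvidia'

# Then manually remove leftover config/directories, e.g. /usr/local/cuda* if present
```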
I redid the CUDA .deb install as described in the install guide (Installation Guide Linux :: CUDA Toolkit Documentation), but I am still getting the same errors.
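The reinstall followed the PM route from the guide, roughly as below (the local-repo .deb filename is just an example from the CUDA 8.0 download page; adjust it to whatever file you actually downloaded):

```
# Register the local CUDA repo and install (example filename, adjust to your download)
sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
```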
This is what `cat /proc/driver/nvidia/version` currently reports:
Errors:
nvidia-smi: Failed to initialize NVML: Driver/library version mismatch
deviceQuery: FAILS
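From what I understand, the mismatch means the loaded kernel module and the user-space driver libraries come from different driver versions. These are the checks I am using to compare the two (a sketch, nothing system-specific):

```
# Version of the kernel module that is currently loaded
cat /proc/driver/nvidia/version

# Versions of the user-space driver packages installed through apt
dpkg -l | grep -E '^ii.*nvidia'

# Which libcuda the dynamic linker resolves to
ldconfig -p | grep libcuda
```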
My questions are the following:
Are there incompatibilities between the graphics-drivers ppa repo and CUDA .deb install?
NVIDIA does not mention the required, or even preferred, type of driver install in the install docs.
What is the most stable way to install: .run-file graphics driver plus .run-file CUDA, or both via the package-manager .deb method?
Is there a way I can use the graphics-drivers/ppa with CUDA or is this ill-advised?
Can I fix this mess without rebooting the server?
My best guess is to remove the graphics-drivers PPA, purge the whole stack again, and do a full reinstall with the .deb files.
There can be, yes. Those driver sources are not maintained by the same group. NVIDIA drivers can be packaged in a variety of ways, for a variety of purposes.
This is because there is no preferred method. Both methods have strengths and weaknesses, and both serve particular purposes. The key is not to mix the runfile install method and the PM install method, which I think is pretty evident from the guide.
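If in doubt about which method a given machine is currently on, a rough check might look like this (the runfile test relies on the uninstaller the .run installer typically drops in /usr/bin):

```
# Package-manager installs show up in dpkg
dpkg -l | grep -iE 'nvidia|cuda'

# A runfile driver install typically leaves its own uninstaller behind
ls /usr/bin/nvidia-uninstall 2>/dev/null && echo "runfile driver install present"
```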
It might work, it might not. YMMV. Since NVIDIA does not control those packages, anything is possible, and an answer applicable to one particular driver may not be applicable to another; an answer applicable today may not be applicable tomorrow. Given that uncertainty, the strong recommendation, at least for CUDA usage, is to install via the instructions in the Linux install guide, using binaries provided by NVIDIA, and that rules out packages from other sources.
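If you do decide to drop back to NVIDIA-provided packages only, removing the PPA before purging and reinstalling might look like this (ppa-purge is optional; it simply downgrades whatever was installed from the PPA):

```
# Option A: just stop pulling driver packages from the third-party PPA
sudo add-apt-repository --remove ppa:graphics-drivers/ppa
sudo apt-get update

# Option B: disable the PPA and downgrade anything installed from it in one step
sudo apt-get install ppa-purge
sudo ppa-purge ppa:graphics-drivers/ppa
```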
That would be my best guess too. Purge all NVIDIA GPU-related packages, regardless of source, and start over strictly following the method laid out in the CUDA Linux install guide, using binaries either from www.nvidia.com or the CUDA Toolkit downloads page on NVIDIA Developer.
Well, I started doing a full .run-file-based reinstall, but the catch is that there does not seem to be a way to install the driver without rebooting. That is a problem because we cannot reboot our R&D server right now, but that is another issue.
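One thing I may still try, in case it helps anyone else: the version-mismatch error can reportedly be cleared without a reboot by unloading the stale kernel modules so the freshly installed ones get loaded on next use. This only works if nothing is holding the GPU (no X server, no running CUDA processes). A rough sketch:

```
# See which NVIDIA modules are loaded and what is using them
lsmod | grep nvidia

# Unload in dependency order (skip any that are not loaded; fails if the GPU is in use)
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia

# The next nvidia-smi call (or CUDA app) should load the freshly installed module
nvidia-smi
```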