NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

Dear all,

I have been having the same problem with a server-based GPU (Quadro P2000) for about a week now.

I am running Ubuntu 22.04 on the VM, and trying to get everything done from the start didn’t work for me, either, as @user36321 mentioned in their latest comment. For reference, I tried installing version 555 of all needed packages via package manager installation, as I have another VM working with the same packages, and sudo apt-get remove --purge nvidia- -TAB- returns the same output (list):

nvidia-compute-utils-555 nvidia-firmware-555-555.42.06 nvidia-kernel-common-555
nvidia-container-toolkit nvidia-fs nvidia-kernel-source-555
nvidia-container-toolkit-base nvidia-fs-dkms nvidia-prime
nvidia-dkms-555 nvidia-gds nvidia-settings
nvidia-driver-555 nvidia-gds-12-5 nvidia-utils-555
nvidia-firmware-550-server-550.90.07 nvidia-gds-12-6

I just realised that there is a new reply on here and applied @GiovannaBM's solution, but to no avail. Can anyone please help me further with this?

Thank you very much!

Hi, I’ve encountered the same problem recently. I had nvidia-driver-525 installed and everything was working fine. After running sudo apt upgrade I couldn’t use my GPU anymore.

I did a purge, autoremove and clean of all NVIDIA drivers; however, the install now fails, with nvidia-dkms --configure returning error 10.

This is my first time reporting a bug. I tried to upload the bug report, but it wasn’t uploading…

nvidia-bug-report.log (2.5 MB)

Hello @nicolaypierre.95 and @chris.nicolaescu , welcome to the NVIDIA developer forums.

chris.nicolaescu, please check out my post from September last year; it contains the correct purge command. You MUST also run it from a terminal, not a graphical desktop environment, and with all NVIDIA kernel modules unloaded.
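The "all NVIDIA kernel modules unloaded" condition can be checked before purging. The following is a minimal sketch, assuming the modules of interest all have names starting with `nvidia` (for example `nvidia`, `nvidia_uvm`, `nvidia_drm`); the function name `loaded_nvidia_modules` is hypothetical and not part of any NVIDIA tool.

```shell
#!/bin/sh
# loaded_nvidia_modules (hypothetical helper)
# Reads `lsmod` output on stdin and prints the name of every loaded
# module whose name starts with "nvidia", one per line.
# Assumption: NVIDIA driver modules are all named nvidia*.
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}
```

Run it as `lsmod | loaded_nvidia_modules`. Empty output suggests that no NVIDIA module is currently loaded; any names printed are modules that would still need to be unloaded first.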

nicolaypierre.95, you very likely have mismatching kernel headers installed for the 545 driver you are trying to install, because the DKMS kernel module compilation fails with missing constant declarations.
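A header mismatch of this kind can be checked by comparing the running kernel release with the installed header packages. This is a minimal sketch, assuming an Ubuntu/Debian system where headers are packaged as `linux-headers-<kernel release>`; the function name `has_matching_headers` is hypothetical.

```shell
#!/bin/sh
# has_matching_headers RELEASE (hypothetical helper)
# Reads `dpkg -l` output on stdin and succeeds only if the package
# linux-headers-RELEASE is listed in the fully installed state ("ii").
# Assumption: Ubuntu-style naming, linux-headers-<kernel release>.
has_matching_headers() {
    awk -v pkg="linux-headers-$1" '
        {
            name = $2
            sub(/:.*$/, "", name)   # drop an architecture suffix such as :amd64
        }
        $1 == "ii" && name == pkg { found = 1 }
        END { exit !found }
    '
}
```

A possible invocation is `dpkg -l 'linux-headers-*' | has_matching_headers "$(uname -r)" && echo "headers match" || echo "headers missing"`. If the headers are reported missing, DKMS has nothing to build the module against for the running kernel.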

Thanks!

Hi @MarkusHoHo ,

First of all, thank you for welcoming me to the forum! Secondly, I have read this whole thread, and I have written my comment accordingly. I have done everything that I could with the information I have for the time being.

As I said in my original post, I did the purge, reinstalled and compared to a working VM’s installed packages (yes, the command used was the one you suggested in your previous, older post). I simply copied what typing sudo apt-get remove --purge nvidia- -TAB- printed in the VM into my message, so that you could see the list of NVIDIA-related packages installed via the package manager.

Further, the Ubuntu server has no interface other than the CLI, and these are remote servers. I am executing ALL commands on them via SSH. It is just that the VMs are separate, and they have two different GPUs (Quadro P2000) attached to each.

Thus, considering the above information, is there any other folder/source/kernel configuration or distinct change that I should check for between the two VMs, so as to at least understand why one has nvidia-smi working and the other doesn’t?

I would suggest that the next step is to run nvidia-bug-report.sh and attach the log here.

Hi Markus, here it is: (unzipped - wanted to check it before I sent it)
nvidia-bug-report.log (160.5 KB)

I am having the same issue now. I rebooted my Ubuntu 22.04 machine on 8/19 and now I get the same error as others here. I have uninstalled everything and reinstalled, but I still get the issue.

Hi @chris.nicolaescu,

Are you running a custom-built kernel? For some reason there is no NVIDIA install log AND there is no log entry of the kernel even trying to load the GPU driver.

I am afraid based on that log I can’t help with this any further. You might want to try posting a fresh thread in the dedicated Linux category.

@cholland “The same error as others” sadly can have many different causes.

Please try the different suggestions you can find in this thread and in the Linux category regarding “nvidia-smi has failed” or similar. Usually a clean driver re-install helps, as does checking that matching kernel header versions are installed.

If nothing helps, please go to the Linux category and read this post: »»»»»»»»»» If you have a problem, PLEASE read this first «««««««««« - #2 by aplattner

Thanks!

Hi again @MarkusHoHo ,

I did not intend to run a custom-built kernel, no. Is there any way to get a kernel upgrade to a good version for the Quadro P2000, so that I can then make everything run smoothly? Do you have any suggestions?

Thank you for noting that! After your last reply to @nicolaypierre.95 , I realised that I might have kernel incompatibility problems, as well!

Kind regards,
Chris

Sorry, but I don’t know enough about Linux to suggest anything in particular. A fresh install of stock Ubuntu is of course an option, but that is the sledgehammer approach, I am afraid.

No problem there! Let’s not forget that we are talking about a VM here, and I need to get this sorted before I do anything else. If you have a specific distribution in mind and/or can find any suggestions in the meantime, they would be much appreciated!

OK, so with a clean run, I got the following report (yes, the error is still the same):
nvidia-bug-report.log (3.0 MB)

The following is my command history since rebuilding the VM:

gcc --version
apt install gcc
reboot
uname -r
sudo apt-get install -y nvidia-headless-550-server-open
sudo apt-get install -y nvidia-open
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get updates
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-5
sudo apt-get install nvidia-gds-12-5
nvidia-smi
apt install nvidia-utils-550-server
nvidia-smi
nvidia-bug-report.sh

I am afraid you cannot use the open-source kernel driver with the P2000. Pascal is a bit too old; the GPU needs to be Turing or newer.

Aug 23 15:03:39 edge-3-mshed-gpu kernel: [ 1535.232457] NVRM: The NVIDIA GPU 0000:00:05.0 (PCI ID: 10de:1c30)
Aug 23 15:03:39 edge-3-mshed-gpu kernel: [ 1535.232457] NVRM: installed in this system is not supported by open
Aug 23 15:03:39 edge-3-mshed-gpu kernel: [ 1535.232457] NVRM: nvidia.ko because it does not include the required GPU
Aug 23 15:03:39 edge-3-mshed-gpu kernel: [ 1535.232457] NVRM: System Processor (GSP).
Aug 23 15:03:39 edge-3-mshed-gpu kernel: [ 1535.232457] NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
Aug 23 15:03:39 edge-3-mshed-gpu kernel: [ 1535.232457] NVRM: Firmware' sections in the driver README, available on
Aug 23 15:03:39 edge-3-mshed-gpu kernel: [ 1535.232457] NVRM: the Linux graphics driver download page at
Aug 23 15:03:39 edge-3-mshed-gpu kernel: [ 1535.232457] NVRM: www.nvidia.com.
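The rejection shown in this log excerpt can be searched for directly instead of reading the whole bug report. This is a minimal sketch, assuming the two-line message format quoted above (the GPU on one line, "not supported by open" on the next); the function name `open_module_rejections` is hypothetical.

```shell
#!/bin/sh
# open_module_rejections (hypothetical helper)
# Reads a kernel log on stdin and prints the PCI address and PCI ID of
# each GPU that the open kernel module refused to drive.
# Assumption: the NVRM message is split over consecutive lines exactly
# as in the excerpt quoted above.
open_module_rejections() {
    awk '
        /NVRM: The NVIDIA GPU / {
            gpu = $0
            sub(/.*NVRM: The NVIDIA GPU /, "", gpu)   # keep "addr (PCI ID: ...)"
            next
        }
        /NVRM: installed in this system is not supported by open/ && gpu != "" {
            print gpu
        }
        { gpu = "" }   # any other line ends the pending message
    '
}
```

It can be fed from `sudo dmesg | open_module_rejections` or `journalctl -k | open_module_rejections`. Any output points to installing the proprietary (non-open) driver flavour for that GPU instead.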

Hello team,

I am facing the same issue in Ubuntu and I have attached the log file. I followed the instructions provided by GCP.
nvidia-bug-report.log.gz (210.9 KB)
Could you please check and advise how to resolve this? Thanks.

Hi there @s_esakkiappan, welcome to the NVIDIA developer forums.

Your log shows a simple explanation:

Aug 28 05:44:48 dev-luminar-gpu kernel: [ 1453.712828] NVRM: API mismatch: the client has the version 560.35.03, but
Aug 28 05:44:48 dev-luminar-gpu kernel: [ 1453.712828] NVRM: this kernel module has the version 535.183.01.  Please
Aug 28 05:44:48 dev-luminar-gpu kernel: [ 1453.712828] NVRM: make sure that this kernel module and all NVIDIA driver
Aug 28 05:44:48 dev-luminar-gpu kernel: [ 1453.712828] NVRM: components have the same version.
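The two version numbers in this message can be extracted from a kernel log mechanically, which helps when checking several machines. This is a minimal sketch, assuming the message wording and line breaks quoted above; the function name `nvrm_mismatch_versions` is hypothetical.

```shell
#!/bin/sh
# nvrm_mismatch_versions (hypothetical helper)
# Reads a kernel log on stdin and prints the user-space (client) and
# kernel-module driver versions named in NVRM "API mismatch" messages.
# Assumption: the message is worded and wrapped as in the excerpt above.
nvrm_mismatch_versions() {
    sed -n \
        -e 's/.*NVRM: API mismatch: the client has the version \([0-9.]*[0-9]\).*/client=\1/p' \
        -e 's/.*NVRM: this kernel module has the version \([0-9.]*[0-9]\).*/module=\1/p'
}
```

For the excerpt above, `sudo dmesg | nvrm_mismatch_versions` would print `client=560.35.03` and `module=535.183.01`. The usual remedy is to bring both to the same version, either by reinstalling matching packages or by rebooting so that a freshly installed kernel module is actually loaded.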

For further help, please work with GCP directly; they should be able to help you here, especially since you are working on a DGX system with 8 A100s, which definitely also needs other special considerations.

Thanks!


This thread is deviating from the original post in 2023, so I am closing it. If you want to ask similar questions, please open another post, ideally in the Linux category. Thanks!