NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

I got the error “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”

When I listed the available drivers with ~$ ubuntu-drivers list, the result was as follows.

nvidia-driver-535-open, (kernel modules provided by linux-modules-nvidia-535-open-oem-22.04c)
nvidia-driver-525-server, (kernel modules provided by nvidia-dkms-525-server)
nvidia-driver-525-open, (kernel modules provided by linux-modules-nvidia-525-open-oem-22.04c)
nvidia-driver-525, (kernel modules provided by linux-modules-nvidia-525-oem-22.04c)
nvidia-driver-535-server, (kernel modules provided by nvidia-dkms-535-server)
nvidia-driver-535, (kernel modules provided by linux-modules-nvidia-535-oem-22.04c)
nvidia-driver-535-server-open, (kernel modules provided by nvidia-dkms-535-server-open)
oem-fix-misc-cnl-backport-iwlwifi-helper
oem-somerville-muk-meta
libcamhal-ipu6ep0

When I checked the current compilation tools with ~$ nvcc -V, the result was as follows.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

My current setup is Ubuntu 22.04.3 LTS, with VGA compatible controller: NVIDIA Corporation Device 27bb (rev a1)

Could you tell me how to get “nvidia-smi” working?

Best regards,

Hello @user36321 and welcome to the NVIDIA developer forums.

Please run nvidia-bug-report.sh and attach the output to this thread.

Thanks!

Thank you for the reply.

When I ran “sudo nvidia-bug-report.sh”, the output was as follows.

nvidia-bug-report.sh will now collect information about your
system and create the file ‘nvidia-bug-report.log.gz’ in the current
directory. It may take several seconds to run. In some
cases, it may hang trying to capture data generated dynamically
by the Linux kernel and/or the NVIDIA kernel module. While
the bug report log file will be incomplete if this happens, it
may still contain enough data to diagnose your problem.

If nvidia-bug-report.sh hangs, consider running with the --safe-mode
and --extra-system-data command line arguments.

Please include the ‘nvidia-bug-report.log.gz’ log file when reporting
your bug via the NVIDIA Linux forum (see forums.developer.nvidia.com)
or by sending email to ‘linux-bugs@nvidia.com’.

By delivering ‘nvidia-bug-report.log.gz’ to NVIDIA, you acknowledge
and agree that personal information may inadvertently be included in
the output. Notwithstanding the foregoing, NVIDIA will use the
output only for the purpose of investigating your reported issue.

Running nvidia-bug-report.sh… complete.

Can you find anything in this that helps solve the problem?

Best regards,

nvidia-bug-report.log.gz (134.8 KB)

Sorry for my misunderstanding.
I have attached the output file.
Can you tell anything from this?

Best regards,

Have a look at this excerpt from /var/log/kern.log:
Aug 22 10:35:10 taihi-Precision-5680 kernel: [    3.899959] nvidia-nvlink: Nvlink Core is being initialized, major device number 505
Aug 22 10:35:10 taihi-Precision-5680 kernel: [    3.951518] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.105.17  Tue Mar 28 18:02:59 UTC 2023
Aug 22 10:35:10 taihi-Precision-5680 kernel: [    3.977321] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.105.17  Tue Mar 28 22:18:37 UTC 2023
Aug 22 10:35:10 taihi-Precision-5680 kernel: [    4.003195] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Aug 22 10:35:11 taihi-Precision-5680 kernel: [    4.947435] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
Aug 22 10:35:11 taihi-Precision-5680 kernel: [    5.149596] nvidia-uvm: Loaded the UVM driver, major device number 503.
[...]
Aug 22 10:44:54 taihi-Precision-5680 kernel: [  588.333187] NVRM: API mismatch: the client has the version 525.125.06, but
Aug 22 10:44:54 taihi-Precision-5680 kernel: [  588.333187] NVRM: this kernel module has the version 525.105.17.  Please
Aug 22 10:44:54 taihi-Precision-5680 kernel: [  588.333187] NVRM: make sure that this kernel module and all NVIDIA driver
Aug 22 10:44:54 taihi-Precision-5680 kernel: [  588.333187] NVRM: components have the same version.
Aug 22 10:58:46 taihi-Precision-5680 kernel: [    5.127135] nvidia-nvlink: Nvlink Core is being initialized, major device number 505
Aug 22 10:58:46 taihi-Precision-5680 kernel: [    5.176767] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.86.05  Fri Jul 14 20:46:33 UTC 2023
Aug 22 10:58:46 taihi-Precision-5680 kernel: [    5.197558] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.86.05  Fri Jul 14 20:20:58 UTC 2023
Aug 22 10:58:46 taihi-Precision-5680 kernel: [    5.230160] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Aug 22 10:58:47 taihi-Precision-5680 kernel: [    6.144563] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
Aug 22 10:58:47 taihi-Precision-5680 kernel: [    6.454275] nvidia-uvm: Loaded the UVM driver, major device number 503.
Aug 22 10:58:49 taihi-Precision-5680 kernel: [    7.624670] NVRM: API mismatch: the client has the version 535.98, but
Aug 22 10:58:49 taihi-Precision-5680 kernel: [    7.624670] NVRM: this kernel module has the version 535.86.05.  Please
Aug 22 10:58:49 taihi-Precision-5680 kernel: [    7.624670] NVRM: make sure that this kernel module and all NVIDIA driver
Aug 22 10:58:49 taihi-Precision-5680 kernel: [    7.624670] NVRM: components have the same version.

You have 4 different NVIDIA drivers installed. I think you should fix that first.
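To see this for yourself, here is a minimal sketch that pulls every distinct NVIDIA version string out of a kernel log. The function name nvidia_versions_in_log is made up for illustration, and the patterns are only assumed to cover the NVRM and nvidia-modeset message formats shown in the excerpt above; other driver releases may word these lines differently.

```shell
# Print each distinct NVIDIA version mentioned in kernel log text read from stdin.
# Hypothetical helper; it only relies on grep, awk and sort.
nvidia_versions_in_log() {
    # Keep only lines written by the NVIDIA kernel modules, so that
    # unrelated "version" strings (e.g. the kernel banner) are ignored.
    grep -E 'NVRM|nvidia-modeset' \
        | grep -oE '(version|Kernel Module|platforms) +[0-9]+\.[0-9]+(\.[0-9]+)?' \
        | awk '{ print $NF }' \
        | LC_ALL=C sort -u
}

# Possible usage (reading the log may require sudo on some systems):
#   nvidia_versions_in_log < /var/log/kern.log
```

Fed the excerpt above, this would list 525.105.17, 525.125.06, 535.86.05 and 535.98. More than one line of output suggests that kernel modules and user-space components from different driver versions have been mixed.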

Please purge ALL NVIDIA drivers cleanly. That means reboot into console mode, unload all NVIDIA related kernel modules, then remove ALL NVIDIA related driver packages. The README file of the downloadable driver packages has detailed instructions and a list of files to look for.

Then reboot again, again into console mode.

Then install ONE correct driver for your system. Either install the recommended proprietary NVIDIA driver through the Ubuntu Software Center’s Third Party application tab, or download it as a .run file directly from NVIDIA’s Official Drivers page. The current version should be 535.104.05 as we speak.

During installation make sure to follow instructions exactly, especially if you have secure boot enabled and need to authenticate the driver.

I hope this will help you resolve your issues.

Thanks!

Sorry for the late reply. Is “sudo apt-get purge nvidia-*” the right command to remove the NVIDIA driver cleanly?
After running it, “cat /proc/driver/nvidia/version” reports that no such file or directory exists.
Does this mean the driver has been removed cleanly?
Sorry for the amateur question.

No worries, and definitely not an amateur question.

First make sure you are not using any driver modules. The Linux installation guide has a paragraph “Before you begin” which you should follow, even for the purge.

Then make sure no NVIDIA modules are loaded anymore. Use lsmod | grep nvidia to check that. If there are still modules loaded, unload them with modprobe -r or rmmod.
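As a rough sketch of that check: the helper below filters lsmod-style output down to the module names that start with "nvidia". The function name is hypothetical, and the unload order in the comment is an assumption based on the usual dependency chain of the NVIDIA modules, so adjust it to whatever your own lsmod output shows.

```shell
# Print the names of loaded kernel modules whose name starts with "nvidia".
# Reads lsmod-style text on stdin; illustrative helper, not part of any driver package.
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}

# Possible usage:
#   lsmod | loaded_nvidia_modules
# If anything is printed, unload dependants before the core module, for example
# (assumed order, only list the modules you actually see):
#   sudo modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia
```

No output from the helper means no NVIDIA kernel module is currently loaded.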

After that do the purge command you mentioned.

I hope that helps!

When I run “lsmod | grep nvidia”, there is no output, so I understand that no NVIDIA module is loaded.

“sudo apt update”
“sudo apt install nvidia-driver-535”
Should I avoid installing the driver this way?
After running “sudo apt update”, it says that 36 packages can be upgraded. Will it be a problem if I leave them as they are?

I saw a post saying it is better not to use .run files. Is using one a problem?

In fact, I tried running the downloaded .run file with “sudo sh”, but an error message told me to stop the X server before installation, so I couldn’t complete it. It would be helpful if you could tell me the installation steps.

You can use the sudo apt version to install the driver. Either that or the installation through the Third-Party Apps tab of the Software Center in Ubuntu. Both will install the driver package included as part of the distribution, which should work without issues.

If you use the software center option, make sure to use the proprietary driver and NOT Open Source kernel modules.

If you use the sudo apt version, you need to be in terminal console mode only; otherwise you will likely get the same X server error message.

After installation make sure to reboot! But I mentioned this already Aug 29th.

  • When I input “ubuntu-drivers devices”, the following was returned.
    It seems that the extra driver has not been deleted yet. Could you please tell me the correct way to delete it?

== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias: pci:v000010DEd000027BBsv00001028sd00000C11bc03sc00i00
vendor: NVIDIA Corporation
driver : nvidia-driver-525-open - third-party non-free
driver : nvidia-driver-525 - third-party non-free
driver : nvidia-driver-535 - third-party non-free recommended
driver : nvidia-driver-535-open - third-party non-free
driver : xserver-xorg-video-nouveau - distro free builtin

== /sys/devices/pci0000:00/0000:00:14.3 ==
modalias: pci:v00008086d000051F1sv00008086sd00004090bc02sc80i00
vendor : Intel Corporation
driver : oem-fix-misc-cnl-backport-iwlwifi-helper - third-party free

== /sys/devices/pci0000:00/0000:00:1f.4 ==
modalias : pci:v00008086d000051A3sv00001028sd00000C11bc0Csc05i00
vendor : Intel Corporation
driver : oem-somerville-muk-meta - third-party free

== /sys/devices/pci0000:00/0000:00:05.0 ==
modalias: pci:v00008086d0000A75Dsv00001028sd00000C11bc04sc80i00
vendor : Intel Corporation
driver : libcamhal-ipu6ep0 - third-party free

・“You can use the sudo apt version to install the driver.”
→Does this refer to the command “sudo apt install nvidia-driver-535”?

・“If you use the software center option, make sure to use the proprietary driver and NOT Open Source kernel modules.”
→I don’t understand this part, so could you please explain it in more detail?

・Is terminal console mode a screen that can be opened with Ctrl+Alt+F4?

Hello again.

No, the command ubuntu-drivers devices only lists the drivers that are available as part of the Ubuntu distribution and that can be installed using the Ubuntu package manager. It does not tell you which driver version actually is installed. If nvidia-smi works as expected, it will show the currently installed driver version.

If you installed the driver through the Ubuntu package manager, you should be able to check with sudo apt list --installed to see which driver version was installed.
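A small sketch of how that list could be narrowed down to NVIDIA packages. The helper name is hypothetical, and it assumes the usual "name/suite version architecture [status]" line layout of apt list, which apt itself warns is not a stable interface.

```shell
# Reduce `apt list --installed` output (read from stdin) to "package version"
# pairs for every package with "nvidia" in its name. Illustrative helper only.
installed_nvidia_packages() {
    # Split on "/" and " ": field 1 is the package name, field 3 the version.
    awk -F'[/ ]' '$1 ~ /nvidia/ { print $1, $3 }'
}

# Possible usage (stderr is dropped to hide apt's "no stable CLI" warning):
#   apt list --installed 2>/dev/null | installed_nvidia_packages
```

If the versions printed do not all belong to the same driver release, that is a hint that several driver packages are installed side by side.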

If you used a different installation option, then one way to check that only one driver version is present is to look in /usr/lib/ or /usr/lib/x86_64-linux-gnu and check the files and symbolic links with libnvidia in their names to see if they only contain one version number. For example, on one of my systems this looks like this in /usr/lib/x86_64-linux-gnu:

...
lrwxrwxrwx     1 root root                  26 libnvidia-cfg.so.1 -> libnvidia-cfg.so.525.85.05
lrwxrwxrwx     1 root root                  26 libnvidia-cfg.so.525.85.05
...

and 525.85.05 is the only file version present.
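The same check can be sketched as a small helper that prints the distinct version suffixes it finds. The function name is made up, and the default directory is simply the one from my example; pass another directory if your libraries live elsewhere.

```shell
# Print the distinct driver versions found in libnvidia-*.so.<version> file names.
# Hypothetical helper; the directory argument defaults to /usr/lib/x86_64-linux-gnu.
libnvidia_versions() {
    dir="${1:-/usr/lib/x86_64-linux-gnu}"
    # Names such as libnvidia-cfg.so.1 are skipped because their suffix has no
    # dot-separated driver version.
    ls "$dir" 2>/dev/null \
        | sed -nE 's/^libnvidia-.*\.so\.([0-9]+\.[0-9]+(\.[0-9]+)?)$/\1/p' \
        | LC_ALL=C sort -u
}

# Possible usage:
#   libnvidia_versions
#   libnvidia_versions /usr/lib
```

One line of output is what you want to see. Several lines mean that libraries from more than one driver version are still on disk.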

Yes. But the recommended way is through “Software & Updates - Additional Drivers”.

There are driver packages with open in their names. Do NOT install those. Use “proprietary, tested”.

Yes and no. It will look the same, but if you simply switch from the window manager to the console window, the graphics driver is still loaded and cannot easily be replaced without issues.
You need to reboot directly into this terminal-only, text-only mode, not into graphical mode, so that no graphics drivers are loaded. You can find lots of instructions on that topic online.

Hello

Sorry for the late reply.

I tried to install nvidia-driver-535 through Software & Updates, as shown in the attachment, but I still cannot run “nvidia-smi”.

In console mode, I ran “sudo apt-get purge nvidia-*”, but the result was as follows.
Loading package list… Done
Creating dependency tree… Done
Reading status information… Done
E: Package nvidia-bug-report.log not found
E: No matching packages found for ‘nvidia-bug-report.log’
E: No packages found with regular expression ‘nvidia-bug-report.log’
E: Package nvidia-bug-report.log.gz not found
E: No matching packages found for ‘nvidia-bug-report.log.gz’
E: No packages were found with regular expression ‘nvidia-bug-report.log.gz’

Is my method of purging the NVIDIA driver wrong?

I am lost. The apt-get command will check in known package names for any that match the expression nvidia-*, but it will never match the name of a local file. Are you sure you used the command as you wrote it and not accidentally by listing the local file names? You could also try this instead:

sudo apt-get remove --purge '^nvidia-.*'
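One possible explanation for the errors above, offered as an assumption since I cannot see your terminal: when an unquoted pattern such as nvidia-* matches files in the current directory, the shell replaces it with those file names before apt-get ever runs. The error messages name nvidia-bug-report.log and nvidia-bug-report.log.gz, which would fit a command run from the directory where the bug report was saved. The sketch below uses a made-up helper, show_args, to display what a command actually receives.

```shell
# Print each argument on its own line, exactly as the shell passed it in.
# Hypothetical helper used only to make glob expansion visible.
show_args() {
    printf '%s\n' "$@"
}

# Possible usage, in a directory that contains nvidia-bug-report.log
# and nvidia-bug-report.log.gz:
#   show_args nvidia-*      # the shell expands the pattern to the two file names
#   show_args 'nvidia-*'    # quoting passes the literal pattern nvidia-*
```

Quoting the pattern, as in the command above, keeps the shell from touching it, so apt-get receives the expression itself.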

Hello,

When I ran “sudo apt-get remove --purge '^nvidia-.*'”, I didn’t see any error messages, so I think the purge worked. However, I saw the following messages after reboot.

taihi@taihi-Precision-5680:~$ nvidia-smi
Command ‘nvidia-smi’ not found. It can be installed with:
sudo apt install nvidia-utils-390 # version 390.157-0ubuntu0.22.04.1, or
sudo apt install nvidia-utils-418-server # version 418.226.00-0ubuntu5~0.22.04.1
sudo apt install nvidia-utils-450-server # version 450.236.01-0ubuntu0.22.04.1
sudo apt install nvidia-utils-470 # version 470.182.03-0ubuntu0.22.04.1
sudo apt install nvidia-utils-470-server # version 470.182.03-0ubuntu0.22.04.1
sudo apt install nvidia-utils-510 # version 510.108.03-0ubuntu0.22.04.1
sudo apt install nvidia-utils-515 # version 515.105.01-0ubuntu0.22.04.1
sudo apt install nvidia-utils-515-server # version 515.105.01-0ubuntu0.22.04.1
sudo apt install nvidia-utils-525 # version 525.105.17-0ubuntu0.22.04.1
sudo apt install nvidia-utils-525-server # version 525.105.17-0ubuntu0.22.04.1
sudo apt install nvidia-utils-530 # version 530.41.03-0ubuntu0.22.04.2
sudo apt install nvidia-utils-510-server # version 510.47.03-0ubuntu3
sudo apt install nvidia-340 # version 340.108-0ubuntu2
sudo apt install nvidia-utils-435 # version 435.21-0ubuntu7
sudo apt install nvidia-utils-440 # version 440.82+really.440.64-0ubuntu6

After that, I tried again to install nvidia-driver-535 through Software & Updates using the same method as before, but I saw the same message again, as follows.

taihi@taihi-Precision-5680:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Is my procedure wrong?

Hello Markus san,

Long time no see.
I still have a problem deleting the drivers.

When I ran “sudo apt-get purge nvidia-driver-545”, I could not delete the driver because of the following error.

dpkg-divert: Error: libcamhal-common's diversion from /etc/modprobe.d/v4l2-relayd.conf to /etc/modprobe.d/v4l2-relayd.conf.orig' is libcamhal-ipu6ep’ Conflicts with ‘divert’ from /etc/modprobe.d/v4l2-relayd.conf to /etc/modprobe.d/v4l2-relayd.conf.orig with -common

Could you tell me how to resolve this error?

Best regards,