Hi,
I am on Ubuntu 20.04, and after a recent update to kernel 5.15.0-69-generic the GPU driver was no longer recognized. I created a bug report: nvidia-bug-report.log (2.2 MB)
After some research, I managed to make it work using prime-select nvidia. However, this solution makes it impossible to use the two GPUs I have (Intel and NVIDIA). Before the update, my NVIDIA GPU was only used for deep learning and the Intel GPU for rendering. After switching to prime-select nvidia, the Intel GPU was never used.
I also tried prime-select on-demand, but in that case my NVIDIA GPU was not working…
I posted a second bug report, taken with prime-select nvidia active, in the first comment.
Any help would be appreciated :)
Did you install the NVIDIA driver before or after the kernel upgrade? Or did the kernel upgrade initiate an automatic driver update? The latter is quite often a bit problematic and you should do a manual re-install instead.
On first look at your logs I can only see two entries that might be suspicious. Check these parts in your logs:
warning: the compiler differs from the one used to build the kernel
The kernel was built by: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
You are using: cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
While this should not matter, I would still check if cc is simply a symlink to gcc or not and, if necessary, re-install the driver after fixing this.
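For example, something along these lines should show what cc actually resolves to (a minimal check; exact paths can differ on your system):
# Show what cc resolves to and compare it with gcc
readlink -f $(which cc)
which gcc
cc --version
gcc --version
If cc does not resolve to gcc, fix that first and then re-install the driver.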
Secondly:
avril 06 16:05:50 robin-Precision-7560 nvidia-persistenced[22122]: device 0000:01:00.0 - persistence mode disabled.
avril 06 16:05:50 robin-Precision-7560 nvidia-persistenced[22122]: device 0000:01:00.0 - NUMA memory offlined.
avril 06 16:05:50 robin-Precision-7560 nvidia-persistenced[22122]: PID file unlocked.
avril 06 16:05:50 robin-Precision-7560 nvidia-persistenced[22122]: PID file closed.
avril 06 16:05:50 robin-Precision-7560 nvidia-persistenced[22122]: Shutdown (22122)
persistenced is not running. That also indicates an issue with the driver installation.
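You can check the service state directly, e.g.:
# Check whether the persistence daemon is active
systemctl status nvidia-persistenced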
Could you clarify one thing for me please?
How did you verify this? `nvidia-smi` did not show an issue in the second log. Could you do prime-select on-demand, reboot and then add the output of a plain call to nvidia-smi here?
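Roughly:
sudo prime-select on-demand   # switch back to on-demand mode
sudo reboot
# after the reboot:
nvidia-smi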
Thanks for your answer, and sorry for the delay; I was offline last week.
In my case, the kernel update did not initiate an automatic update of the driver, but after the kernel update the GPU driver was not working anymore. I then tried to manually re-install the recommended version (nvidia-driver-525-open), but it did not improve my situation, so I re-installed nvidia-driver-510, which enabled me to use my GPU again. Still, I could not split the usage (Intel for everything except deep learning, NVIDIA for deep learning).
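Roughly what I ran, from memory, using the Ubuntu packages:
ubuntu-drivers devices                         # lists nvidia-driver-525-open as the recommended driver
sudo apt install --reinstall nvidia-driver-525-open
# and, after that did not help:
sudo apt install nvidia-driver-510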
I would still check if cc is simply a symlink to gcc
How would you do that?
How did you verify this? `nvidia-smi` did not show an issue in the second log
For the second log, I used prime-select nvidia because prime-select on-demand did not give me any results.
I followed your advice and it worked: after the reboot on prime-select on-demand, the two GPUs are correctly used.
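Roughly how I checked which GPU is doing what (glxinfo comes from the mesa-utils package):
glxinfo | grep "OpenGL renderer"               # default rendering -> Intel
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep "OpenGL renderer"   # offloaded rendering -> NVIDIA
nvidia-smi                                     # NVIDIA GPU still available for compute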
Here is a last bug report: nvidia-bug-report.log (2.7 MB)
However, after that, all my Python environments are broken: even though Torch and TensorFlow detect the GPU, they cannot use it.
with torch:
ValueError: GPU is not accessible. Was the library installed correctly?
with tensorflow:
2023-04-24 11:29:21.442756: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
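The quick checks I used look roughly like this; both frameworks list the device, but actual computation fails:
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"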
For the CUDA issue, the reason might be that you changed the driver from 525 to 510 in between. Possibly there is a CUDA toolkit mismatch at that point.
Cleaning this up might be a bit complicated. The best would be to remove CUDA completely and freshly install the toolkit following the installation instructions.
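A rough sketch of the cleanup, assuming the toolkit came from apt packages (adjust if you used the runfile installer):
# List the CUDA-related packages that are currently installed
dpkg -l | grep -i cuda
# Purge them; the package names depend on how CUDA was installed, e.g. the cuda meta-package or nvidia-cuda-toolkit
sudo apt-get --purge remove cuda nvidia-cuda-toolkit
sudo apt-get autoremove
Then install the toolkit again following the official installation instructions for your Ubuntu version.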
Thanks for the reply!
I did a full upgrade to Ubuntu 22.04 and made a fresh install with a new driver, CUDA, cuDNN, etc.
Now this is working fine :)
Thanks for the help!