510.53-RTX3070-Ubuntu 20.04 Blank Screen and dkms unsupported kernel when reinstalling

Hi,

I am unable to boot into Ubuntu 20.04 after an automatic NVIDIA update (similar to some other posts here regarding the recent 510 driver updates).

In short:
I have tried two different kernels: my original kernel (5.13.0-051300-generic) and a recent liquorix kernel (5.16.0-11.1-liquorix-amd64). In both cases, when using “sudo ubuntu-drivers autoinstall”, nvidia-dkms-510 tells me that my kernel headers are not supported. (I have also tried 5.4.0.99, but the system doesn’t boot at all with that kernel.)

Bug reports:
5.16.0-11.1-liquorix-amd64: Ubuntu Pastebin
5.13.0-051300-generic: Ubuntu Pastebin

Any help is greatly appreciated.

More detail:
I have tried to fully uninstall the nvidia drivers (after checking with “dpkg -l | grep nvidia” there are no packages listed). Then when I try to install with

sudo ubuntu-drivers autoinstall

It tries to install the 510 driver, but nvidia-dkms-510 tells me that my kernel headers are not supported. The installation then fails because dkms is not configured.

I have also tried the steps given in related posts (I can only post one here), but nothing has worked for me so far.

The kernel 5.13.0-051300-generic was built for 21.10 so is incompatible. Please remove it.
The liquorix kernel is compatible but the headers package seems to be either broken or not installed at all. Please boot to the liquorix kernel and run
sudo apt install --reinstall linux-headers-$(uname -r)
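
Before and after reinstalling, it may help to confirm that the headers DKMS builds against are actually in place. A minimal sketch, assuming the usual layout where /lib/modules/<release>/build points at the headers tree and that tree contains a top-level Makefile; headers_status is a hypothetical helper name, not part of any package:

```shell
#!/bin/sh
# Report whether a kernel build directory looks usable for DKMS.
# Assumption: a usable headers tree has a Makefile at its top level.
headers_status() {
    build_dir=$1
    if [ -f "$build_dir/Makefile" ]; then
        echo "present"
    else
        # Covers a missing directory as well as a dangling "build" symlink.
        echo "missing-or-broken"
    fi
}

# Example (run on the affected machine):
#   headers_status "/lib/modules/$(uname -r)/build"
```

If this prints missing-or-broken for the running kernel, the reinstall command above is the next thing to try.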

Thanks for your response. Reinstalling fixed this error. But it now gives this error:
ERROR (dkms apport): binary package for evdi: 5.2.14 not found

The installation seems to continue, however. So I tried rebooting, but it still gives a blank screen with these messages (maybe unrelated):
USBC000:00: failed to reset PPM!
USBC000:00: PPM init failed (-110)

Running nvidia-smi in tty gives:
Failed to initialize NVML: Driver/library version mismatch

Updated bug report: Ubuntu Pastebin

[ 40.265039] NVRM: API mismatch: the client has the version 510.54, but
NVRM: this kernel module has the version 510.47.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.

Seems you have installed different driver versions over one another. Please uninstall the runfile version and check if you can purge/reinstall the repo packages.
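
The two versions can be pulled out of the kernel log so they are easy to compare against what the package manager reports. This is only a sketch that matches the wording of the NVRM message quoted above; nvrm_mismatch_versions is a hypothetical helper name:

```shell
#!/bin/sh
# Read kernel log text on stdin and print the two versions named in an
# "NVRM: API mismatch" message as "client=<ver> module=<ver>".
# Prints nothing if no such message is found.
nvrm_mismatch_versions() {
    # Join lines first, because the message wraps across several log lines.
    tr '\n' ' ' | sed -n \
        's/.*client has the version \([0-9][0-9.]*\), but.*kernel module has the version \([0-9][0-9.]*\)\..*/client=\1 module=\2/p'
}

# Example (run on the affected machine):
#   sudo dmesg | nvrm_mismatch_versions
```

The "client" is the user-space side (libraries such as the one nvidia-smi uses), while the "module" is the loaded kernel driver. The version of the loaded module can also be read from /proc/driver/nvidia/version.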

I ran <runinstall_file> --uninstall (the 510.47 version, which also seems to be the driver that ubuntu-drivers autoinstall installs).

It tells me there is no nvidia driver installed, then exits.

I cannot find any trace of the 510.54 driver. Is there any other way I can find / remove it?

Please check for traces:

dpkg -l |grep nvidia
ls -l /lib/x86_64-linux-gnu/*nvidia*
ls -l /usr/lib/x86_64-linux-gnu/*nvidia*
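
To see at a glance whether more than one driver version is present in those directories, the version suffixes of the library file names can be collected. A rough sketch, assuming the libraries follow the usual <name>.so.<driver version> naming; nvidia_lib_versions is a hypothetical helper, and unrelated libraries with their own multi-part version numbers would also show up in its output:

```shell
#!/bin/sh
# Print each distinct multi-part version suffix found in *nvidia*.so.* file
# names in the given directory, e.g. libnvidia-ml.so.510.47.03 -> 510.47.03.
# Single-number suffixes such as ".so.1" (soname links) are ignored.
nvidia_lib_versions() {
    dir=$1
    for f in "$dir"/*nvidia*.so.*; do
        # Skip the literal pattern when nothing matches; keep dangling links.
        [ -e "$f" ] || [ -L "$f" ] || continue
        printf '%s\n' "${f##*/}"
    done | sed -n 's/.*\.so\.\([0-9][0-9]*\.[0-9][0-9.]*\)$/\1/p' | sort -u
}

# Example (run on the affected machine):
#   nvidia_lib_versions /usr/lib/x86_64-linux-gnu
```

More than one driver version in the output suggests leftovers from a second installation.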

There are indeed two versions still present in the last two locations (command outputs below)

dpkg -l |grep nvidia → Ubuntu Pastebin
ls -l /lib/x86_64-linux-gnu/*nvidia* → Ubuntu Pastebin
ls -l /usr/lib/x86_64-linux-gnu/*nvidia* → Ubuntu Pastebin

Is there a way to remove the .54 version?

I wonder where that comes from. Maybe purge the packages first, then use the 510.54 runfile installer to overwrite the .54 version, then use it again to uninstall, check if all files are gone, then install the repo driver again.

I tried your suggestion and it did install (giving the error “binary package for nvidia: 510.54 not found”, but seemingly installing correctly). Uninstalling then finds the driver and completes successfully, but the files remain in the last two locations (example of /usr/lib/… below). So it does not seem to remove the files related to 510.54 properly.

https://paste.ubuntu.com/p/vG4xk5MmhT/

Any other ideas?

Try using the runfile again, but this time with -b to just overwrite the files.

Thanks, your suggestion worked and the .54 is now removed. I ran sudo ubuntu-drivers autoinstall to install the 510.47 driver.

However, it still doesn’t boot (just a blank screen and the blinking white bar again), although the NVIDIA driver seems to be installed now: nvidia-smi reports the 510.47 driver working correctly.

Updated bug report: Ubuntu Pastebin

Any idea how I can make it boot now?

Looks like the nvidia driver is loading too late, please try embedding it into the initrd.

What would be the command for doing so? I can’t find anything online.

I did try simply updating the initrd file with sudo update-initramfs -u -k $(uname -r), but that hasn’t solved the issue.
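
For reference, one common way to get modules into the initrd on Ubuntu with initramfs-tools is to list them in /etc/initramfs-tools/modules and then rebuild the initrd. The thread does not confirm that this is the method that was meant, so treat the following as a sketch. It assumes the driver's kernel modules are named nvidia, nvidia_modeset, nvidia_uvm and nvidia_drm; add_nvidia_modules is a hypothetical helper:

```shell
#!/bin/sh
# Append the NVIDIA module names to an initramfs-tools modules file,
# skipping names that are already listed on a line of their own.
# The file path is a parameter so this can be tried on a copy first.
add_nvidia_modules() {
    modules_file=$1
    for m in nvidia nvidia_modeset nvidia_uvm nvidia_drm; do
        grep -qx "$m" "$modules_file" 2>/dev/null \
            || printf '%s\n' "$m" >> "$modules_file"
    done
}

# Example (as root, on the affected machine):
#   add_nvidia_modules /etc/initramfs-tools/modules
#   update-initramfs -u -k "$(uname -r)"
#   lsinitramfs "/boot/initrd.img-$(uname -r)" | grep nvidia
```

The last command lists the contents of the rebuilt initrd, which shows whether the modules were actually included.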

Hi, thanks for your support! Most parts of my system are up and running again with Nvidia drivers. Thanks!

I am, however, still stuck on one issue regarding the NVIDIA OpenCL libraries. Some applications are throwing:
error while loading shared libraries: libOpenCL.so.1: cannot open shared object file: No such file or directory

It seems that I am missing libOpenCL.so.1.0.0. There is a symlink, but the actual library is missing.

Looking online, it seems that it should be installed with the libnvidia-compute-510 package, but it is missing for me. Is there anything I can do to fix it?

That’s only the OpenCL loader; it doesn’t belong to the NVIDIA driver but to ocl-icd. It’s just missing the compatibility link:
sudo ln -s /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/x86_64-linux-gnu/libOpenCL.so.1

I have tried your solution. The problem is that there is no file /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0, i.e., the library is missing as in the image below:

image

I have tried to reinstall ocl-icd-opencl-dev, but that only seems to recreate the symbolic link libOpenCL.so without providing the library (libOpenCL.so.1.0.0) itself. Do you know how I can retrieve the underlying library?

Sorry, I misread your ls output. The library is provided by the ubuntu package ocl-icd-libopencl1
https://packages.ubuntu.com/focal/ocl-icd-libopencl1
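
A quick way to tell a dangling symlink from a healthy one, which is the distinction that mattered here, can be sketched as follows; link_status is a hypothetical helper name:

```shell
#!/bin/sh
# Classify a path as "ok" (exists, following symlinks), "dangling"
# (a symlink whose target does not exist), or "missing" (nothing there).
link_status() {
    path=$1
    if [ -L "$path" ] && [ ! -e "$path" ]; then
        echo "dangling"
    elif [ -e "$path" ]; then
        echo "ok"
    else
        echo "missing"
    fi
}

# Example (run on the affected machine):
#   link_status /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#   dpkg -S libOpenCL.so.1     # which installed package owns the file
#   sudo apt install --reinstall ocl-icd-libopencl1
```

A "dangling" result means the link exists but the library it points to does not, so the package that ships the library needs to be reinstalled.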

Reinstalling that package did the trick! Thanks!