A6000 nvidia-smi "couldn't communicate with the NVIDIA driver" ubuntu22

Hi! I swapped a 1050 Ti for an A6000 in our server. I ran into the thread "RTX A6000 on Ubuntu 20.04 - SMI: No Devices Were Found", which describes a very similar issue, especially errors of this type:
[ 0.201660] pci 0000:2d:00.0: BAR 1: no space for [mem size 0x1000000000 64bit pref]

Consequently, I:

  1. Switched from CSM to UEFI (more precisely, disabled CSM; Secure Boot is off)
  2. Enabled “Above 4G Decoding”
  3. Used displaymodeselector to switch to the 256 MB setting (though I believe this is the default and was already active).
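
The BAR errors that motivated steps 1 and 2 can be looked for directly in the kernel log. A minimal sketch, assuming the log is read through dmesg; the function name find_bar_failures is made up for illustration:

```shell
# Print kernel-log lines that report a failed PCI BAR assignment.
# Reads the log text from stdin, so it works with `dmesg` or a saved log file.
find_bar_failures() {
    grep -E 'BAR [0-9]+: (no space for|failed to assign)'
}

# Typical use on the affected machine (dmesg needs root on most systems):
#   sudo dmesg | find_bar_failures
```

If the filter prints nothing after Above 4G Decoding is enabled, the BAR assignment itself is probably no longer the problem.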

I also tried both the 515 and 525 drivers (always purging the old driver completely and installing with apt).

Any help is greatly appreciated!!

edit: Typically I use the server headless, but on a plugged in monitor sometimes this error appears:
module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 1, loc 000000000832a0eef, val ffffffffc337081e
module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 1, loc 000000000832a0eef, val ffffffffc337081e
module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 1, loc 0000000009a257768, val ffffffffc68f881e

Some details:

nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

sudo lshw -c display
*-display
description: VGA compatible controller
product: GA102GL [RTX A6000]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:21:00.0
logical name: /dev/fb0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller cap_list fb
configuration: depth=32 latency=0 mode=1920x1080 visual=truecolor xres=1920 yres=1080
resources: iomemory:2bf0-2bef iomemory:2bf0-2bef memory:d0000000-d0ffffff memory:2bf70000000-2bf7fffffff memory:2bf68000000-2bf69ffffff ioport:2000(size=128) memory:d1000000-d107ffff
*-display
description: VGA compatible controller
product: ASPEED Graphics Family
vendor: ASPEED Technology, Inc.
physical id: 0
bus info: pci@0000:42:00.0
logical name: /dev/fb0
version: 41
width: 32 bits
clock: 33MHz
capabilities: pm msi vga_controller cap_list rom fb
configuration: depth=32 driver=ast latency=0 resolution=1920,1080
resources: irq:273 memory:d9000000-d9ffffff memory:da000000-da01ffff ioport:4000(size=128) memory:c0000-dffff

dkms status
iser/4.9, 5.4.0-135-generic, x86_64: installed
isert/4.9, 5.4.0-135-generic, x86_64: installed
kernel-mft-dkms/4.15.1, 5.15.0-56-generic, x86_64: installed
kernel-mft-dkms/4.15.1, 5.4.0-135-generic, x86_64: installed
knem/1.1.3.90mlnx1, 5.15.0-56-generic, x86_64: installed
knem/1.1.3.90mlnx1, 5.4.0-135-generic, x86_64: installed
mlnx-ofed-kernel/4.9, 5.4.0-135-generic, x86_64: installed
mlnx-rdma-rxe/4.9, 5.4.0-135-generic, x86_64: installed
nvidia/525.60.11, 5.15.0-56-generic, x86_64: installed
rshim/1.18, 5.15.0-56-generic, x86_64: installed
rshim/1.18, 5.4.0-135-generic, x86_64: installed
srp/4.9, 5.4.0-135-generic, x86_64: installed
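
In the listing above, the nvidia module is registered only for 5.15.0-56-generic, while several other modules also exist for 5.4.0-135-generic. A small sketch for checking whether a module is installed for a particular kernel; dkms_installed_for is a hypothetical helper that assumes the status format shown above:

```shell
# Succeed if `dkms status` output (on stdin) lists MODULE as installed for KERNEL.
# Expected line format: "name/version, kernel, arch: state"
dkms_installed_for() {
    module=$1
    kernel=$2
    awk -F'[,:] *' -v m="$module" -v k="$kernel" '
        { split($1, nv, "/") }                        # nv[1] is the module name
        nv[1] == m && $2 == k && $NF == "installed" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Typical use:
#   dkms status | dkms_installed_for nvidia "$(uname -r)" && echo "built for running kernel"
```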

lspci -vv | grep -i nvidia
21:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation GA102GL [RTX A6000]
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
21:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
Subsystem: NVIDIA Corporation GA102 High Definition Audio Controller
nvidia-bug-report.log.gz (154.6 KB)
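
The lspci lines above list candidate kernel modules but no "Kernel driver in use:" entry, and lshw reports driver=ast for the ASPEED device but no driver for the A6000. A sketch for checking the binding explicitly, assuming lspci -k output; bound_driver is a hypothetical helper name:

```shell
# Print the driver bound to a device, given `lspci -k -s <slot>` output on stdin.
# Prints "none" when there is no "Kernel driver in use:" line.
bound_driver() {
    awk -F': ' '
        /Kernel driver in use:/ { print $2; found = 1; exit }
        END { if (!found) print "none" }
    '
}

# Typical use for the GPU at slot 21:00.0:
#   lspci -k -s 21:00.0 | bound_driver
```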

Please try removing kernel headers and nvidia driver, then reinstall them.

thanks. I am not sure how to remove the kernel headers (or what exactly that means ;-))

I typically run sudo apt-get purge \*nvidia\*, then sudo apt autoremove, reboot, and reinstall using sudo apt-get install nvidia-driver-525

Which additional step should I include?

something like
sudo apt remove --purge linux-headers-$(uname -r)

this did not help. I removed both; then I installed first the header, then the driver - or does the order matter?

Edit: added a new nvidia bug report
nvidia-bug-report.log.gz (157.6 KB)

This thread is the same issue, please check:
https://forums.developer.nvidia.com/t/cannot-get-nvidia-driver-520-515-515-open-or-510-working-in-ubuntu-22-10/231860/30

I tried to follow all the steps in different order, but I did not succeed.

sudo apt purge linux-image-generic -y
sudo apt purge linux-headers-generic -y
sudo apt install linux-image-generic -y
sudo apt install linux-headers-generic -y
sudo apt remove --purge '^nvidia-.*' -y
sudo apt remove --purge '^libnvidia-.*' -y
sudo rm /etc/X11/xorg.conf | true
sudo rm /etc/X11/xorg.conf.d/90-nvidia-primary.conf | true
sudo rm /usr/share/X11/xorg.conf.d/10-nvidia.conf | true
sudo rm /usr/share/X11/xorg.conf.d/11-nvidia-prime.conf | true
sudo rm /etc/modprobe.d/nvidia-kms.conf | true
sudo rm /lib/modprobe.d/nvidia-kms.conf | true
sudo apt update -y && sudo apt full-upgrade -y && sudo apt autoremove -y && sudo apt clean -y && sudo apt autoclean -y
sudo reboot

and then apt install nvidia-driver-525 - it doesn't matter whether I reboot first or not.

edit: I tried sudo dkms install --force nvidia/525.60.11 (+ removal) at various steps.
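
A side note on the cleanup commands above: `rm file | true` pipes rm into true, so the pipeline's exit status is 0, but rm still prints an error when the file is missing. A sketch of the same cleanup with rm -f, which stays silent for missing files; remove_stale_configs is a hypothetical helper name:

```shell
# Remove each given path if it exists; missing paths are ignored silently.
remove_stale_configs() {
    for path in "$@"; do
        rm -f -- "$path"
    done
}

# Typical use (run as root, e.g. from a root shell):
#   remove_stale_configs /etc/X11/xorg.conf /etc/modprobe.d/nvidia-kms.conf
```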

Maybe this helps:

sudo modprobe nvidia --v
modprobe: INFO: …/libkmod/libkmod.c:367 kmod_set_log_fn() custom logging function 0x55e161f12830 registered
insmod /lib/modules/5.15.0-56-generic/updates/dkms/nvidia.ko NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0666
modprobe: INFO: …/libkmod/libkmod-module.c:892 kmod_module_insert_module() Failed to insert module ‘/lib/modules/5.15.0-56-generic/updates/dkms/nvidia.ko’: Exec format error
modprobe: ERROR: could not insert ‘nvidia’: Exec format error
modprobe: INFO: …/libkmod/libkmod.c:334 kmod_unref() context 0x55e162f26430 released
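
"Exec format error" means the kernel rejected the module file. One common cause, though not necessarily the one here, is a module built for a different kernel than the one running. A sketch that compares the two, assuming the vermagic string comes from modinfo -F vermagic; vermagic_matches is a hypothetical helper name:

```shell
# Succeed if a vermagic string (as printed by `modinfo -F vermagic`) names the
# given kernel release (as printed by `uname -r`) as its first word.
vermagic_matches() {
    vermagic=$1
    release=$2
    [ "${vermagic%% *}" = "$release" ]
}

# Typical use:
#   ko=/lib/modules/$(uname -r)/updates/dkms/nvidia.ko
#   vermagic_matches "$(modinfo -F vermagic "$ko")" "$(uname -r)" || echo "mismatch"
```

If the versions match, the kernel log right after the failed modprobe usually states the actual reason for the rejection.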

Please check if you can work around it by using the liquorix ppa to get another kernel.

thanks for the suggestion. did you ask this because it is an alternative kernel, or because it goes to kernel 6.0?

After installation I see several compatibility issues:
ERROR (dkms apport): kernel package linux-headers-6.0.0-11.2-liquorix-amd64 is not supported

I am a bit wary of trying this - if an older, pre-6.0 liquorix kernel could address the same idea you had, I’d be happier to try that.

I tried it anyway - I felt lucky and several people are urgently waiting to use this machine.

Perhaps unsurprisingly, it didn’t work.

ERROR (dkms apport): kernel package linux-headers-6.0.0-11.2-liquorix-amd64 is not supported
Error! Bad return status for module build on kernel: 6.0.0-11.2-liquorix-amd64 (x86_64)
Consult /var/lib/dkms/nvidia/525.60.11/build/make.log for more information.
dpkg: error processing package nvidia-dkms-525 (--configure):
installed nvidia-dkms-525 package post-installation script subprocess returned error exit status 10
Setting up libnvidia-encode-525:amd64 (525.60.11-0ubuntu0.22.04.1) …
dpkg: dependency problems prevent configuration of nvidia-driver-525:
nvidia-driver-525 depends on nvidia-dkms-525 (<= 525.60.11-1); however:
Package nvidia-dkms-525 is not configured yet.
nvidia-driver-525 depends on nvidia-dkms-525 (>= 525.60.11); however:
Package nvidia-dkms-525 is not configured yet.

dpkg: error processing package nvidia-driver-525 (--configure):
dependency problems - leaving unconfigured
Processing triggers for bamfdaemon (0.5.6+22.04.20220217-0ubuntu1) …
Rebuilding /usr/share/applications/bamf-2.index…
No apport report written because the error message indicates its a followup error from a previous failure.
Processing triggers for desktop-file-utils (0.26-1ubuntu3) …
Processing triggers for gnome-menus (3.36.0-1ubuntu3) …
Processing triggers for libc-bin (2.35-0ubuntu3.1) …
Processing triggers for man-db (2.10.2-1) …
Processing triggers for mailcap (3.70+nmu1ubuntu1) …
Processing triggers for initramfs-tools (0.140ubuntu13) …
update-initramfs: Generating /boot/initrd.img-6.0.0-11.2-liquorix-amd64
W: Possible missing firmware /lib/firmware/ast_dp501_fw.bin for module ast
Errors were encountered while processing:
nvidia-dkms-525
nvidia-driver-525
needrestart is being skipped since dpkg has failed
E: Sub-process /usr/bin/dpkg returned an error code (1)

nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Something is really broken in your build system. I wanted you to install the liquorix kernel because it’s a different, compatible kernel that is frequently used.
Now dkms says it’s incompatible and the build fails.
Please attach the referenced make.log

DKMS make.log for nvidia-525.60.11 for kernel 6.0.0-11.2-liquorix-amd64 (x86_64)
Tue Dec 6 11:29:28 PM CET 2022
make[1]: Entering directory ‘/usr/src/linux-headers-6.0.0-11.2-liquorix-amd64’
test -e include/generated/autoconf.h -a -e include/config/auto.conf || (
echo >&2;
echo >&2 " ERROR: Kernel configuration is invalid.";
echo >&2 " include/generated/autoconf.h or include/config/auto.conf are missing.";
echo >&2 " Run 'make oldconfig && make prepare' on kernel src to fix it.";
echo >&2 ;
/bin/false)

ERROR: Kernel configuration is invalid.
include/generated/autoconf.h or include/config/auto.conf are missing.
Run ‘make oldconfig && make prepare’ on kernel src to fix it.

make[1]: *** [Makefile:741: include/config/auto.conf] Error 1
make[1]: Leaving directory ‘/usr/src/linux-headers-6.0.0-11.2-liquorix-amd64’
make: *** [Makefile:82: modules] Error 2

That doesn’t help, those messages are always displayed. Please attach the full log.

If I remember correctly, this was the whole logfile in /var/lib/dkms/nvidia/525.60.11/build/make.log - maybe you meant a different one?

Unfortunately, I can’t check - it seems the headless server didn’t like my last reboot + kernel mix and may be stuck in GRUB or similar. I have to ask the admin to check tomorrow.

I will also ask the admin to try a clean install of Ubuntu 22.04 on a separate hard drive.

The file
/usr/src/linux-headers-6.0.0-11.2-liquorix-amd64/include/config/auto.conf
should exist, otherwise something failed installing the headers.
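
That check can be scripted. A minimal sketch that tests for the two files named in the make.log error; headers_configured is a hypothetical helper name:

```shell
# Succeed if a kernel headers tree contains the two generated config files
# the module build tests for; print each missing file to stderr.
headers_configured() {
    dir=$1
    rc=0
    for f in include/generated/autoconf.h include/config/auto.conf; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Typical use:
#   headers_configured /usr/src/linux-headers-6.0.0-11.2-liquorix-amd64
```

If either file is reported missing, reinstalling the matching linux-headers package is the likely next step.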