Unable to install NVIDIA drivers on Ubuntu Server 22.04

We had a Tesla T4 up and running smoothly on our PowerEdge R640 server with Ubuntu Server 18.04. However, after we upgraded the system to Ubuntu Server 22.04, the NVIDIA driver stopped working. The command nvidia-smi outputs the following:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.

After purging everything and re-installing (multiple times), the server seems to get into a boot loop. The following lines are repeated frequently in kern.log:

Jul 24 05:12:55 nc16 kernel: [327372.079138] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
Jul 24 05:12:55 nc16 kernel: [327372.079145] NVRM: The NVIDIA probe routine was not called for 1 device(s).
Jul 24 05:12:55 nc16 kernel: [327372.089000] NVRM: This can occur when a driver such as:
Jul 24 05:12:55 nc16 kernel: [327372.089000] NVRM: nouveau, rivafb, nvidiafb or rivatv
Jul 24 05:12:55 nc16 kernel: [327372.089000] NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Jul 24 05:12:55 nc16 kernel: [327372.089003] NVRM: Try unloading the conflicting kernel module (and/or
Jul 24 05:12:55 nc16 kernel: [327372.089003] NVRM: reconfigure your kernel without the conflicting
Jul 24 05:12:55 nc16 kernel: [327372.089003] NVRM: driver(s)), then try loading the NVIDIA kernel module
Jul 24 05:12:55 nc16 kernel: [327372.089003] NVRM: again.
Jul 24 05:12:55 nc16 kernel: [327372.089004] NVRM: No NVIDIA devices probed.

However, nouveau is already blacklisted:

File: /etc/modprobe.d/blacklist-nvidia-nouveau.conf

blacklist nouveau
options nouveau modeset=0

File: /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="nouveau.blacklist=1 quiet splash rdblaclist=nouveau nomodeset"
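
(For reference: as far as I know, changes to these two files only take effect after rebuilding the initramfs and regenerating the grub configuration, so on Ubuntu that would roughly be:)

$sudo update-initramfs -u    # rebuild the initramfs so the nouveau blacklist applies at early boot
$sudo update-grub            # regenerate grub.cfg from /etc/default/grub
$sudo reboot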

Here are some other outputs you might find informative:

Kernel:

$uname -r

5.15.0-76-generic

Graphic Devices:

$lspci | grep NVIDIA

3b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

$lspci | grep VGA

03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
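
(For completeness: a more detailed view of which kernel driver has actually claimed the card can be obtained with lspci -nnk; the bus address 3b:00.0 is simply taken from the output above.)

$lspci -nnk -s 3b:00.0    # shows "Kernel driver in use:" and "Kernel modules:" for the Tesla T4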

Here is also the nvidia-bug-report generated prior to purging everything.

nvidia-bug-report.log (488.4 KB)

Thank you in advance for your help.

@generix sorry to bother you. You seem to have a good handle on these issues, as your answers have helped most people with similar problems. I have tried a lot of other suggestions, but nothing works. Could you please assist me?

typo: should be rd.blacklist=nouveau
also remove the nomodeset parameter.

Thank you for your reply @Mart.

I corrected the typo and removed the nomodeset parameter. I then rebooted the system. I ran the following commands prior to installing the NVIDIA drivers:

$lsmod | grep nouveau

$lsmod | grep nvidia

Both returned nothing. I then executed the following:

$sudo ubuntu-drivers autoinstall

From that point forward the system went into a boot loop. Any other suggestions?

I took a look at your bug report.
Looks like you used the .run file installer after a distro package was already installed.
That usually creates a mess.

1: Run the same .run file again with the --uninstall parameter.
2: Purge the distro package you have installed with sudo apt purge.
3: Make sure you have the kernel headers installed. sudo apt install linux-headers-$(uname -r).
4: Do apt search nvidia-driver to get a list of available drivers.
5: Install your choice (versions 525 and above are current) with sudo apt install DRIVER_VERSION (see the sketch below).

I prefer to use apt, as it gives you output with information to work with.
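
Roughly, the whole sequence would look like this (the .run file name and the driver version are placeholders; use whatever you actually have and whatever the search output offers):

$sudo sh ./NVIDIA-Linux-x86_64-XXX.XX.run --uninstall    # 1: remove any .run-file installation (file name is a placeholder)
$sudo apt purge '*nvidia*'                               # 2: purge the distro packages
$sudo apt install linux-headers-$(uname -r)              # 3: headers for the running kernel
$apt search nvidia-driver                                # 4: list the available driver packages
$sudo apt install nvidia-driver-535                      # 5: example choice; any current version works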

Thank you again @Mart.

I followed your instructions but no luck.

I first tried to execute the .run file with the --uninstall parameter but got the message:

There is no NVIDIA driver currently installed.

I then executed the following:

$sudo apt-get remove --purge '^nvidia-.*'
$sudo apt autoremove nvidia* --purge

I then installed the kernel headers with $sudo apt install linux-headers-$(uname -r).

I then ran $apt search nvidia-driver, which gave me a lot of options:

nvidia-driver-535/unknown,now 535.54.03-0ubuntu1 amd64 [installed]
NVIDIA driver metapackage

nvidia-driver-535-open/jammy 535.86.05-0ubuntu0~gpu22.04.1 amd64
NVIDIA driver (open kernel) metapackage

nvidia-driver-535-server/jammy-updates,jammy-security 535.54.03-0ubuntu0.22.04.1 amd64
NVIDIA Server Driver metapackage

nvidia-driver-535-server-open/jammy-updates,jammy-security 535.54.03-0ubuntu0.22.04.1 amd64
NVIDIA driver (open kernel) metapackage

I went with nvidia-driver-535.
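
In other words, something like:

$sudo apt install nvidia-driver-535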

After rebooting, I got another boot loop, with the following messages repeating:

NVRM: The NVIDIA probe routine was not called for 1 device(s).
          Starting Load Kernel Module chromeos_pstore ...
          Starting Load Kernel Module efi_pstore...
          Starting Load Kernel Module pstore_blk...
          Starting Load Kernel Module pstore_zone...
          Starting Load Kernel Module ramoops...
[ OK ]    Finished Load Kernel Module efi_pstore.
[ OK ]    Finished Load Kernel Module pstore_blk.
[ OK ]    Finished Load Kernel Module pstore_zone.
[ OK ]    Finished Load Kernel Module ramoops.

Because of the loop I couldn't get into the server, so I restarted and booted 5.15.0-76-generic instead of 5.15.0-78-generic. I created a new nvidia-bug-report.log (489.2 KB), which I attach here.

Any clues on what is going on?

None of those take into account the libnvidia* files.
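
A pattern like this should catch those as well (apt lists the matching packages first, so double-check the list before confirming):

$sudo apt purge '*nvidia*'       # also matches the libnvidia-* packages
$sudo apt autoremove --purge     # clean up leftover dependencies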

Unfortunately the bug report is not very helpful, because it seems persistent logging is not enabled with journald.

Enabling persistent journald logging might help there.
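
The usual way to enable it (assuming the stock systemd configuration) is either of these, followed by a restart of journald:

$sudo mkdir -p /var/log/journal                 # with the default Storage=auto, this directory enables persistent logging
$sudo systemctl restart systemd-journald        # alternatively set Storage=persistent in /etc/systemd/journald.conf first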

Also please show the output of dpkg -l | grep nvidia

Also look for blacklist files:
grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*

nvidiafb should be blacklisted!
nvidia should not!

Reboot and create a new report.

I followed the instructions in the link you provided and set the storage to persistent in journald.

The output of $dpkg -l | grep nvidia is:

ii libnvidia-cfg1-535:amd64 535.54.03-0ubuntu1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-535 535.54.03-0ubuntu1 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-535:amd64 535.54.03-0ubuntu1 amd64 NVIDIA libcompute package
ii libnvidia-decode-535:amd64 535.54.03-0ubuntu1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-535:amd64 535.54.03-0ubuntu1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-535:amd64 535.54.03-0ubuntu1 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-535:amd64 535.54.03-0ubuntu1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-535:amd64 535.54.03-0ubuntu1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii nvidia-compute-utils-535 535.54.03-0ubuntu1 amd64 NVIDIA compute utilities
ii nvidia-dkms-535 535.54.03-0ubuntu1 amd64 NVIDIA DKMS package
ii nvidia-driver-535 535.54.03-0ubuntu1 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-535 535.54.03-0ubuntu1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-535 535.54.03-0ubuntu1 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA’s Prime
ii nvidia-settings 535.54.03-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-535 535.54.03-0ubuntu1 amd64 NVIDIA driver support binaries
ii screen-resolution-extra 0.18.2 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-535 535.54.03-0ubuntu1 amd64 NVIDIA binary Xorg driver

Also, $grep nvidia /etc/modprobe.d/* /lib/modprobe.d/* returned:

/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb

I created another nvidia-bug-report.log.gz (134.5 KB).

Well, the bug report is much more informative than before.

Just to make sure: did you try to boot into the …78 kernel before creating the report?
journalctl -b1 still shows the module conflict message.

I'm sorry, maybe we were barking up the wrong tree.
I looked up the supported GPUs list of the driver and didn’t find the Tesla T4 there.

But if you search for the T4 drivers, the download page gives you v460, last updated in 2021.
So I guess that's the driver download page mess I read about very often.

Lists driver v535. So I guess that would be accurate.

lspci shows the vfio-pci driver in use for the card.
Maybe that is interfering?
Could you try to disable the iommu and passthrough stuff please (remove all the non-standard stuff)?
Edit /etc/default/grub and run sudo update-grub.
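
To see whether the passthrough setup is actually active on the kernel you are running, something like this should show it:

$cat /proc/cmdline       # the parameters the running kernel was actually booted with
$lsmod | grep vfio       # loaded vfio modules suggest the card is being claimed for passthrough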

Also I found this, which might be worth a try (after purging the current driver with apt purge '*nvidia*'):

I ran the nvidia-bug-report from the ...76 kernel, since I cannot access ...78 due to the boot loop. I also tried a safe boot, but no luck.

I don’t quite get what you are referring to here:

lspci shows the vfio-pci driver in use for the card.
Maybe that is interfering?
Could you try to disable the iommu and passthrough stuff please (remove all the non-standard stuff)?
Edit /etc/default/grub and run sudo update-grub.

In the /etc/default/grub there is the line:

GRUB_CMDLINE_LINUX="console=tty1 console=ttyS1,115200n8 consoleblank=0 intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1 vfio-pci.ids=10de:1eb8"

To disable the iommu, I set it to intel_iommu=off. Running $sudo update-grub gives the following:

Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/50-curtin-settings.cfg'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.15.0-78-generic
Found initrd image: /boot/initrd.img-5.15.0-78-generic
Found linux image: /boot/vmlinuz-5.15.0-76-generic
Found initrd image: /boot/initrd.img-5.15.0-76-generic
Warning: os-prober will not be executed to detect other bootable partitions.
Systems on them will not be added to the GRUB boot configuration.
Check GRUB_DISABLE_OS_PROBER documentation entry.
done

After updating grub, I rebooted the machine, which once again went into a boot loop.

I will purge the drivers and follow the NVIDIA Driver Installation Quickstart. However, that guide says it is for Ubuntu 16.04 and 18.04. I will attach the results here once I am done.

I meant try to boot into …78.
Then into …76 and create the report.
journal entries from the …78 kernel will show up in the report.

For testing, remove all of these (see the example below). I don't know about the console entries, but I assume you had your reasons to put them there; the iommu and vfio-pci parameters are for passthrough, guessing that is what you want to do…
Run update-grub after that.
Boot into …78, then into …76 and create the report.
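
Assuming the console settings should stay, the trimmed line in /etc/default/grub would look something like this:

GRUB_CMDLINE_LINUX="console=tty1 console=ttyS1,115200n8 consoleblank=0"
$sudo update-grub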

I'm only guessing that the documentation is maybe not up to date.

Hey @Mart,

I removed the parameters in /etc/default/grub as discussed, and not only did it boot into 5.15.0-78-generic, but $nvidia-smi also works fine.

I don’t quite get why this happened though. I also did another reboot just to make sure.
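
For anyone landing here later, a quick sanity check after the reboot would be something like:

$lsmod | grep nouveau    # should return nothing
$lsmod | grep nvidia     # should list the nvidia modules
$nvidia-smi              # should show the Tesla T4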

Thank you very much for your help.
