(K)Ubuntu with NVidia GeForce RTX 3060 Ti freezes during boot

Hi all,

After spending four days trying virtually every idea and suggestion I came across, I admit to being utterly dumbfounded by the problem I’m facing. I would appreciate it if somebody could point me in the right direction. My problem is as follows:

I’m currently setting up a new computer with the following specs:

  • Mainboard: Mag B560 Tomahawk WIFI by MSI
  • Graphics card: GeForce RTX 3060 Ti by KFA2
  • CPU: Intel i7-11700KF
  • Three screens, two of which are connected via DisplayPort, one via HDMI.
  • Secure Boot is disabled. Fast Boot is disabled.

I installed the latest version of Kubuntu Linux (22.04 Jammy Jellyfish, as of this writing) and initially got the system running more or less out of the box. However, after I had installed a few updates and lots of programs, the problems started:

In a nutshell, it has become virtually impossible to get the system to boot, as it always freezes before switching to the graphical login screen. From the beginning I suspected the NVidia drivers to be at least part of the problem, so I started booting with quiet splash removed from the GRUB entry and noplymouth nomodeset added, to better understand what’s happening. I then noticed that the boot process mostly froze when the USB ports were queried, and managed to trace this to one of my tablets. After I moved the tablet from a USB 2 port to a USB 3 port, that problem went away, i.e. the boot process no longer stops there and runs roughly until the graphical login screen should appear.
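When editing the GRUB entry by hand, it is easy to end up booting with the old options after all, so it may help to confirm what the running kernel actually received. Here is a minimal Python sketch that parses a kernel command line such as the contents of /proc/cmdline. The function names parse_cmdline and check_boot_flags are made up for illustration, and the plain whitespace split assumes there are no quoted values containing spaces.

```python
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}.

    Bare flags such as "nomodeset" map to None. Assumption: no quoted
    values with embedded spaces, so a plain whitespace split is enough.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def check_boot_flags(cmdline: str) -> Dict[str, bool]:
    """Report whether the flags discussed above are present."""
    params = parse_cmdline(cmdline)
    return {flag: flag in params for flag in ("quiet", "splash", "nomodeset")}
```

On the affected machine this could be called as check_boot_flags(open("/proc/cmdline").read()). If quiet or splash still shows up as True, the edited GRUB entry was not the one that booted.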

I then read about a bug in some versions of the NVidia driver related to screens connected via DisplayPort (sorry, I’m not allowed to post more than one link here). However, the versions mentioned in these threads don’t seem to be relevant for my card anymore. What’s more, I also tried booting with only one monitor plugged into an HDMI port, without any luck.

I then proceeded to trying different versions of the driver with varying results:

  • 450-server (from the repositories, mentioned in one of the threads above): resulted in lots of errors “Failed to start nvidia persistence daemon”.
  • 460.84 (from the NVidia site): fails to compile
  • 470.57 (from the NVidia site): fails to compile
  • 470.63 (from the NVidia site): freezes before switching to the graphical login screen
  • 470.86 (from the NVidia site): freezes before switching to the graphical login screen
  • 470.94 (from the NVidia site): freezes before switching to the graphical login screen
  • 470.103 (from the repositories): freezes before switching to the graphical login screen
  • 510.60 (both from the repositories and from the NVidia site): freezes before switching to the graphical login screen.

In most cases, I managed to at least reach a console by booting to run level 3 (adding 3 to the kernel options in GRUB), where I could then uninstall the driver. In some cases, however, I had to explicitly blacklist the NVidia drivers (adding module_blacklist=nvidia to the kernel options in GRUB). In general, drivers with versions 510.xx seem to require this more often, but I can’t quite put my finger on the cause.

I then tried to boot an older kernel and switched from 5.15.0-27 to 5.15.0-25 (no other kernel is available). At first, this seemed to improve things, as I managed to get the system to boot using driver versions 470.63 and later. Alas, my joy was short-lived, as the problems came back. I now seem to be able to boot to a working system roughly once in every 50 attempts. In these cases, however, only one of my screens is detected.

That made me suspect it might have something to do with timing. I came across a thread that suggested to “load the Nvidia kernel modules a little earlier in the boot process” by adding them to /etc/modules. But this didn’t have any noticeable effect in my case.

The fact that only one of my three monitors was detected whenever I was able to boot made me suspect that the driver didn’t even get loaded. And indeed: I now noticed “(EE) Failed to load module nvidia (module does not exist, 0)” in /var/log/Xorg.0.log, so I searched for the driver and found it was installed in /usr/lib/x86_64-linux-gnu/nvidia/xorg. At that point I had already installed, purged and reinstalled different versions of the drivers many times, so I wondered why none of these steps had set the correct ModulePath. Nonetheless, I manually added the line ModulePath "/usr/lib/x86_64-linux-gnu/nvidia/xorg,/usr/lib/xorg/modules" to /etc/X11/xorg.conf and managed to at least get the driver to load during boot. Unfortunately, I now get the error "(EE) NVIDIA: Failed to initialize the NVIDIA kernel module". The dmesg output doesn’t contain any helpful hints.
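Since the Xorg log is long and mostly informational, pulling out only the error lines makes messages like the two quoted above easier to spot after each attempt. The following is a minimal Python sketch under a few assumptions: the log uses the usual layout of an optional "[ 12.345]" timestamp followed by a marker such as (II), (WW) or (EE), and the helper name xorg_errors is invented for this example.

```python
import re
from typing import List

# Matches lines whose marker is (EE), with or without a leading timestamp.
# The legend line near the top of the log begins with (WW), so it is skipped.
_EE_LINE = re.compile(r"^\s*(?:\[\s*\d+\.\d+\]\s*)?\(EE\)\s*(.*)$")


def xorg_errors(log_text: str) -> List[str]:
    """Return the message part of every (EE) line in an Xorg log."""
    errors: List[str] = []
    for line in log_text.splitlines():
        match = _EE_LINE.match(line)
        # Xorg sometimes prints bare "(EE)" separator lines; ignore those.
        if match and match.group(1).strip():
            errors.append(match.group(1).strip())
    return errors
```

A typical call would be xorg_errors(open("/var/log/Xorg.0.log", errors="replace").read()). An empty result only means that X itself logged no errors; problems in the kernel module would still have to be looked for in dmesg.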

Lastly, in my desperation I tried a simple sudo ubuntu-drivers autoinstall (which installed driver version 510.60). The system first booted into the latest kernel (-27, see above), which did not work, but after I switched back to the previous kernel version (-25), the system immediately booted with everything working perfectly. However, after only a single reboot, the problems came back, and now even the autoinstall doesn’t have any effect anymore.

By the way:

  • I made sure the nouveau driver is blacklisted.
  • I tried to recreate the file xorg.conf using sudo nvidia-xconfig several times and finally even deleted the whole directory /etc/X11. None of this had any effect.
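Whether a blacklist actually took effect can be checked on a running system by looking at which GPU kernel modules are loaded. Below is a small Python sketch that filters text in the format of /proc/modules, where the first field of each line is the module name. The function name loaded_gpu_modules and the exact set of module names checked are assumptions made for this example.

```python
from typing import List

# Module names assumed relevant here: the open-source nouveau driver and
# the modules usually shipped with the proprietary NVIDIA driver.
GPU_MODULES = {"nouveau", "nvidia", "nvidia_drm", "nvidia_modeset", "nvidia_uvm"}


def loaded_gpu_modules(proc_modules_text: str) -> List[str]:
    """Return the sorted GPU-related module names found in /proc/modules text."""
    names = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return sorted(names & GPU_MODULES)
```

Called as loaded_gpu_modules(open("/proc/modules").read()), a result containing nouveau would suggest the blacklist is not being applied, while an empty list means neither driver is loaded at all.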

I’ve now run out of ideas and would be grateful for any hints and suggestions to get my system up and running.
Thanks in advance.

nvidia-bug-report.log.gz (69.2 KB)

Please try using the driver from the graphics driver ppa.

I have the PPA installed and whenever I mentioned drivers “from the repositories” I actually meant drivers from this PPA.

So, for lack of a better idea, I finally reinstalled the whole system (Kubuntu) and decided to activate the loading of “third party” drivers right from the start.

After the installation had finished and I rebooted the computer, the boot process froze before reaching the graphical login (driver version 510)! I then blacklisted both the nvidia and nouveau drivers and managed to at least reach the graphical interface, but it seems there is something very fundamentally wrong.

Can anybody think of something I could try next to resolve this? (By the way: the system boots without any problems in Windows, so this has to be a software issue.)

  1. update bios
  2. check if this is power related by locking performance
    https://forums.developer.nvidia.com/t/how-to-lock-powermizer-performance-level-to-the-lowest-level-with-optimus/200078/6?u=generix
    NB: the linux driver has unlocked boost clocks so you will notice (hardware) failures not apparent in windows

Thanks for the suggestions. Indeed, I ended up updating the BIOS, but that didn’t seem to have any effect. I also installed the package mainline, which allowed me to install older and newer kernels than were available in the current version of (K)Ubuntu. Tested a few of them, unfortunately without any luck.

However, I can now happily report that I finally managed to get the system to boot without any problems and run smoothly. The problem seems to have been that the graphics card was plugged into a PCIe slot that MSI calls “PCIe Steel Armor” (also described as “Lightning Gen 4 PCIe”). At first I didn’t even recognize this as a PCIe slot, as it has a silver cover. However, as soon as I had moved the card to a “normal” PCIe slot, all problems were gone immediately, even those that didn’t seem to have anything to do with the graphics card (I was even able to plug my tablet back into the USB 2.0 port).
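For anyone wanting to compare the two slots, or to collect data before contacting the board vendor, the kernel exposes the negotiated PCIe link of each device in sysfs. This is a rough Python sketch, assuming the files current_link_speed, max_link_speed, current_link_width and max_link_width are present for the card and contain values like "16.0 GT/s PCIe" and "16". The function names are invented for this example, and the device address used in the note below is only a placeholder.

```python
from pathlib import Path
from typing import Dict

LINK_FILES = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def parse_link_value(text: str) -> float:
    """Turn '16.0 GT/s PCIe' or '16' into a number (first token only).

    Raises ValueError if the kernel reports something like 'Unknown'.
    """
    return float(text.split()[0])


def read_link_status(device_dir: Path) -> Dict[str, float]:
    """Read the four link attributes from a PCI device's sysfs directory."""
    return {
        name: parse_link_value((device_dir / name).read_text())
        for name in LINK_FILES
    }


def link_is_reduced(status: Dict[str, float]) -> bool:
    """True if the link runs below the maximum speed or width it supports."""
    return (
        status["current_link_speed"] < status["max_link_speed"]
        or status["current_link_width"] < status["max_link_width"]
    )
```

With a placeholder address, this would be read_link_status(Path("/sys/bus/pci/devices/0000:01:00.0")), where the real address can be found with lspci. A reduced link is not proof of a fault, since GPUs commonly lower the link speed while idle, so the value is best read under load and compared between the two slots.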

Can anybody think of a reason why this PCIe Steel Armor slot might be causing these problems?

Anyway, thanks a lot for your help. Much appreciated.

Hi @marcuschristopher,

Based on all the good troubleshooting so far, I’d suggest you reach out to MSI support engineers directly about the use and configuration of their Steel Armor slot, via the MSI Support Center.