Out of memory issue after installing nvidia driver

Hi NVIDIA team,

I have been running two GeForce RTX 4060 Ti cards in a server for deep learning for 4 months. When I tried to use both GPUs in one Docker container, an “end kernel panic” error occurred. Since it happens at boot, it is hard for me to troubleshoot on my end.

Here are the actions that I took so far.

  1. Selected “advanced options for ubuntu” in the grub menu and chose 5.15.0-119 (recovery mode) → Issue reappears

  2. After formatting, attempted the following steps → Issue reappears

  • sudo apt-get install ubuntu-drivers-common
  • ubuntu-drivers devices | grep recommended
  • sudo apt-get install nvidia-driver-550
  • sudo reboot now

  3. Upon reboot, selected the “ubuntu (default)” kernel instead of “advanced options for ubuntu” in the grub menu
  • Initially it worked fine, but the issue reappeared after updating the NVIDIA driver.

  4. Mounted ubuntu--vg-ubuntu--lv from a live boot USB and tried removing the NVIDIA driver after the issue occurred → Issue reappears

It would be appreciated if I could be advised. Thank you!

Updates

I managed to resolve the issue by installing a different version of the NVIDIA driver. However, it’s still puzzling because the previous driver version should have been compatible with my GPU (RTX 4060 Ti), according to the NVIDIA driver compatibility check website: NVIDIA Driver Downloads.

Below is the output of the ubuntu-drivers devices command, which lists the available drivers. I initially chose the nvidia-driver-550 as it was marked as the “distro non-free recommended” option:

driver   : nvidia-driver-550 - distro non-free recommended
driver   : nvidia-driver-535-server-open - distro non-free
driver   : nvidia-driver-550-open - distro non-free
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-535 - distro non-free
driver   : nvidia-driver-545-open - distro non-free
driver   : nvidia-driver-545 - distro non-free
driver   : nvidia-driver-535-open - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

Although nvidia-driver-550 was recommended, the error persisted, so I switched to nvidia-driver-550-open. This resolved the issue: the error disappeared and nvidia-smi worked as expected.

Interestingly, I discovered that the NVIDIA Driver Version 550.144.03 was released about a week ago, which coincides with when the error began. This makes me wonder if the issue was due to a bug in the newer release or some incompatibility specific to my setup.
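
For anyone who wants to check the same timing on their own machine, here is a minimal sketch of how the install and upgrade dates could be looked up. It assumes the default dpkg log location (/var/log/dpkg.log) and that the driver packages are named nvidia-driver-*; the function name nvidia_driver_history is just something I made up for illustration.

```shell
# Print dpkg log entries where an NVIDIA driver package was installed or upgraded.
# Usage: nvidia_driver_history [logfile]
# The default path is an assumption (the standard Ubuntu location).
nvidia_driver_history() {
    logfile="${1:-/var/log/dpkg.log}"
    # dpkg logs one action per line, e.g. "<date> <time> upgrade <pkg>:<arch> <old> <new>"
    grep -E ' (install|upgrade) nvidia-driver-' "$logfile"
}
```

Calling nvidia_driver_history with no argument reads the current log; older entries may have been rotated into /var/log/dpkg.log.1 or compressed files. While the driver is loaded, the version of the running kernel module can also be read from /proc/driver/nvidia/version.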

How can I ensure this doesn’t happen again? I’m concerned that even with nvidia-driver-550-open, the error might reappear in the future. Any advice or insights on why this occurred and how to avoid similar issues in the future would be greatly appreciated.
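
One thing I am considering in the meantime, as a rough sketch rather than a confirmed fix: putting the working driver packages on hold with apt-mark so that a routine upgrade does not replace them. The name pattern below is my own assumption about how the NVIDIA packages are named on my system, and both function names are made up for illustration.

```shell
# Read package names on stdin and keep those that look like NVIDIA driver
# components. The pattern is an assumption; review the output before holding.
filter_nvidia_names() {
    grep -E '^(lib)?nvidia-'
}

# Put every matching package known to dpkg on hold so apt upgrades skip it.
# Note: dpkg-query may also list removed-but-not-purged packages.
hold_nvidia_packages() {
    dpkg-query -W -f='${Package}\n' | filter_nvidia_names | xargs -r sudo apt-mark hold
}
```

After running hold_nvidia_packages, apt-mark showhold should list the held packages, and sudo apt-mark unhold with the same names reverses it. The obvious trade-off is that held packages also stop receiving driver fixes until they are released again.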