I have been running two GeForce RTX 4060 Ti GPUs in a server for deep learning for four months. When I tried to use both GPUs in a single Docker container, an “end kernel panic” error occurred. Since it happens at boot, it is hard for me to troubleshoot on my end.
I managed to resolve the issue by installing a different version of the NVIDIA driver. However, it’s still puzzling because the previous driver version should have been compatible with my GPU (RTX 4060 Ti), according to the NVIDIA driver compatibility check website: NVIDIA Driver Downloads.
Below is the output of the ubuntu-drivers devices command, which lists the available drivers. I initially chose nvidia-driver-550 because it was marked as the “distro non-free recommended” option:
Although nvidia-driver-550 was the recommended option, the error persisted with it, so I switched to nvidia-driver-550-open. This resolved the issue: the error disappeared and nvidia-smi worked as expected.
Interestingly, I discovered that the NVIDIA Driver Version 550.144.03 was released about a week ago, which coincides with when the error began. This makes me wonder if the issue was due to a bug in the newer release or some incompatibility specific to my setup.
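To check which driver version each GPU actually reports (for example, before and after an update), here is a minimal sketch. It assumes nvidia-smi is on the PATH and that GPU names contain no commas. The helper names parse_gpu_rows and mismatched, and the idea of comparing against an expected version string, are illustrative assumptions and not part of any NVIDIA tool.

```python
import subprocess

# Query flags supported by nvidia-smi; noheader drops the CSV title row.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,driver_version",
    "--format=csv,noheader",
]


def parse_gpu_rows(text):
    """Parse 'index, name, driver_version' CSV lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip lines that do not match the expected shape
        index, name, version = parts
        rows.append({"index": int(index), "name": name, "driver_version": version})
    return rows


def mismatched(rows, expected_version):
    """Return the GPUs whose reported driver differs from expected_version."""
    return [r for r in rows if r["driver_version"] != expected_version]


if __name__ == "__main__":
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    rows = parse_gpu_rows(out)
    for row in rows:
        print(row)
    # Hypothetical expectation: the version mentioned in this post.
    for row in mismatched(rows, "550.144.03"):
        print("unexpected driver on GPU", row["index"], row["driver_version"])
```

Running this after each package upgrade would at least show whether the driver version changed around the time a problem started.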
How can I ensure this doesn’t happen again? I’m concerned that even with nvidia-driver-550-open, the error might reappear after a later update. Any advice or insights on why this occurred and how to avoid similar issues would be greatly appreciated.