GPU devices not found after kernel update on Ubuntu 20.04

Our GPU node runs Ubuntu 20.04 with 8x NVIDIA A100 GPUs. After a reboot that included a kernel update, the GPUs can no longer be found.

New kernel:
$ uname -ra
Linux gpu1-mat 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Purging old drivers and updating to the newest driver “nvidia-driver-515” did not help.

It seems that the hardware itself cannot be found:
$ sudo prime-select nvidia
Error: no integrated GPU detected.

$ sudo ls /dev/nvidia*
ls: cannot access ‘/dev/nvidia*’: No such file or directory

Neither lspci nor lshw finds any NVIDIA GPUs.
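
For anyone comparing against their own system, here is a minimal sketch of how the PCI-level check can be scripted, independent of the driver. It assumes `lspci` from pciutils is available and relies on `10de` being NVIDIA's PCI vendor ID. The helper name `count_nvidia_pci` is hypothetical and not part of any NVIDIA tool.

```shell
#!/bin/sh
# count_nvidia_pci: hypothetical helper, not part of any NVIDIA tooling.
# Reads `lspci -n` style output on stdin and prints how many devices
# carry the NVIDIA PCI vendor ID (10de).
count_nvidia_pci() {
    # `lspci -n` lines look like: "07:00.0 0302: 10de:20b0 (rev a1)"
    # The third whitespace-separated field is vendor:device.
    awk '$3 ~ /^10de:/ { n++ } END { print n + 0 }'
}

# Usage on a live system (requires pciutils):
#   lspci -n | count_nvidia_pci
```

A result of 0 means the devices are not enumerated on the PCI bus at all, which points at hardware, firmware, or BIOS settings rather than at the driver package. A non-zero count with a failing nvidia-smi points back at the driver or kernel module.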

Nvidia Bug Report:
nvidia-bug-report.log (5.1 MB)

Is more information needed? Any ideas on how to resolve this issue?

I guess the GPUs are sitting on an SXM expansion board, which went missing after the reboot (along with all the GPUs). Please contact Supermicro to find out how to get it going again, or whether it is broken.

Thanks for the quick response!

Hello @sebastian.schmoe @generix, have you resolved this?
I am facing a similar issue after a kernel update. My setup has multiple A100 cards, but I am able to see the GPUs via lshw. None of nvidia-driver-515, 510, 495, or 470 works; nvidia-smi shows no device found… Could you advise what we can try? Thank you.

If the GPUs are still visible, this is something completely different. Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
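
When the GPUs are visible on the bus but nvidia-smi still fails, one quick thing to check before collecting the log is whether the `nvidia` kernel module is loaded for the running kernel. The following is only a sketch under that assumption. The helper name `has_nvidia_module` is hypothetical, and it only inspects `lsmod` output, so it does not prove the driver is healthy.

```shell
#!/bin/sh
# has_nvidia_module: hypothetical helper, not part of the NVIDIA driver.
# Reads `lsmod` output on stdin and succeeds (exit 0) only if a module
# named exactly "nvidia" is listed. Related modules such as nvidia_uvm
# or nvidia_drm do not count on their own.
has_nvidia_module() {
    awk '$1 == "nvidia" { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage on a live system:
#   if lsmod | has_nvidia_module; then
#       echo "nvidia module is loaded"
#   else
#       echo "nvidia module is NOT loaded for kernel $(uname -r)"
#   fi
#
# Collecting the log itself (script shipped with the NVIDIA driver):
#   sudo nvidia-bug-report.sh
```

If the module is not loaded after a kernel update, a common cause is that it was never built for the new kernel. That is only one possibility, and the bug report log is still the authoritative source.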

@generix After looking deeper into the issue, it turned out to be a problem with our Slurm configuration. Thank you for your response; I appreciate your attention. :)