NVRM: GPU 0000:01:00.0: RmInitAdapter failed! - driver 530.30.02

Hello,

I have an AMD host system with one H100 on processor 1 and two A100s on processor 2. After a random interval, measured in days or a few weeks, one and eventually both A100s vanish from nvidia-smi and from CUDA API visibility, and the system dmesg log fills up with

[865094.718348] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2356)
[865094.720365] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
repeating every few seconds. I’ve tried every recent version of the drivers, and none of them stops this from happening.
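For anyone who wants to track how often this occurs and on which GPU, here is a rough Python sketch that pulls the RmInitAdapter failures out of dmesg text. The regular expression is based only on the two lines quoted above, so other driver versions may format the message differently; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches lines shaped like the one quoted in this post:
# [865094.718348] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2356)
# Assumption: the message layout is the same on other driver versions.
PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: GPU "
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]): "
    r"RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def find_init_failures(log_text):
    """Return (timestamp, pci_address, code) for each RmInitAdapter failure line."""
    failures = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            failures.append(
                (float(match.group("ts")), match.group("bdf"), match.group("code"))
            )
    return failures


def count_by_gpu(failures):
    """Count failures per PCI address."""
    return Counter(bdf for _, bdf, _ in failures)


# The two log lines from this post, used as sample input.
SAMPLE = (
    "[865094.718348] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2356)\n"
    "[865094.720365] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0\n"
)

if __name__ == "__main__":
    for ts, bdf, code in find_init_failures(SAMPLE):
        print(f"{ts:.6f} {bdf} {code}")
```

To use it on a live system, pass the saved output of dmesg to find_init_failures instead of SAMPLE.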

It only happens to the A100s, not the H100. They are still visible in lspci, but nothing short of completely rebooting the system restores them - reloading the driver, nvidia-smi -r, and various PCIe device reset requests have no effect. I am attaching the nvidia-bug-report output…

nvidia-bug-report.log.gz (2.5 MB)

Looks like the GSP is hanging. Please make sure nvidia-persistenced is running. Another option might be disabling the GSP firmware, but I think this is not possible anymore.
Did this also happen with an earlier driver?
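A quick way to confirm whether the daemon is actually up is to look for it in /proc. The following is only a sketch: it assumes a Linux /proc layout and that the process's argv[0] ends in nvidia-persistenced. The helper names are hypothetical.

```python
import os


def cmdline_matches(cmdline: bytes, name: str) -> bool:
    """True if argv[0] in a raw /proc/<pid>/cmdline blob has basename `name`."""
    argv0 = cmdline.split(b"\0", 1)[0].decode(errors="replace")
    return os.path.basename(argv0) == name


def process_running(name: str, proc_root: str = "/proc") -> bool:
    """Scan proc_root for a process whose argv[0] basename equals `name`."""
    if not os.path.isdir(proc_root):
        return False  # not Linux, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                if cmdline_matches(handle.read(), name):
                    return True
        except OSError:
            continue  # the process exited or is not readable
    return False


if __name__ == "__main__":
    state = "running" if process_running("nvidia-persistenced") else "NOT running"
    print(f"nvidia-persistenced is {state}")
```

This reads the cmdline file rather than comm because comm is truncated to 15 characters, which is shorter than the daemon's name.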

nvidia-persistenced should be running at boot, but it apparently has failed to do so…

This started occurring after the H100 was installed, which coincided with the installation of the CUDA 12.0 driver. Updating to the drivers that come with 12.0.1 and now 12.1 has had no effect. It did not occur with 11.7 or earlier.