I have an AMD host system with one H100 on processor 1 and two A100s on processor 2. After a random interval, anywhere from a few days to a few weeks, one and eventually both A100s vanish from nvidia-smi and from the CUDA API, and the system dmesg log fills up with
[865094.718348] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2356)
[865094.720365] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
repeating every few seconds. I’ve tried every recent version of the drivers, and none of them stops this from happening.
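For anyone trying to correlate these failures with other events, here is a minimal sketch of how the first failure per GPU could be pulled out of dmesg output. It assumes the lines look exactly like the two quoted above; the names `PATTERN` and `first_failures` are made up for illustration and are not part of any NVIDIA tool.

```python
import re

# Assumed line format, taken from the log excerpt above:
# [865094.718348] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2356)
PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: GPU "
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def first_failures(lines):
    """Return {pci_address: (first_timestamp_seconds, error_code)}.

    Timestamps are the bracketed dmesg values (seconds since boot).
    Only the first matching line per PCI address is recorded.
    """
    first = {}
    for line in lines:
        m = PATTERN.search(line)
        if m and m.group("bdf") not in first:
            first[m.group("bdf")] = (float(m.group("ts")), m.group("code"))
    return first


# Example using the two lines quoted in this post.
sample = [
    "[865094.718348] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2356)",
    "[865094.720365] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0",
]
for bdf, (ts, code) in first_failures(sample).items():
    print(f"{bdf}: first failure {ts:.6f}s after boot, code {code}")
```

In practice the saved dmesg output could be read from a file and passed in as a list of lines; the resulting seconds-since-boot value can then be compared against other log entries from around the same time.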
It only happens to the A100s, not the H100. They are still visible in lspci, but nothing short of completely rebooting the system restores them: reloading the driver, nvidia-smi -r, and various PCIe device reset requests have no effect. I am attaching the nvidia-bug-report output…
nvidia-bug-report.log.gz (2.5 MB)