NVRM: GPU 0000:01:00.0: RmInitAdapter failed! - driver 530.30.02

Hello,

I have an AMD host system with one H100 attached to processor 1 and two A100s attached to processor 2. After a seemingly random interval of days to a few weeks, first one and eventually both A100s vanish from nvidia-smi and from CUDA API visibility, and the system dmesg log fills up with

[865094.718348] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2356)
[865094.720365] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
repeating every few seconds. I’ve tried every recent driver release and none of them stop this from happening.

It only happens to the A100s, never the H100. They are still visible in lspci, but nothing short of completely rebooting the system restores them - reloading the driver, nvidia-smi -r, and various PCIe device reset requests have no effect. I am attaching the nvidia-bug-report output…

nvidia-bug-report.log.gz (2.5 MB)
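
For completeness, the recovery attempts that had no effect were along these lines (the BDF and GPU index are just examples from my system):

    sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia && sudo modprobe nvidia   # reload the driver stack
    sudo nvidia-smi --gpu-reset -i 1                                                 # driver-level GPU reset
    echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.0/remove                       # drop the device from the PCIe bus
    echo 1 | sudo tee /sys/bus/pci/rescan                                            # and rescan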

Looks like the GSP is hanging. Please make sure nvidia-persistenced is running. Another option might be disabling the GSP firmware, but I think that is no longer possible.
Did this also happen with an earlier driver?
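
In case you want to test it: whether the GSP firmware is active shows up in nvidia-smi -q, and on driver/GPU combinations where the module option is still honored, disabling it would look roughly like this:

    nvidia-smi -q | grep -i gsp                      # a GSP firmware version listed here means GSP is in use
    echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee /etc/modprobe.d/nvidia-gsp.conf
    sudo update-initramfs -u                         # or dracut --force on RHEL-like distros, then reboot

As far as I know the H100 requires the GSP firmware, so this would only ever apply to the A100s.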

nvidia-persistenced should be running at boot, but it apparently has failed to do so…
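
Checking it now with systemd (assuming the standard packaged unit name):

    systemctl status nvidia-persistenced             # is it active, and since when
    journalctl -u nvidia-persistenced -b             # why it failed at boot, if it did
    sudo systemctl enable --now nvidia-persistenced  # make sure it starts on every boot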

This started occurring after the H100 was installed, which coincided with the installation of the CUDA 12.0 driver. Updating to the drivers that come with 12.0.1 and now 12.1 has had no effect. It did not occur with 11.7 or earlier.

Hi, any idea how to solve this?

Any update on this? My nvidia-persistenced is running, but I’m still getting this error.

How are the A100/H100 physically installed in the system? If there are external PCIe cables involved, I suspect bad or substandard-quality cables. PCIe cables are very sensitive if the underlying AWG wire is not well insulated or thick enough.

Bad cables will allow the system to boot and see the GPU, but once you start moving 10+ GB/s over them, they will likely cause the GPU to drop off the PCIe bus and present other errors.
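
A quick sanity check for a marginal link is to compare the negotiated speed/width against what the slot and card are capable of, and to look for AER messages (the BDF below is just an example):

    sudo lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkSta'   # LnkSta downgraded vs. LnkCap hints at a bad link
    dmesg | grep -iE 'aer|pcieport|corrected'              # correctable/uncorrectable PCIe errors

A link that keeps logging corrected errors or renegotiates down to a lower speed under load usually points at cabling or connectors.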

Bit of an old question, but we ultimately resolved this by moving the A100s to another system. The server in question is now fine with just the H100.

As for bad cabling, the system is a SuperMicro AS-4124. I’ve never gotten anything from SM that I would describe as substandard quality - sometimes overdesigned, occasionally in a downright stupid way, yes… But in any case, the 4124 does have a separate PCIe backplane with lanes running to the motherboard over a PCIe 4.0 x16 cable per slot.

This might be a hint, especially in conjunction with the GSP:
https://forums.developer.nvidia.com/t/mixed-physical-gpu-support/283861