NVRM: GPU 0000:01:00.0: RmInitAdapter failed! - driver 530.30.02

Hello,

I have an AMD host system with one H100 attached to processor 1 and two A100s attached to processor 2. After a seemingly random interval of days to a few weeks, first one and eventually both A100s vanish from nvidia-smi and from CUDA API visibility, and the system dmesg log fills up with

[865094.718348] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2356)
[865094.720365] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
repeating every few seconds. I’ve tried every recent driver release and none of them stop this from happening.

It only happens to the A100s, never the H100. They are still visible in lspci, but nothing short of completely rebooting the system restores them - reloading the driver, nvidia-smi -r, and various PCIe device reset requests have no effect. I am attaching the nvidia-bug-report output…

nvidia-bug-report.log.gz (2.5 MB)
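
For completeness, the recovery attempts that had no effect were along these lines (the BDF and GPU index are just examples from my system):

    sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia && sudo modprobe nvidia   # reload the driver stack
    sudo nvidia-smi --gpu-reset -i 1                                                 # driver-level GPU reset
    echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.0/remove                       # drop the device from the PCIe bus
    echo 1 | sudo tee /sys/bus/pci/rescan                                            # and rescan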

Looks like the GSP is hanging. Please make sure nvidia-persistenced is running. Another option might be disabling the GSP firmware, but I think that is no longer possible.
Did this also happen with an earlier driver?
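
In case you want to test it: whether the GSP firmware is active shows up in nvidia-smi -q, and on driver/GPU combinations where the module option is still honored, disabling it would look roughly like this:

    nvidia-smi -q | grep -i gsp                      # a GSP firmware version listed here means GSP is in use
    echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee /etc/modprobe.d/nvidia-gsp.conf
    sudo update-initramfs -u                         # or dracut --force on RHEL-like distros, then reboot

As far as I know the H100 requires the GSP firmware, so this would only ever apply to the A100s.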

nvidia-persistenced should be running at boot, but it apparently has failed to do so…
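
Checking it now with systemd (assuming the standard packaged unit name):

    systemctl status nvidia-persistenced             # is it active, and since when
    journalctl -u nvidia-persistenced -b             # why it failed at boot, if it did
    sudo systemctl enable --now nvidia-persistenced  # make sure it starts on every boot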

This started occurring after the H100 was installed, which coincided with the installation of the CUDA 12.0 driver. Updating to the drivers that come with 12.0.1 and now 12.1 has had no effect. It did not occur with 11.7 or earlier.

Hi, any idea how to solve this?

Any update on this? My nvidia-persistenced is running, but I’m still getting this error.

How are the A100/H100 physically installed in the system? If there are external PCIe cables involved, I suspect bad or substandard-quality cables. PCIe cables are very sensitive if the underlying AWG wire is not well insulated or thick enough.

Bad cables will allow the system to boot and see the GPU, but once you start moving 10+ GB/s over them, they will likely cause the GPU to drop off the PCIe bus and present other errors.
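
A quick sanity check for a marginal link is to compare the negotiated speed/width against what the slot and card are capable of, and to look for AER messages (the BDF below is just an example):

    sudo lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkSta'   # LnkSta downgraded vs. LnkCap hints at a bad link
    dmesg | grep -iE 'aer|pcieport|corrected'              # correctable/uncorrectable PCIe errors

A link that keeps logging corrected errors or renegotiates down to a lower speed under load usually points at cabling or connectors.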

Bit of an old question, but we ultimately resolved this by moving the A100s to another system. The server in question is now fine with just the H100.

As for bad cabling, the system is a SuperMicro AS-4124. I’ve never gotten anything from SM that I would describe as substandard quality - sometimes overdesigned, occasionally in a downright stupid way, yes… But in any case, the 4124 does have a separate PCIe backplane with lanes running to the motherboard over a PCIe 4.0 x16 cable per slot.

This might be a hint, especially in conjunction with the GSP:
https://forums.developer.nvidia.com/t/mixed-physical-gpu-support/283861