RMInitAdapter Failed

Hi,
Our software is deployed on hosts around the world, running a variety of NVIDIA GPUs and older NVIDIA drivers. Below are the GPU details from the hosts where we hit the RMInitAdapter Failed error in the last two weeks.

name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current
GeForce RTX 3070, 00000000:01:00.0, 460.39, P0, 3, 3
Quadro RTX 4000, 00000000:01:00.0, 460.8, P0, 3, 3
Quadro RTX 4000, 00000000:01:00.0, 460.73.01, P0, 3, 3
Quadro RTX 4000, 00000000:01:00.0, 460.8, P0, 3, 3
GeForce RTX 3060 Ti, 00000000:01:00.0, 460.8, P2, 3, 3
Quadro RTX 4000, 00000000:01:00.0, 460.8, P0, 3, 3
NVIDIA GeForce RTX 2070 SUPER, 00000000:01:00.0, 470.129.06, P0, 3, 3
Quadro RTX 4000, 00000000:01:00.0, 460.8, P0, 3, 3
GeForce RTX 2070 SUPER, 00000000:01:00.0, 460.8, P2, 3, 3
Quadro RTX 4000, 00000000:01:00.0, 470.103.01, P0, 3, 3
Quadro RTX 4000, 00000000:01:00.0, 460.91.03, P0, 3, 3
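
For anyone wanting to summarize a list like this across many hosts, here is a rough Python sketch that tallies the failing hosts per GPU model and driver version. The header row above matches what nvidia-smi prints for `--query-gpu=... --format=csv`, so the same text could be collected that way on each host. The function name `tally_failures` and the `SAMPLE` constant are made up for illustration, and the sample holds only the first four rows from the list above.

```python
import csv
import io
from collections import Counter

# Field list taken from the header row in the post. On a live host the same
# CSV could be produced with:
#   nvidia-smi --query-gpu=<QUERY_FIELDS> --format=csv
QUERY_FIELDS = (
    "name,pci.bus_id,driver_version,pstate,"
    "pcie.link.gen.max,pcie.link.gen.current"
)

# First four data rows from the post, used here only as sample input.
SAMPLE = """name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current
GeForce RTX 3070, 00000000:01:00.0,460.39, P0,3,3
Quadro RTX 4000, 00000000:01:00.0,460.8, P0,3,3
Quadro RTX 4000, 00000000:01:00.0, 460.73.01, P0,3,3
Quadro RTX 4000, 00000000:01:00.0,460.8, P0,3,3
"""


def tally_failures(csv_text):
    """Count rows per (GPU name, driver version) pair (hypothetical helper)."""
    # skipinitialspace drops the blank that follows some of the commas.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = [field.strip() for field in next(reader)]
    name_idx = header.index("name")
    driver_idx = header.index("driver_version")

    counts = Counter()
    for row in reader:
        if not row:  # ignore empty lines
            continue
        counts[(row[name_idx].strip(), row[driver_idx].strip())] += 1
    return counts


if __name__ == "__main__":
    for (gpu, driver), n in tally_failures(SAMPLE).most_common():
        print(f"{gpu} / {driver}: {n}")
```

A tally like this only shows which model and driver combinations appear among the failing hosts; it says nothing about the cause unless it is compared with the same breakdown for healthy hosts.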

For now, I assume the best course of action is to upgrade the NVIDIA drivers to the latest stable version. However, that upgrade would be quite costly given the number of clients we have (a few thousand servers). That is why I want to be sure whether an upgrade is necessary and, if so, whether the latest version will work for all of these GPU models.

Below I have attached the nvidia-bug-report from one of the client hosts. I have changed the machine's hostname in the log file to UBUNTU:
0247_20220907_nvidia-bug-report-HOSTNAMECHANGEDlog.gz (1.9 MB)

Thank you.

The specific error codes in RmInitAdapter failed! (0x24:0xffff:1248) point to a hardware issue rather than a driver issue. It might just be a temporary failure that a simple reboot fixes. In that case, make sure the nvidia-persistenced daemon is started on boot and keeps running.
If that doesn’t help, I guess the GPU is broken and needs to be replaced. A driver update won’t help.
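
To check both points across a fleet, something like the following Python sketch could help: it pulls the status codes out of any RmInitAdapter failed! lines in a kernel log and asks systemd whether the persistence daemon is running. The helper names are invented for illustration, the unit name `nvidia-persistenced` is an assumption that depends on how the driver was packaged, and the `NVRM: GPU 0000:01:00.0:` prefix in the sample line is only illustrative. The status triple itself is the one quoted above.

```python
import re
import subprocess

# Captures the three status fields from a line such as
#   NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1248)
RM_INIT_FAILED = re.compile(
    r"RmInitAdapter failed! \((0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(\d+)\)"
)


def find_rm_init_failures(log_text):
    """Return one (code, subcode, number) tuple of strings per failure line."""
    return [match.groups() for match in RM_INIT_FAILED.finditer(log_text)]


def persistenced_is_active(unit="nvidia-persistenced"):
    """Return True if systemd reports the unit as active.

    The unit name is an assumption; returns False when systemctl is missing.
    """
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", unit], check=False
        )
    except FileNotFoundError:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Illustrative log line; on a real host the text could come from dmesg
    # or from the attached nvidia-bug-report log.
    sample = "NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1248)"
    print(find_rm_init_failures(sample))
    print("persistenced active:", persistenced_is_active())
```

If the unit exists but is not enabled, `systemctl enable --now` on that unit would start it immediately and on every boot, again assuming the packaging ships it under that name.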
