NVRM: RmInitAdapter failed!

Hello,

We have three PCs: one with a 2080 Ti GPU and the other two with 3090 GPUs. All three are fine on a fresh CentOS 8.2 install, until I install CUDA or the GeForce RTX drivers from the NVIDIA website. Once I install those, the GPUs don’t always initialize during boot and give the following errors:

[ 1.591037] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.32.03 Sun Dec 27 19:00:34 UTC 2020
[ 15.124144] NVRM: GPU 0000:0a:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[ 15.124176] NVRM: GPU 0000:0a:00.0: rm_init_adapter failed, device minor number 0
[ 15.169570] NVRM: GPU 0000:0a:00.0: RmInitAdapter failed! (0x24:0xffff:1248)
[ 15.169588] NVRM: GPU 0000:0a:00.0: rm_init_adapter failed, device minor number 0
[ 24.491038] NVRM: GPU 0000:0a:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[ 24.491071] NVRM: GPU 0000:0a:00.0: rm_init_adapter failed, device minor number 0

It does not happen during every boot, but on average at least once every three reboots. I’ve exhausted all BIOS and kernel tweaks I could find. I tried Asus and Gigabyte motherboards and Asus and Gigabyte branded 2080 Ti GPUs. The problem persists across all.

Any suggestions?

Thank you,
Bart

nvidia-bug-report.log.gz (511.1 KB)

I am wondering how you established that they were fine without a driver installed. Are you saying that none of these GPUs works correctly for you, and that there is no pattern (i.e., the problem is not specific to a particular machine or a particular GPU) when you exchange the GPUs cyclically between machines?

The cards display without issues with the in-kernel drivers and no NVRM errors occur in dmesg during boot.

And yes, there is no particular pattern. The problem occurs with 3 different 2080 Ti cards and 2 different RTX 3090 cards. I also tried 3 Gigabyte motherboards and 1 Asus motherboard.

I also tried with Above 4G Decoding enabled and disabled in the BIOS.

I have never encountered a case like this, so I have no idea what could be going on. I assume you followed all the standard procedures, like blacklisting the Nouveau driver, etc.
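
For reference, this is roughly what that blacklisting usually looks like on CentOS/RHEL; the file name below is just the common convention, so adjust it to your setup:

# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# Rebuild the initramfs so nouveau is not pulled in early during boot, then reboot:
sudo dracut --force
sudo reboot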

You might want to consult a local expert who can take a look at these systems. There might be something about this situation that is not mentioned here because it did not seem relevant, but that would become obvious to someone physically in front of the machines.

I have an external RTX 3090 connected via Thunderbolt. With the standard drivers it doesn’t initialise at all, but there is a workaround.

The open kernel modules work fine with the RTX 3090, but they lack some power-saving features and have some problems with S3 sleep.

I run them with the options nvidia NVreg_OpenRmEnableUnsupportedGpus=1 kernel module parameter.
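
For anyone who wants to try this, a minimal sketch of how that parameter is typically set; the file name is just a convention, and the verification step assumes the nvidia module is already loaded:

# /etc/modprobe.d/nvidia.conf
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1

# Rebuild the initramfs so the option is applied at boot, then reboot:
sudo dracut --force
sudo reboot

# After reboot, the parameter value can be checked via sysfs (may require root):
sudo cat /sys/module/nvidia/parameters/NVreg_OpenRmEnableUnsupportedGpus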