Problem with A40, "No devices were found" and "rm_init_adapter failed"

Hi,

I’ve checked all related topics and other forums, and I believe I’m facing something new:

Background: I’ve got 8x A40 to upgrade my existing servers, which currently run 8x 1080 Ti. In a similar case, I’ve already upgraded 2 servers with 4x 3090 (because of the power cap, I didn’t put 8x 3090 in one server; instead I used only 4x 3090 to replace the 8x 1080 Ti), and they are doing just fine.
So I performed exactly the same steps with the new A40s; unfortunately, they are not working at all.

I looked into the behavior:

-# lspci |grep NV
06:00.0 3D controller: NVIDIA Corporation Device 2235 (rev a1)
07:00.0 3D controller: NVIDIA Corporation Device 2235 (rev a1)
0c:00.0 3D controller: NVIDIA Corporation Device 2235 (rev a1)
0f:00.0 3D controller: NVIDIA Corporation Device 2235 (rev a1)

-# nvidia-smi
No devices were found

-# dmesg | grep NVRM
[ 6.034028] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 465.19.01 Fri Mar 19 07:44:41 UTC 2021
[ 7.993656] NVRM: GPU 0000:06:00.0: RmInitAdapter failed! (0x23:0xffff:643)
[ 7.993810] NVRM: GPU 0000:06:00.0: rm_init_adapter failed, device minor number 0
[ 7.996737] NVRM: GPU 0000:06:00.0: RmInitAdapter failed! (0x23:0xffff:643)
[ 7.996880] NVRM: GPU 0000:06:00.0: rm_init_adapter failed, device minor number 0
[ 9.084417] NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x25:0xffff:1214)
[ 9.084617] NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 1
[ 9.917092] NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x25:0xffff:1214)
[ 9.917247] NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 1
[ 9.920750] NVRM: GPU 0000:0c:00.0: RmInitAdapter failed! (0x23:0xffff:643)
[ 9.920848] NVRM: GPU 0000:0c:00.0: rm_init_adapter failed, device minor number 2
[ 9.923084] NVRM: GPU 0000:0c:00.0: RmInitAdapter failed! (0x23:0xffff:643)
[ 9.923172] NVRM: GPU 0000:0c:00.0: rm_init_adapter failed, device minor number 2
[ 11.003783] NVRM: GPU 0000:0f:00.0: RmInitAdapter failed! (0x25:0xffff:1214)
[ 11.004058] NVRM: GPU 0000:0f:00.0: rm_init_adapter failed, device minor number 3
[ 11.847454] NVRM: GPU 0000:0f:00.0: RmInitAdapter failed! (0x25:0xffff:1214)
[ 11.847590] NVRM: GPU 0000:0f:00.0: rm_init_adapter failed, device minor number 3
[ 12.506560] NVRM: GPU 0000:06:00.0: RmInitAdapter failed! (0x23:0xffff:643)
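
To make the pattern in the log above easier to see, here is a minimal sketch that groups the RmInitAdapter failures by GPU address and error tuple. The function name `summarize_nvrm` is made up for illustration (it is not part of any NVIDIA tool), and it assumes the dmesg lines have the same layout as the ones shown above.

```shell
# Hypothetical helper: reads dmesg text on stdin and prints
# "<count> <pci-address> <error-tuple>" for each distinct RmInitAdapter failure.
summarize_nvrm() {
  awk '/RmInitAdapter failed!/ {
         # Find the "GPU" token; the next field is the PCI address plus a trailing colon.
         for (i = 1; i <= NF; i++) if ($i == "GPU") { addr = $(i + 1); break }
         sub(/:$/, "", addr)
         # The last field is the error tuple, e.g. (0x23:0xffff:643).
         count[addr " " $NF]++
       }
       END { for (k in count) print count[k], k }' | sort -k2
}

# Usage: dmesg | summarize_nvrm
```

For the log above, this groups the cards into two pairs: 06:00.0 and 0c:00.0 fail with (0x23:0xffff:643), while 07:00.0 and 0f:00.0 fail with (0x25:0xffff:1214).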

Then I tried:

Reinstalled the OS with the newest Ubuntu 20.04;
Added “mem_encrypt=off” to GRUB;
Tried different versions of the CUDA installation package: 10.2, 11.0, 11.1, 11.2;
Put one A40 into my desktop workstation (also Ubuntu 20.04, originally fitted with Titan Xp in SLI).
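
For anyone repeating the “mem_encrypt=off” step: on Ubuntu, a change to the GRUB configuration normally only reaches the kernel after regenerating the config and rebooting, so it is worth confirming that the running kernel actually received the parameter. Below is a minimal sketch of such a check. The helper name `has_kernel_param` is hypothetical, and it assumes the parameter appears as a separate space-delimited word on the kernel command line.

```shell
# Hypothetical helper: succeeds if the kernel command line contains the exact
# parameter given as the first argument. The command line is taken from the
# second argument if provided, otherwise from /proc/cmdline.
has_kernel_param() {
  param=$1
  cmdline=${2:-$(cat /proc/cmdline)}
  # Pad with spaces so only whole words match (mem_encrypt=off, not mem_encrypt=offx).
  case " $cmdline " in
    *" $param "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: has_kernel_param mem_encrypt=off && echo "mem_encrypt=off is active"
```

If the check fails after a reboot, the parameter was most likely never written into the generated GRUB config.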

None of these worked.

So I’m here to ask: is there any way to figure out what the issue is?

BTW, I tried unplugging one of the 8-pin connectors, and then the system could not find the A40 at all. I think that means my 8x A40 are not dead.

Thank you so much for your kind help!

Kun

No one?

nvidia-bug-report.log (1.9 MB)
Bug report