RmInitAdapter failed for two out of four Tesla T4s

We have two identical machines, both running Ubuntu 20.04, and each machine is equipped with four NVIDIA Tesla T4 cards. We installed nvidia-driver-470 from the package repository. One machine works fine, but on the other machine nvidia-smi only reports two of the four T4s:

root@iocr-gpu-2a:~# nvidia-smi
Wed Dec 22 19:57:20 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:18:00.0 Off |                    0 |
| N/A   49C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   45C    P0    28W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

All four cards show up in lspci:

root@iocr-gpu-2a:~# lspci | grep -i nv
18:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
3b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
86:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

For the two missing GPUs, dmesg shows the following errors:

root@iocr-gpu-2a:~# dmesg | grep NVRM
[   12.165630] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.86  Tue Oct 26 21:55:45 UTC 2021
[   92.177385] NVRM: GPU 0000:3b:00.0: RmInitAdapter failed! (0x25:0xffff:1250)
[   92.177551] NVRM: GPU 0000:3b:00.0: rm_init_adapter failed, device minor number 1
[   92.718041] NVRM: GPU 0000:3b:00.0: RmInitAdapter failed! (0x25:0xffff:1250)
[   92.718200] NVRM: GPU 0000:3b:00.0: rm_init_adapter failed, device minor number 1
[   94.194086] NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x25:0xffff:1250)
[   94.194219] NVRM: GPU 0000:af:00.0: rm_init_adapter failed, device minor number 3
[   94.733179] NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x25:0xffff:1250)
[   94.733297] NVRM: GPU 0000:af:00.0: rm_init_adapter failed, device minor number 3

I have not found a way to get all four cards working. On the other machine everything works as expected, without any RmInitAdapter failed errors.

Does anybody have an idea?

Thanks a lot.
nvidia-bug-report.log.gz (1.4 MB)

If the software and firmware setup is the same between the two machines, then it can only come down to hardware differences.
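As a quick sanity check (a rough sketch using standard tools, nothing NVIDIA-specific), you could compare kernel, driver package, and system BIOS versions on the two hosts:

uname -r                          # kernel version
dpkg -l | grep -i nvidia-driver   # installed NVIDIA driver packages
dmidecode -s bios-version         # system BIOS revision (run as root)

If all three match on both machines, hardware is the remaining suspect.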

You could try (systematically) swapping cards around. If the non-reporting cards from one machine are active and reporting in the other, you know the cards themselves are fine, which in turn would point to an issue with the server mainboard. If that's the case, contacting the supplier would be the next step (they should either replace it directly or pass it on to the manufacturer).

You should also check connections, adapter seating, and power cabling, and test the PSU to make sure it's supplying enough power (e.g. has one of your redundant PSUs failed?).
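Before opening the chassis, a non-invasive check is to look at the PCIe link status of the failing devices (bus addresses taken from your lspci output above); a degraded or untrained link would hint at seating or mainboard problems rather than at the cards themselves:

lspci -vvv -s 3b:00.0 | grep -i 'lnkcap\|lnksta'
lspci -vvv -s af:00.0 | grep -i 'lnkcap\|lnksta'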

Thanks a lot for your reply.

We did the check today and it seems that two of the four GPU cards are not recognized by the NVIDIA driver. All mainboard slots are fully functional, and each working card gets found regardless of the PCIe slot. The only thing we want to investigate before we contact our supplier is whether there is a difference between the firmware (Video BIOS) of the cards.
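For the two cards the driver does initialize, we can read the VBIOS version directly via nvidia-smi, e.g. (indices 0 and 1 are the GPUs nvidia-smi currently reports):

nvidia-smi -q -i 0 | grep -i vbios
nvidia-smi -q -i 1 | grep -i vbios

but that obviously does not work for the two cards the driver fails to bring up.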

I have found that HPE offers a flashing utility, but I am not sure if this is a wise course of action:
https://support.hpe.com/hpesc/public/swd/detail?swItemId=MTX-e3bb28a62d33469780d0167771

Pretty certain that would void any warranty the supplier provides. Given these must be new machines, I'd just contact the supplier and request a warranty replacement. :shrug:

With the driver not working on the broken cards, you can only check the VBIOS version using nvflash. The working ones both have VBIOS v90.04.38.00.03.
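Roughly like this (a sketch only; the exact option names can differ between nvflash releases, so check its --help output first):

./nvflash --list                  # enumerate adapters and their indices
./nvflash --version --index=1     # report the firmware version of adapter 1

nvflash talks to the adapter below the driver level, which is why it can still read the VBIOS when RmInitAdapter fails.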