We have two identical machines, both running Ubuntu 20.04 and each equipped with four Nvidia Tesla T4 cards. We installed nvidia-driver-470
from the package repository. One machine works fine, but on the other machine nvidia-smi
reports only two of the four T4s:
root@iocr-gpu-2a:~# nvidia-smi
Wed Dec 22 19:57:20 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:18:00.0 Off |                    0 |
| N/A   49C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   45C    P0    28W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Four cards are installed:
root@iocr-gpu-2a:~# lspci | grep -i nv
18:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
3b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
86:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
The kernel log shows the following errors for the two missing cards (3b:00.0 and af:00.0):
root@iocr-gpu-2a:~# dmesg | grep NVRM
[ 12.165630] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.86 Tue Oct 26 21:55:45 UTC 2021
[ 92.177385] NVRM: GPU 0000:3b:00.0: RmInitAdapter failed! (0x25:0xffff:1250)
[ 92.177551] NVRM: GPU 0000:3b:00.0: rm_init_adapter failed, device minor number 1
[ 92.718041] NVRM: GPU 0000:3b:00.0: RmInitAdapter failed! (0x25:0xffff:1250)
[ 92.718200] NVRM: GPU 0000:3b:00.0: rm_init_adapter failed, device minor number 1
[ 94.194086] NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x25:0xffff:1250)
[ 94.194219] NVRM: GPU 0000:af:00.0: rm_init_adapter failed, device minor number 3
[ 94.733179] NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x25:0xffff:1250)
[ 94.733297] NVRM: GPU 0000:af:00.0: rm_init_adapter failed, device minor number 3
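To pull just the affected PCI addresses out of the kernel log, a small filter along these lines may help. This is only a sketch: `failed_gpus` is a hypothetical helper name, and it assumes the log lines have the same "NVRM: GPU <address>: RmInitAdapter failed!" shape as the output above.

```shell
# Hypothetical helper: print each PCI address that logged "RmInitAdapter failed!".
# Reads kernel log text on stdin, so it works on live dmesg output or a saved log.
failed_gpus() {
    # Keep only the address between "NVRM: GPU " and ": RmInitAdapter failed",
    # then collapse repeated messages for the same device.
    sed -n 's/.*NVRM: GPU \([0-9a-fA-F:.]*\): RmInitAdapter failed.*/\1/p' | sort -u
}

# Usage on the affected machine (requires permission to read the kernel log):
#   dmesg | failed_gpus
```

The printed addresses can then be compared by hand with the lspci listing and the Bus-Id column of nvidia-smi.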
I have not found a way to get all four cards working. On the other machine everything works as expected, without any "RmInitAdapter failed!" messages.
Does anybody have an idea?
Thanks a lot.
nvidia-bug-report.log.gz (1.4 MB)