Hi, I just switched a remote box from Windows to RHEL 8.5. I had CUDA 11.5 installed and working on Windows on the same box. There are 8 GPUs, but 2 of them have had a power cable issue since the machine was built a few years ago. The other 6 worked fine on Windows.
$ lspci | grep -i nvidia
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
08:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
09:00.0 VGA compatible controller: NVIDIA Corporation GV100GL [Quadro GV100] (rev a1)
09:00.1 Audio device: NVIDIA Corporation Device 10f2 (rev a1)
85:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
85:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
86:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
89:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
89:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
8a:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
8a:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
After installing the NVIDIA driver via the kernel module method:
$ lsmod | grep -i nvidia
nvidia_drm 65536 18
nvidia_modeset 1146880 5 nvidia_drm
nvidia_uvm 1159168 0
nvidia 36892672 310 nvidia_uvm,nvidia_modeset
drm_kms_helper 253952 5 drm_vram_helper,ast,nvidia_drm
drm 573440 28 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,nvidia_drm,ttm
$ lsmod | grep -i nouveau
(no output; nouveau is not loaded)
After installing CUDA 11.5 through the package manager method from the installation guide, nvidia-smi works and shows the 6 GPUs:
$ nvidia-smi
Mon Dec 27 10:33:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA TITAN Xp Off | 00000000:04:00.0 Off | N/A |
| 23% 15C P8 10W / 250W | 16MiB / 12195MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN Xp Off | 00000000:05:00.0 Off | N/A |
| 23% 19C P8 7W / 250W | 2MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN Xp Off | 00000000:08:00.0 Off | N/A |
| 23% 22C P8 8W / 250W | 2MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Quadro GV100 Off | 00000000:09:00.0 Off | Off |
| 29% 28C P2 23W / 250W | 1MiB / 32508MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA TITAN Xp Off | 00000000:85:00.0 Off | N/A |
| 23% 16C P8 8W / 250W | 2MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA TITAN Xp Off | 00000000:86:00.0 Off | N/A |
| 23% 16C P8 7W / 250W | 2MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3293 G /usr/libexec/Xorg 9MiB |
| 0 N/A N/A 5002 G /usr/bin/gnome-shell 4MiB |
+-----------------------------------------------------------------------------+
Here is the topology matrix for the 6 working GPUs:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 CPU Affinity NUMA Affinity
GPU0 X PIX PHB PHB SYS SYS 0-15,32-47 0
GPU1 PIX X PHB PHB SYS SYS 0-15,32-47 0
GPU2 PHB PHB X PIX SYS SYS 0-15,32-47 0
GPU3 PHB PHB PIX X SYS SYS 0-15,32-47 0
GPU4 SYS SYS SYS SYS X PIX 16-31,48-63 1
GPU5 SYS SYS SYS SYS PIX X 16-31,48-63 1
dmesg shows that there is probably an issue with the power cables of the 2 missing GPUs:
$ dmesg | grep -i nvrm
[332678.090611] NVRM: GPU 0000:89:00.0: rm_init_adapter failed, device minor number 6
[332678.610732] NVRM: GPU 0000:89:00.0: GPU does not have the necessary power cables connected.
[332678.611535] NVRM: GPU 0000:89:00.0: RmInitAdapter failed! (0x24:0x1c:1433)
[332678.611588] NVRM: GPU 0000:89:00.0: rm_init_adapter failed, device minor number 6
[332679.123623] NVRM: GPU 0000:8a:00.0: GPU does not have the necessary power cables connected.
[332679.124491] NVRM: GPU 0000:8a:00.0: RmInitAdapter failed! (0x24:0x1c:1433)
[332679.124564] NVRM: GPU 0000:8a:00.0: rm_init_adapter failed, device minor number 7
[332679.630390] NVRM: GPU 0000:8a:00.0: GPU does not have the necessary power cables connected.
[332679.632674] NVRM: GPU 0000:8a:00.0: RmInitAdapter failed! (0x24:0x1c:1433)
[332679.632721] NVRM: GPU 0000:8a:00.0: rm_init_adapter failed, device minor number 7
As noted, these 6 GPUs did work on the Windows installation, even after CUDA was updated several times. However, I cannot get the deviceQuery sample to pass:
$ ./deviceQuery
Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 101
-> invalid device ordinal
Result = FAIL
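For reference, the failing call is just cudaGetDeviceCount. A minimal standalone check along these lines (a sketch against the CUDA runtime API, not the actual deviceQuery source) should reproduce the same error 101 / "invalid device ordinal":

// check_count.cu - minimal sketch; build with: nvcc check_count.cu -o check_count
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // deviceQuery reports 101 here, i.e. cudaErrorInvalidDevice ("invalid device ordinal")
        std::printf("cudaGetDeviceCount failed: %d (%s)\n", (int)err, cudaGetErrorString(err));
        return 1;
    }
    std::printf("cudaGetDeviceCount reports %d device(s)\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
            std::printf("  device %d: %s\n", i, prop.name);
    }
    return 0;
}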
So far I have tried:
- Disabling the two non-working GPUs through environment variables; no effect (see the sketch after this list).
- Draining and removing them with nvidia-smi; it could not establish communication with the two non-working GPUs.
- Disabling their slots explicitly in the BIOS; no effect, they still show up in lspci.
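For concreteness, the environment-variable approach in the first bullet was along these lines (assuming CUDA_VISIBLE_DEVICES is the right knob; the index list below is a placeholder for the 6 GPUs that nvidia-smi does enumerate):

// mask_devices.cu - sketch of masking devices before the first CUDA runtime call
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // Must be set before the first CUDA runtime call in this process;
    // "0,1,2,3,4,5" is a placeholder for the indices of the working GPUs.
    setenv("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5", 1);

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    std::printf("count=%d err=%d (%s)\n", count, (int)err, cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}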
Could the power issue with 2 of the GPUs be causing this? Our next move is to fix the power issue or remove those two cards, but as I said, this machine is remote and it is a bit of a pain to get to.
Given that the other 6 worked fine in Windows, is there something else I am missing that might work without having to physically access the box?