deviceQuery invalid ordinal (RHEL 8) - 2 defunct GPUs?

Hi, I just switched a remote box from Windows to RHEL 8.5. I had CUDA 11.5 installed and working on Windows on the same box. There are 8 GPUs, but 2 of them have had a power-cable issue since the machine was built a few years ago. The other 6 worked fine on Windows.

$ lspci | grep -i nvidia
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
08:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
09:00.0 VGA compatible controller: NVIDIA Corporation GV100GL [Quadro GV100] (rev a1)
09:00.1 Audio device: NVIDIA Corporation Device 10f2 (rev a1)
85:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
85:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
86:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
89:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
89:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
8a:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
8a:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)

After installing the driver via the kernel module method:

$ lsmod | grep -i nvidia
nvidia_drm             65536  18
nvidia_modeset       1146880  5 nvidia_drm
nvidia_uvm           1159168  0
nvidia              36892672  310 nvidia_uvm,nvidia_modeset
drm_kms_helper        253952  5 drm_vram_helper,ast,nvidia_drm
drm                   573440  28 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,nvidia_drm,ttm

$ lsmod | grep -i nouveau

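For completeness, the loaded driver build can also be confirmed through the proc interface (the path is my understanding of where the driver exposes this; the fallback covers a machine where the driver is absent):

```shell
# Confirm which driver build the kernel actually loaded (and, together
# with the empty nouveau grep above, that nouveau is out of the picture).
cat /proc/driver/nvidia/version 2>/dev/null || echo "nvidia driver not loaded"
```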
After installing CUDA 11.5 through the package manager method from the installation guide, nvidia-smi works and shows the 6 GPUs:

$ nvidia-smi

Mon Dec 27 10:33:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   15C    P8    10W / 250W |     16MiB / 12195MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN Xp     Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   19C    P8     7W / 250W |      2MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN Xp     Off  | 00000000:08:00.0 Off |                  N/A |
| 23%   22C    P8     8W / 250W |      2MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro GV100        Off  | 00000000:09:00.0 Off |                  Off |
| 29%   28C    P2    23W / 250W |      1MiB / 32508MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA TITAN Xp     Off  | 00000000:85:00.0 Off |                  N/A |
| 23%   16C    P8     8W / 250W |      2MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA TITAN Xp     Off  | 00000000:86:00.0 Off |                  N/A |
| 23%   16C    P8     7W / 250W |      2MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3293      G   /usr/libexec/Xorg                   9MiB |
|    0   N/A  N/A      5002      G   /usr/bin/gnome-shell                4MiB |
+-----------------------------------------------------------------------------+

Here is the topology matrix for the 6 working GPUs:

$ nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    CPU Affinity    NUMA Affinity
GPU0     X      PIX     PHB     PHB     SYS     SYS     0-15,32-47      0
GPU1    PIX      X      PHB     PHB     SYS     SYS     0-15,32-47      0
GPU2    PHB     PHB      X      PIX     SYS     SYS     0-15,32-47      0
GPU3    PHB     PHB     PIX      X      SYS     SYS     0-15,32-47      0
GPU4    SYS     SYS     SYS     SYS      X      PIX     16-31,48-63     1
GPU5    SYS     SYS     SYS     SYS     PIX      X      16-31,48-63     1

dmesg confirms that there is probably an issue with the power cables of the 2 missing GPUs:

$ dmesg | grep -i nvrm

[332678.090611] NVRM: GPU 0000:89:00.0: rm_init_adapter failed, device minor number 6
[332678.610732] NVRM: GPU 0000:89:00.0: GPU does not have the necessary power cables connected.
[332678.611535] NVRM: GPU 0000:89:00.0: RmInitAdapter failed! (0x24:0x1c:1433)
[332678.611588] NVRM: GPU 0000:89:00.0: rm_init_adapter failed, device minor number 6
[332679.123623] NVRM: GPU 0000:8a:00.0: GPU does not have the necessary power cables connected.
[332679.124491] NVRM: GPU 0000:8a:00.0: RmInitAdapter failed! (0x24:0x1c:1433)
[332679.124564] NVRM: GPU 0000:8a:00.0: rm_init_adapter failed, device minor number 7
[332679.630390] NVRM: GPU 0000:8a:00.0: GPU does not have the necessary power cables connected.
[332679.632674] NVRM: GPU 0000:8a:00.0: RmInitAdapter failed! (0x24:0x1c:1433)
[332679.632721] NVRM: GPU 0000:8a:00.0: rm_init_adapter failed, device minor number 7
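One idea I have been considering, based on my reading of the generic sysfs PCI interface (so treat the exact paths as an assumption on my part): unbind the two failed GPUs from the nvidia driver so the CUDA runtime never attempts to initialize them. A guarded sketch, using the bus IDs from the dmesg output:

```shell
# Sketch only: detach a failed GPU from the nvidia driver via the
# standard sysfs unbind interface. Requires root on the real box.
unbind_gpu() {
    addr="$1"
    f=/sys/bus/pci/drivers/nvidia/unbind
    if [ -w "$f" ]; then
        # Real unbind: the driver releases the device at this address.
        printf '%s' "$addr" > "$f"
    else
        # Dry run (no writable unbind file, e.g. not root / no driver).
        echo "would unbind $addr"
    fi
}
unbind_gpu 0000:89:00.0
unbind_gpu 0000:8a:00.0
```

I have not verified that the runtime stops counting an unbound device, so this is a guess at a remote-friendly workaround rather than a known fix.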

These 6 GPUs did work on the Windows installation, even after CUDA was updated several times. However, I cannot get the deviceQuery sample to pass:

$ ./deviceQuery 

Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 101
-> invalid device ordinal
Result = FAIL
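In case it helps, here is a quick sanity check that can be run after the failure; my expectation about which minor numbers should exist is an assumption based on the dmesg output:

```shell
# Which NVIDIA device nodes did the driver create? With 6 initialized
# GPUs I'd expect /dev/nvidia0..5 plus /dev/nvidiactl; the dmesg log
# says minors 6 and 7 belong to the failed cards.
ls -l /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* device nodes"

# What the driver itself enumerates (the UUIDs are also useful for
# pinning CUDA_VISIBLE_DEVICES to specific cards):
nvidia-smi -L 2>/dev/null || echo "nvidia-smi unavailable"
```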

So far I have tried

  1. Disabling the two non-working GPUs through environment variables; no effect.
  2. Draining and removing them with nvidia-smi; it could not establish communication with the two non-working GPUs.
  3. Disabling their slots explicitly in the BIOS; no effect - they still show up in lspci.
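For item 1, this is the sort of thing I mean; the exact interaction of these variables with partially-failed GPUs is my assumption from the CUDA environment-variable docs, not something I have confirmed:

```shell
# Restrict the CUDA runtime to the six working GPUs before launching
# deviceQuery.
export CUDA_DEVICE_ORDER=PCI_BUS_ID   # make runtime indices follow bus IDs
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5

# CUDA_VISIBLE_DEVICES also accepts GPU UUIDs from `nvidia-smi -L`,
# which removes any index-ordering ambiguity, e.g.:
#   export CUDA_VISIBLE_DEVICES=GPU-xxxx,GPU-yyyy,...
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```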

Could the power issue with 2 of the GPUs be causing this? Our next move is to fix the power issue or remove the two cards; however, as I said, this machine is remote and it is a bit of a pain to get to.

Given that the other 6 worked fine in Windows, is there something else I am missing that might work without having to physically access the box?