Installing CUDA 12 on RHEL 9.1

I’m running into some difficulty with the CUDA installation on RHEL 9.1. I have two RTX A4000s and am following the instructions at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#network-repo-installation-for-rhel-9-rocky-9. The installation goes as expected; the issue is with the post-installation checks. I ran the deviceQuery sample from the cuda-samples GitHub repo, https://github.com/nvidia/cuda-samples. It runs as expected for device 0, but hangs on device 1 and does not show any output for it. Running bandwidthTest from the same repo yields no errors.
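One way to narrow down a hang like this is to run the same sample with only one GPU visible at a time and put a timeout on it. The following is a minimal sketch, not a definitive diagnostic: `run_on_device` is a hypothetical helper name, and it assumes the sample binary honours the standard `CUDA_VISIBLE_DEVICES` environment variable. If the process is stuck inside the kernel driver it may not be killable, in which case the timeout itself can block.

```python
import os
import subprocess


def run_on_device(cmd, device_index, timeout_s=30):
    """Run cmd with a single GPU visible; return 'ok', 'failed', or 'timeout'.

    cmd is an argument list, e.g. ["./deviceQuery"] (path is an assumption).
    """
    # Expose only the requested GPU to the child process.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(device_index))
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The child did not finish in time, which is what a hang looks like.
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"
```

Calling `run_on_device(["./deviceQuery"], 0)` and then `run_on_device(["./deviceQuery"], 1)` would show whether the hang follows the second card even when it is the only visible device.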

As another data point, PyTorch sees both devices as available but does not run on them. It allocates a small amount of memory (< 1 GB) on both devices, but runs single-threaded on the CPU.

What is going on, and how can I fix the issues?

Thanks in advance

You’re probably going to have to provide more detail (lspci output, NVIDIA-related log messages) in order to get help.
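For the log messages, a rough sketch of one way to collect them: read the kernel log and keep only lines that mention the NVIDIA driver. This assumes a systemd-based system where `journalctl -k` is available (it may need root), and relies on the driver tagging its kernel messages with "NVRM" and reporting GPU errors as "Xid" events. The function names here are hypothetical.

```python
import re
import subprocess

# "NVRM" prefixes NVIDIA kernel driver messages; "Xid" marks reported GPU errors.
PATTERN = re.compile(r"NVRM|Xid|nvidia", re.IGNORECASE)


def nvidia_lines(log_text):
    """Return only the log lines that mention the NVIDIA driver."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]


def kernel_log():
    """Return the kernel log for the current boot as text (may need root)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Printing `"\n".join(nvidia_lines(kernel_log()))` right after reproducing the hang gives a short excerpt that is easy to paste into a reply.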

Here’s the output from lspci and from deviceQuery prior to hanging. I’m not certain about finding useful info in the log files, but am happy to provide it if you have a suggested place to look.

51:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
51:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
9c:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
9c:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)

Here’s the output from deviceQuery:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "NVIDIA RTX A4000"
  CUDA Driver Version / Runtime Version          12.1 / 12.0
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 16086 MBytes (16866869248 bytes)
  (048) Multiprocessors, (128) CUDA Cores/MP:    6144 CUDA Cores
  GPU Max Clock rate:                            1560 MHz (1.56 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                No
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 81 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

The other sample program in the cuda-samples repo, deviceQueryDrv, completes successfully and shows:


CUDA Device Query (Driver API) statically linked version 
Detected 2 CUDA Capable device(s)

Device 0: "NVIDIA RTX A4000"
  CUDA Driver Version:                           12.1
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 16086 MBytes (16866869248 bytes)
  (48) Multiprocessors, (128) CUDA Cores/MP:     6144 CUDA Cores
  GPU Max Clock rate:                            1560 MHz (1.56 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Max Texture Dimension Sizes                    1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)
  Texture alignment:                             512 bytes
  Maximum memory pitch:                          2147483647 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                No
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 81 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA RTX A4000"
  CUDA Driver Version:                           12.1
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 16108 MBytes (16890462208 bytes)
  (48) Multiprocessors, (128) CUDA Cores/MP:     6144 CUDA Cores
  GPU Max Clock rate:                            1560 MHz (1.56 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Max Texture Dimension Sizes                    1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)
  Texture alignment:                             512 bytes
  Maximum memory pitch:                          2147483647 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                No
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 156 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer-to-Peer (P2P) access from NVIDIA RTX A4000 (GPU0) -> NVIDIA RTX A4000 (GPU1) : Yes
> Peer-to-Peer (P2P) access from NVIDIA RTX A4000 (GPU1) -> NVIDIA RTX A4000 (GPU0) : Yes
Result = PASS
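A note for anyone matching these listings up: deviceQuery prints the PCI bus ID in decimal (81 and 156), while lspci and nvidia-smi print it in hexadecimal (51 and 9c). A small sketch of the conversion, with `lspci_bus_to_decimal` as a hypothetical helper name:

```python
def lspci_bus_to_decimal(address):
    """Convert the bus field of a PCI address such as '9c:00.0' to decimal.

    Also accepts addresses with a domain prefix, e.g. '00000000:9C:00.0'.
    """
    # The bus is the field immediately before the 'device.function' part.
    bus_hex = address.split(":")[-2]
    return int(bus_hex, 16)


print(lspci_bus_to_decimal("51:00.0"))           # 81
print(lspci_bus_to_decimal("00000000:9C:00.0"))  # 156
```

So device 0 is the card at 51:00.0 and device 1 is the card at 9c:00.0.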

Here’s nvidia-smi as well:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000                On | 00000000:51:00.0 Off |                  Off |
| 41%   36C    P8               17W / 140W|      0MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000                On | 00000000:9C:00.0 Off |                  Off |
| 41%   28C    P8                7W / 140W|      0MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Looking at the idle power draw on the second card, which is significantly lower than device 0’s, makes me wonder whether the second card has its 6-pin auxiliary power connector either not fitted, not fully seated, or otherwise compromised. The card will not function correctly without it.
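Since a single nvidia-smi snapshot can be misleading, the comparison can be repeated by querying only the power draw. A minimal sketch, assuming `nvidia-smi` is on the PATH and that `power.draw` returns a number for these cards (some GPUs report `[N/A]`, which this parser does not handle). The function names are hypothetical.

```python
import subprocess

# Ask nvidia-smi for just the GPU index and current power draw, as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_power(csv_text):
    """Map GPU index -> power draw in watts from nvidia-smi CSV output."""
    readings = {}
    for line in csv_text.strip().splitlines():
        index, watts = (field.strip() for field in line.split(","))
        readings[int(index)] = float(watts)
    return readings


def read_power():
    """Run nvidia-smi once and return the parsed readings."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_power(out)
```

Calling `read_power()` a few times while both cards are idle, and again while a workload is running, would show whether the gap between the two cards is consistent.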

I’ve checked the power connector and it appears to be seated correctly in the GPU. I haven’t gone so far as to check the cable itself.