CUDA device not initialized error on all calls, HGX A100, CentOS 7

Hi,

I am attempting to set up an HGX A100 for use in a single-node Kubernetes cluster.
The issue I am stuck on is simply interacting with the GPUs from the host, ignoring Docker and Kubernetes.

I get a CUDA initialization error:

  • When running dcgmi diag -r 3: a variety of messages (attached) indicating there was a CUDA initialization error
  • When running the CUDA sample ./deviceQuery:
    cudaGetDeviceCount returned 3
    → initialization error
    Result = FAIL
  • When running pyopencl or another library that calls OpenCL: no platforms are detected (a one-liner reproducing this check is sketched just after this list)
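
For reference, this is roughly how the OpenCL symptom was observed (a sketch, assuming pyopencl is installed for the system python3; not the exact script we ran):

$ python3 -c "import pyopencl as cl; print(cl.get_platforms())"
(a working install should list the NVIDIA CUDA platform; here nothing was detected)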

This indicates that “the CUDA driver and runtime could not be initialized”, but I can’t see why that would be the case:

  • The drivers are all the same version, 460.106.00, installed using the yum package manager (a sketch of the version checks is included after this list).
  • Fabric Manager seems to be working.
  • We have restarted the host and disabled Docker in case of a conflict.
  • We have tried the 470 drivers as well, but had the same issue.
  • Initially we did not have Fabric Manager installed; installing it got us to this point.
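
For completeness, these are the kinds of checks behind the first two points (a sketch, not the exact commands we ran; package names assume the yum/RPM install):

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader    # driver version reported per GPU
$ cat /proc/driver/nvidia/version                                # version of the loaded kernel module
$ rpm -qa | grep -Ei 'nvidia-driver|fabricmanager'               # installed driver / fabric manager packages
$ systemctl status nvidia-fabricmanager                          # fabric manager service state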

The only oddity is that NVLink does not seem to be working; the output of dcgmi nvlink --link-status is below. But I don’t think this should be necessary?

+----------------------+
|  NvLink Link Status  |
+----------------------+
GPUs:
    gpuId 0:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 1:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 2:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 3:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 4:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 5:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 6:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 7:
        _ _ _ _ _ _ _ _ _ _ _ _
NvSwitches:
    physicalId 12:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 13:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 9:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 8:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 10:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 11:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X

Attached: the output of nvidia-bug-report (with the hostnames redacted), fabricmanager.log, and the output of dcgmi diag -r 3.

Help, I don’t have anything left to try!

diag-out.txt (11.2 KB)
fabricmanager.log (64.5 KB)
nvidia-bug-report-redacted.log.gz (3.0 MB)

edit: not sure the diagnostics attachment worked, so I’m including the output for GPUs 0 and 1 inline (the other GPUs show the same errors; trimmed for brevity)

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Deployment  --------+------------------------------------------------|
| Blacklist                 | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| pcie                      | Fail - All                                     |
| Warning                   | GPU 0: Error using CUDA API cudaDeviceGetByPC  |
|                           | IBusId 'initialization error' for GPU 0, bus   |
|                           | ID = 00000000:07:00.0                          |
| Warning                   | GPU 1: Error using CUDA API cudaDeviceGetByPC  |
|                           | IBusId 'initialization error' for GPU 0, bus   |
|                           | ID = 00000000:07:00.0                          |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Fail - All                                     |
| Warning                   | GPU 0: Error using CUDA API cuInit Unable to   |
|                           | initialize CUDA library: 'initialization erro  |
|                           | r'.; verify that the fabric-manager has been   |
|                           | started if applicable                          |
| Warning                   | GPU 1: Error using CUDA API cuInit Unable to   |
|                           | initialize CUDA library: 'initialization erro  |
|                           | r'.; verify that the fabric-manager has been   |
|                           | started if applicable                          |
| diagnostic                | Fail - All                                     |
| Warning                   | GPU 0: API call cudaGetDeviceCount failed for  |
|                           |  GPU 0: 'initialization error', GPU 0: There   |
|                           | was an internal error during the test: 'Faile  |
|                           | d to initialize the plugin.'                   |
| Warning                   | GPU 1: There was an internal error during the  |
|                           |  test: 'Failed to initialize the plugin.'      |
+-----  Stress  ------------+------------------------------------------------+
| sm_stress                 | Fail - All                                     |
| Warning                   | GPU 0: There was an internal error during the  |
|                           |  test: 'Couldn't initialize the plugin, pleas  |
|                           | e check the log file.', GPU 0: API call cudaG  |
|                           | etDeviceCount failed for GPU 0: 'initializati  |
|                           | on error', GPU 0: Error using CUDA API cudaDe  |
|                           | viceGetByPCIBusId 'initialization error' for   |
|                           | GPU 0, bus ID = 00000000:07:00.0               |
| Warning                   | GPU 1: There was an internal error during the  |
|                           |  test: 'Couldn't initialize the plugin, pleas  |
|                           | e check the log file.', GPU 1: Error using CU  |
|                           | DA API cudaDeviceGetByPCIBusId 'initializatio  |
|                           | n error' for GPU 0, bus ID = 00000000:07:00.0  |
| targeted_stress           | Fail - All                                     |
| Warning                   | GPU 0: API call cudaGetDeviceCount failed for  |
|                           |  GPU 0: 'initialization error'                 |
| targeted_power            | Pass - All                                     |
| memory_bandwidth          | Fail - All                                     |
| Warning                   | GPU 0: API call cuInit failed for GPU 0: 'ini  |
|                           | tialization error; verify that the fabric-man  |
|                           | ager has been started if applicable'           |
+---------------------------+------------------------------------------------+

Driver setup looks fine. Maybe a permissions problem; did you try running deviceQuery as root? Please post the output of
ls -l /dev/nv*

Thanks

ls -l /dev/nv*
crw-rw-rw- 1 root root 195,   0 Oct 28 15:35 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Oct 28 15:35 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Oct 28 15:35 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Oct 28 15:35 /dev/nvidia3
crw-rw-rw- 1 root root 195,   4 Oct 28 15:35 /dev/nvidia4
crw-rw-rw- 1 root root 195,   5 Oct 28 15:35 /dev/nvidia5
crw-rw-rw- 1 root root 195,   6 Oct 28 15:35 /dev/nvidia6
crw-rw-rw- 1 root root 195,   7 Oct 28 15:35 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Oct 28 15:35 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Oct 28 15:35 /dev/nvidia-modeset
crw-rw-rw- 1 root root 240,   0 Oct 28 15:35 /dev/nvidia-nvlink
crw-rw-rw- 1 root root 239,   0 Oct 28 15:35 /dev/nvidia-nvswitch0
crw-rw-rw- 1 root root 239,   1 Oct 28 15:35 /dev/nvidia-nvswitch1
crw-rw-rw- 1 root root 239,   2 Oct 28 15:35 /dev/nvidia-nvswitch2
crw-rw-rw- 1 root root 239,   3 Oct 28 15:35 /dev/nvidia-nvswitch3
crw-rw-rw- 1 root root 239,   4 Oct 28 15:35 /dev/nvidia-nvswitch4
crw-rw-rw- 1 root root 239,   5 Oct 28 15:35 /dev/nvidia-nvswitch5
crw-rw-rw- 1 root root 239, 255 Oct 28 15:35 /dev/nvidia-nvswitchctl
crw-rw-rw- 1 root root 236,   0 Oct 28 15:36 /dev/nvidia-uvm
crw-rw-rw- 1 root root 236,   1 Oct 28 15:36 /dev/nvidia-uvm-tools
crw------- 1 root root 245,   0 Oct 28 15:35 /dev/nvme0
brw-rw---- 1 root disk 259,   0 Oct 28 15:35 /dev/nvme0n1
brw-rw---- 1 root disk 259,   1 Oct 28 15:35 /dev/nvme0n1p1
brw-rw---- 1 root disk 259,   2 Oct 28 15:35 /dev/nvme0n1p2
brw-rw---- 1 root disk 259,   4 Oct 28 15:35 /dev/nvme0n1p3
brw-rw---- 1 root disk 259,   5 Oct 28 15:35 /dev/nvme0n1p4
crw------- 1 root root 245,   1 Oct 28 15:35 /dev/nvme1
brw-rw---- 1 root disk 259,   3 Oct 28 15:35 /dev/nvme1n1
brw-rw---- 1 root disk 259,   6 Oct 28 15:35 /dev/nvme1n1p1
brw-rw---- 1 root disk 259,   7 Oct 28 15:35 /dev/nvme1n1p2
brw-rw---- 1 root disk 259,   8 Oct 28 15:35 /dev/nvme1n1p3
brw-rw---- 1 root disk 259,   9 Oct 28 15:35 /dev/nvme1n1p4
crw------- 1 root root 245,   2 Oct 28 15:35 /dev/nvme2
brw-rw---- 1 root disk 259,  11 Oct 28 15:35 /dev/nvme2n1
crw------- 1 root root 245,   3 Oct 28 15:35 /dev/nvme3
brw-rw---- 1 root disk 259,  10 Oct 28 15:35 /dev/nvme3n1
crw------- 1 root root  10, 144 Oct 28 15:35 /dev/nvram

/dev/nvidia-caps:
total 0
cr-------- 1 root root 241, 0 Oct 28 15:35 nvidia-cap0
cr-------- 1 root root 241, 1 Oct 28 15:35 nvidia-cap1
cr--r--r-- 1 root root 241, 2 Oct 28 15:35 nvidia-cap2
[01/11 09:19:03] root@redactedhost release # ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL

Please check if the correct libcuda gets loaded and not some stray, incompatible one:
strace ./deviceQuery 2>&1 | grep libcuda

$ strace ./deviceQuery 2>&1 | grep libcuda
open("/lib64/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3

$ ls -lah /lib64/libcuda.so.1 
lrwxrwxrwx 1 root root 21 Oct 28 12:40 /lib64/libcuda.so.1 -> libcuda.so.460.106.00

Looks like the right version.
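
To rule out a second, stale copy of the driver library shadowing this one, we can also list every libcuda the system knows about (a sketch; paths may vary by install):

$ ldconfig -p | grep libcuda                              # all libcuda entries in the linker cache
$ find /usr /lib /lib64 -name 'libcuda.so*' 2>/dev/null   # stray copies left behind by older installs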

Is it possible that it’s something to do with the NVLinks/NVSwitches being down? The user guide suggests that they should be working: NVIDIA HGX A100 Software User Guide :: NVIDIA Tesla Documentation

But there’s no errors in the FM logs, it just declares

Dumping all the detected NVLink connections
   1166 [Oct 29 2021 16:51:45] [INFO] [tid 139758] Total number of NVLink connections:0

Our nvidia-smi topo for comparison to the above guide:

$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity
GPU0	 X 	PXB	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU1	PXB	 X 	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU2	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU3	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU4	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS	112-127,240-255	7
GPU5	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS	112-127,240-255	7
GPU6	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	80-95,208-223	5
GPU7	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	80-95,208-223	5

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Yes, this is really odd: FM detects all switches and links, then reports that there are none, without any further errors. Nevertheless, CUDA should work without them, AFAIK.
I’m quite out of ideas, as nothing obvious is reported. Maybe also ask in the CUDA forums:
https://forums.developer.nvidia.com/c/accelerated-computing/cuda/cuda-setup-and-installation/8


For the benefit of anyone who finds this topic in the future:

This was resolved by disabling MIG, which turned out to be enabled by default on this system, and by disabling the NVLinks.
I had seen this in the docs somewhere, but thought MIG was supposed to be off by default, and I didn’t spot in the nvidia-smi output that it wasn’t.
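
For reference, this is the kind of check and fix involved (a sketch, not the exact commands we ran; the GPUs must be idle, and the mode change only takes effect after a GPU reset or reboot):

$ nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv   # is MIG enabled anywhere?
$ sudo nvidia-smi -mig 0   # request MIG disabled on all GPUs (use -i <id> for one), then reset the GPUs or reboot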

Thanks for reporting back.
This is a really vicious pitfall, as MIG mode is only reported in the short nvidia-smi output, not in the long query output or in nvidia-bug-report.log. Glad you found it.
