CUDA device not initialized error on all calls, HGX A100, Centos 7 (Crosspost from Linux Forum)

Original post here for context, as generix was unable to solve it: CUDA device not initialized error on all calls, HGX A100, Centos 7

Hi,

I am attempting to set up an HGX A100 for use in a single-node Kubernetes cluster.
The issue I am stuck on is simply interacting with the GPUs from the host, leaving Docker and Kubernetes out of the picture entirely.

I get a CUDA initialization error:

  • When running dcgmi diag -r 3: a variety of messages (attached) indicating a CUDA initialization error
  • When running the cuda-sample ./deviceQuery:
    deviceQuery:cudaGetDeviceCount returned 3
    → initialization error
    Result = FAIL
  • When running pyopencl or any other library that calls OpenCL: no platforms are detected

This indicates that there’s an issue because “the CUDA driver and runtime could not be initialized.”
But I can’t see why that would be the case:

  • The drivers are all the same version, 460.106.00, installed with the yum package manager.
  • Fabricmanager seems to be working.
  • We’ve restarted the host and disabled Docker in case of a conflict.
  • We have tried the 470 drivers as well, but had the same issue.
  • Initially we did not have fabricmanager installed; installing it got us to this point.
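
For anyone who wants to check the same thing below the CUDA runtime, here is a minimal sketch that asks the driver library directly what cuInit returns. It assumes the driver library is reachable under its usual Linux name libcuda.so.1; the function name probe_cuda_driver is only illustrative and not part of any NVIDIA tool. It reports a code and its symbolic name and does not diagnose anything further.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) through the CUDA driver API.

    Returns (code, name), or None if the driver library cannot be loaded.
    lib_name is an assumption: the usual soname of the driver library on Linux.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        # Library not found or not loadable on this machine.
        return None

    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    code = libcuda.cuInit(0)

    # Ask the driver for the symbolic name of the result code.
    libcuda.cuGetErrorName.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
    libcuda.cuGetErrorName.restype = ctypes.c_int
    name = ctypes.c_char_p()
    status = libcuda.cuGetErrorName(code, ctypes.byref(name))
    label = name.value.decode() if status == 0 and name.value else "unknown"
    return code, label
```

Calling print(probe_cuda_driver()) on the host prints either None (library not loadable) or a pair such as (0, 'CUDA_SUCCESS') when initialization works. A non-zero code can be compared against what deviceQuery reports.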

The only oddity is that NVLink does not seem to be working; the output of dcgmi nvlink --link-status is below. But I don’t think NVLink should be necessary for this to work?

+----------------------+
|  NvLink Link Status  |
+----------------------+
GPUs:
    gpuId 0:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 1:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 2:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 3:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 4:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 5:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 6:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 7:
        _ _ _ _ _ _ _ _ _ _ _ _
NvSwitches:
    physicalId 12:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 13:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 9:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 8:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 10:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 11:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
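
To make the table above easier to compare between driver versions, here is a minimal sketch that tallies the link states per device. It assumes the key that dcgmi normally prints with this command (U = up, D = down, X = disabled, _ = not supported), which is not included in the paste above, so treat that mapping as an assumption. The function name summarize_link_status is illustrative only.

```python
from collections import Counter

# Assumed meaning of each symbol; the key was not part of the pasted output.
STATE_NAMES = {"U": "up", "D": "down", "X": "disabled", "_": "not_supported"}


def summarize_link_status(text):
    """Count link states per GPU / NvSwitch in `dcgmi nvlink --link-status` output."""
    summary = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(("gpuId", "physicalId")) and line.endswith(":"):
            # Start of a new device entry, e.g. "gpuId 0:" or "physicalId 12:".
            current = line[:-1]
            summary[current] = Counter()
        elif current and line and all(tok in STATE_NAMES for tok in line.split()):
            # A row of link symbols belonging to the current device.
            summary[current].update(STATE_NAMES[tok] for tok in line.split())
    return summary
```

Fed the output above, every gpuId entry should come back with only not_supported links and every physicalId entry with only disabled links, if the assumed key is right.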

Attached the output of nvidia-bug-report, with the hostnames redacted.
Attached the fabricmanager.log
Attached also output of dcgmi diag -r 3

Help, I don’t have anything left to try!
diag-out.txt (11.2 KB)
fabricmanager.log (64.5 KB)
nvidia-bug-report-redacted.log.gz (3.0 MB)