CUDA device not initialized error on all calls, HGX A100, CentOS 7 (Crosspost from Linux Forum)

user22112 · November 1, 2021, 1:45pm

The original post is linked here for context, since generix was unable to solve it there: CUDA device not initialized error on all calls, HGX A100, Centos 7

Hi,

I am attempting to set up an HGX A100 for use in a single-node Kubernetes cluster.
I am stuck at simply interacting with the GPUs from the host, leaving Docker and Kubernetes out of the picture.

I get a CUDA initialization error in every case:

- Running dcgmi diag -r 3 produces a variety of messages (attached) indicating a CUDA initialization error.
- Running the cuda-samples ./deviceQuery gives:
    deviceQuery: cudaGetDeviceCount returned 3
    → initialization error
    Result = FAIL
- Running pyopencl, or any other library that calls OpenCL, detects no platforms.
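
For anyone wanting to reproduce the failure without building the CUDA samples, here is a minimal sketch that calls the driver API's cuInit directly through ctypes. It assumes the driver library is installed under the usual name libcuda.so.1; the helper name cuda_driver_init_status is hypothetical and not part of any NVIDIA tool. A return value of 0 is CUDA_SUCCESS; any other value is a driver error code to look up in the CUDA documentation.

```python
import ctypes


def cuda_driver_init_status(lib_name="libcuda.so.1"):
    """Return the integer result of cuInit(0), or None if the library cannot be loaded.

    lib_name is an assumption: libcuda.so.1 is the usual driver library name on Linux.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        # Driver library is missing or not on the loader path.
        return None
    # CUresult cuInit(unsigned int Flags); Flags must currently be 0.
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


if __name__ == "__main__":
    status = cuda_driver_init_status()
    if status is None:
        print("libcuda.so.1 could not be loaded")
    else:
        print("cuInit returned", status)
```

This only checks driver initialization. It does not enumerate devices or touch the runtime API, so it is a narrower test than deviceQuery.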

According to the error description, this means that “the CUDA driver and runtime could not be initialized.”
But I can’t see why that would be the case:

- All drivers are the same version, 460.106.00, installed with the yum package manager.
- Fabric Manager seems to be working.
- We’ve restarted the host and disabled Docker in case of a conflict.
- We have tried the 470 drivers as well, but had the same issue.
- Initially we did not have Fabric Manager installed; installing it got us to this point.

The only oddity is that NVLink does not seem to be working; the output of dcgmi nvlink --link-status is below. But I don’t think NVLink is necessary for this?

+----------------------+
|  NvLink Link Status  |
+----------------------+
GPUs:
    gpuId 0:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 1:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 2:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 3:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 4:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 5:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 6:
        _ _ _ _ _ _ _ _ _ _ _ _
    gpuId 7:
        _ _ _ _ _ _ _ _ _ _ _ _
NvSwitches:
    physicalId 12:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 13:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 9:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 8:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 10:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
    physicalId 11:
        X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
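
To compare link states across reboots or driver versions, the table above can be reduced to per-device counts. The following is a rough sketch that assumes the output keeps exactly the layout shown here (a "GPUs:" or "NvSwitches:" section header, a "gpuId N:" or "physicalId N:" line, then one line of space-separated state symbols). The function name summarize_link_status is hypothetical, and the sketch deliberately does not interpret the symbols; check the dcgmi documentation for what each one means.

```python
from collections import Counter


def summarize_link_status(text):
    """Count link-state symbols per device in `dcgmi nvlink --link-status` output.

    Assumes the layout shown in this post; other DCGM versions may differ.
    """
    summary = {}
    section = None   # "GPUs" or "NvSwitches"
    device = None    # e.g. "GPUs gpuId 0", set when an id line was just seen
    for raw in text.splitlines():
        line = raw.strip()
        if line in ("GPUs:", "NvSwitches:"):
            section = line.rstrip(":")
            device = None
        elif section and line.endswith(":") and line.split()[0] in ("gpuId", "physicalId"):
            device = f"{section} {line.rstrip(':')}"
        elif device and line:
            # The line after an id line holds one symbol per link.
            summary[device] = Counter(line.split())
            device = None
    return summary


if __name__ == "__main__":
    # Abbreviated sample; the real output has one entry per GPU and NvSwitch.
    sample = """\
GPUs:
    gpuId 0:
        _ _ _ _ _ _ _ _ _ _ _ _
NvSwitches:
    physicalId 12:
        X X X X
"""
    for name, counts in summarize_link_status(sample).items():
        print(name, dict(counts))
```

The real output could be fed in by saving it to a file first and passing the file's contents to the function.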

I have attached the output of nvidia-bug-report (with the hostnames redacted), the fabricmanager.log, and the output of dcgmi diag -r 3.

Help, I don’t have anything left to try!
diag-out.txt (11.2 KB)
fabricmanager.log (64.5 KB)
nvidia-bug-report-redacted.log.gz (3.0 MB)
