Hi,
I am attempting to set up an HGX A100 for use in a single-node Kubernetes cluster.
The issue I am stuck on is just interacting with the GPUs from the host, ignoring docker or kubernetes.
I get a CUDA initialization error:
- When running dcgmi diag -r 3: A variety of messages (attached) indicating there was a cuda initialisation error
- When running the cuda-sample ./deviceQuery:
deviceQuery:cudaGetDeviceCount returned 3
→ initialization error
Result = FAIL
- When running pyopencl or another library calling opencl: no platforms are detected
This indicates that there’s an issue because “the CUDA driver and runtime could not be initialized”.
But I can’t see why that would be the case:
The drivers are all the same version, installed using yum package manager: 460.106.00
Fabricmanager seems to be working
We’ve restarted the host and disabled docker in case of a conflict.
We have tried the 470 drivers as well, but had the same issue.
Initially we did not have fabricmanager installed, installing it got us to this point.
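To separate the driver-level failure from the runtime-level one that deviceQuery reports, cuInit can be called directly against the driver library. The following is a minimal sketch, assuming Python 3 and that the driver library is installed under the name libcuda.so.1; the function name probe_cuinit is hypothetical and only for illustration.

```python
import ctypes


def probe_cuinit(lib_name="libcuda.so.1"):
    """Call cuInit(0) in the CUDA driver library.

    Returns (code, name, description), or None if the library
    cannot be loaded. lib_name is an assumption; adjust as needed.
    """
    try:
        cuda = ctypes.CDLL(lib_name)
    except OSError:
        return None

    cuda.cuInit.argtypes = [ctypes.c_uint]
    cuda.cuInit.restype = ctypes.c_int
    code = cuda.cuInit(0)

    def lookup(func):
        # cuGetErrorName / cuGetErrorString take (CUresult, const char **).
        func.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
        func.restype = ctypes.c_int
        out = ctypes.c_char_p()
        if func(code, ctypes.byref(out)) == 0 and out.value:
            return out.value.decode()
        return "unknown"

    return code, lookup(cuda.cuGetErrorName), lookup(cuda.cuGetErrorString)


if __name__ == "__main__":
    result = probe_cuinit()
    if result is None:
        print("could not load the CUDA driver library")
    else:
        print("cuInit returned %d (%s): %s" % result)
```

The driver-level code and name are sometimes more specific than the generic "initialization error" surfaced by the runtime, which may help narrow down the cause.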
The only oddity is that NVLink does not seem to be working; the output of dcgmi nvlink --link-status is below. But I don’t think NVLink is needed just to initialise CUDA?
+----------------------+
| NvLink Link Status |
+----------------------+
GPUs:
gpuId 0:
_ _ _ _ _ _ _ _ _ _ _ _
gpuId 1:
_ _ _ _ _ _ _ _ _ _ _ _
gpuId 2:
_ _ _ _ _ _ _ _ _ _ _ _
gpuId 3:
_ _ _ _ _ _ _ _ _ _ _ _
gpuId 4:
_ _ _ _ _ _ _ _ _ _ _ _
gpuId 5:
_ _ _ _ _ _ _ _ _ _ _ _
gpuId 6:
_ _ _ _ _ _ _ _ _ _ _ _
gpuId 7:
_ _ _ _ _ _ _ _ _ _ _ _
NvSwitches:
physicalId 12:
X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
physicalId 13:
X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
physicalId 9:
X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
physicalId 8:
X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
physicalId 10:
X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
physicalId 11:
X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
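For anyone comparing this output across hosts, the link states can be tallied per device. This is a rough sketch, assuming the usual dcgmi legend (U = up, D = down, X = disabled, _ = not supported); that legend is not shown in the output above, so please verify it against your DCGM version. The function name summarize_link_status and the SAMPLE text are hypothetical, modelled on the layout above.

```python
from collections import Counter

# Assumed legend; not printed in the output above, verify locally.
LEGEND = {"U": "up", "D": "down", "X": "disabled", "_": "not supported"}


def summarize_link_status(text):
    """Count link states per device in `dcgmi nvlink --link-status` output."""
    summary = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(("gpuId", "physicalId")) and line.endswith(":"):
            # Device header such as "gpuId 0:" or "physicalId 12:".
            current = line[:-1]
            summary[current] = Counter()
        elif current is not None and line:
            tokens = line.split()
            # Only rows made up entirely of legend symbols are counted.
            if all(tok in LEGEND for tok in tokens):
                summary[current].update(LEGEND[tok] for tok in tokens)
    return summary


# Sample shaped like the output above: 12 GPU links, 36 switch links.
SAMPLE = "\n".join([
    "GPUs:",
    "gpuId 0:",
    " ".join("_" * 12),
    "NvSwitches:",
    "physicalId 12:",
    " ".join("X" * 36),
])

if __name__ == "__main__":
    for device, counts in summarize_link_status(SAMPLE).items():
        print(device, dict(counts))
```

Under the assumed legend, every GPU link above would count as "not supported" and every NVSwitch link as "disabled".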
Attached the output of nvidia-bug-report, with the hostnames redacted.
Attached the fabricmanager.log
Attached also output of dcgmi diag -r 3
Help, I don’t have anything left to try!
diag-out.txt (11.2 KB)
fabricmanager.log (64.5 KB)
nvidia-bug-report-redacted.log.gz (3.0 MB)
Edit: I’m not sure the diagnostics attachment worked, so here is the output for GPUs 0 and 1 inline (the results are the same for all GPUs; trimmed for brevity).
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Deployment --------+------------------------------------------------|
| Blacklist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Pass |
| Environment Variables | Pass |
| Page Retirement/Row Remap | Pass |
| Graphics Processes | Pass |
| Inforom | Pass |
+----- Integration -------+------------------------------------------------+
| pcie | Fail - All |
| Warning | GPU 0: Error using CUDA API cudaDeviceGetByPC |
| | IBusId 'initialization error' for GPU 0, bus |
| | ID = 00000000:07:00.0 |
| Warning | GPU 1: Error using CUDA API cudaDeviceGetByPC |
| | IBusId 'initialization error' for GPU 0, bus |
| | ID = 00000000:07:00.0 |
+----- Hardware ----------+------------------------------------------------+
| GPU Memory | Fail - All |
| Warning | GPU 0: Error using CUDA API cuInit Unable to |
| | initialize CUDA library: 'initialization erro |
| | r'.; verify that the fabric-manager has been |
| | started if applicable |
| Warning | GPU 1: Error using CUDA API cuInit Unable to |
| | initialize CUDA library: 'initialization erro |
| | r'.; verify that the fabric-manager has been |
| | started if applicable |
| diagnostic | Fail - All |
| Warning | GPU 0: API call cudaGetDeviceCount failed for |
| | GPU 0: 'initialization error', GPU 0: There |
| | was an internal error during the test: 'Faile |
| | d to initialize the plugin.' |
| Warning | GPU 1: There was an internal error during the |
| | test: 'Failed to initialize the plugin.' |
+----- Stress ------------+------------------------------------------------+
| sm_stress | Fail - All |
| Warning | GPU 0: There was an internal error during the |
| | test: 'Couldn't initialize the plugin, pleas |
| | e check the log file.', GPU 0: API call cudaG |
| | etDeviceCount failed for GPU 0: 'initializati |
| | on error', GPU 0: Error using CUDA API cudaDe |
| | viceGetByPCIBusId 'initialization error' for |
| | GPU 0, bus ID = 00000000:07:00.0 |
| Warning | GPU 1: There was an internal error during the |
| | test: 'Couldn't initialize the plugin, pleas |
| | e check the log file.', GPU 1: Error using CU |
| | DA API cudaDeviceGetByPCIBusId 'initializatio |
| | n error' for GPU 0, bus ID = 00000000:07:00.0 |
| targeted_stress | Fail - All |
| Warning | GPU 0: API call cudaGetDeviceCount failed for |
| | GPU 0: 'initialization error' |
| targeted_power | Pass - All |
| memory_bandwidth | Fail - All |
| Warning | GPU 0: API call cuInit failed for GPU 0: 'ini |
| | tialization error; verify that the fabric-man |
| | ager has been started if applicable' |
+---------------------------+------------------------------------------------+