My O/S is Rocky Linux 8.5, driver version 495.29.05. I also tried 510.47.03.
Here is my sample program:
#include "cuda.h"
#include <iostream>

int main()
{
    const CUresult res = cuInit(0);
    const char *s = 0;
    std::cout << "cuGetErrorString returns " << cuGetErrorString(res, &s) << "\n";
    std::cout << "cuInit result: " << s << "\n";
}
Here is what happens when I run it on most of my systems; i.e., what I expect:
cuGetErrorString returns 0
cuInit result: no error
But here is what happens on a brand-new 8x A100 system I just installed:
cuGetErrorString returns 0
cuInit result: system not yet initialized
Note that “system not yet initialized” is the message string for CUDA_ERROR_SYSTEM_NOT_READY (802), not CUDA_ERROR_NOT_INITIALIZED (3), and it is not listed as a possible return for cuInit() in the documentation.
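The message text alone does not settle which CUresult value came back, so a variant of the test program could print the numeric code and the symbolic name from cuGetErrorName() next to the message. The following is only a sketch under assumptions that are not part of the program above: it loads libcuda.so.1 with dlopen() so that it builds without the CUDA headers, it treats CUresult as a plain int, and the names CuInitProbe, probe_cuinit and format_probe are hypothetical helpers made up for illustration. On older glibc versions it may need to be linked with -ldl.

```cpp
#include <dlfcn.h>
#include <string>

// Hypothetical helper type: the outcome of one cuInit(0) attempt.
struct CuInitProbe {
    bool driver_loaded = false;  // false if the library or cuInit was not found
    int code = -1;               // numeric CUresult from cuInit(0); -1 if never called
    std::string name;            // symbolic name from cuGetErrorName, if available
    std::string message;         // text from cuGetErrorString, if available
};

// Hypothetical helper: load the driver library at run time and call cuInit(0).
// Assumption: CUresult is int-sized, so it is declared as int here.
CuInitProbe probe_cuinit(const char *lib = "libcuda.so.1")
{
    using InitFn = int (*)(unsigned int);
    using TextFn = int (*)(int, const char **);

    CuInitProbe out;
    void *handle = dlopen(lib, RTLD_NOW);
    if (!handle) return out;  // library missing: report "not loaded"

    auto init = reinterpret_cast<InitFn>(dlsym(handle, "cuInit"));
    auto get_name = reinterpret_cast<TextFn>(dlsym(handle, "cuGetErrorName"));
    auto get_text = reinterpret_cast<TextFn>(dlsym(handle, "cuGetErrorString"));
    if (init) {
        out.driver_loaded = true;
        out.code = init(0);
        const char *s = nullptr;
        // Both lookups return 0 (CUDA_SUCCESS) when the code is recognized.
        if (get_name && get_name(out.code, &s) == 0 && s) out.name = s;
        s = nullptr;
        if (get_text && get_text(out.code, &s) == 0 && s) out.message = s;
    }
    // The handle is deliberately not closed in this sketch.
    return out;
}

// Render the probe as one line: numeric code, symbolic name, message.
std::string format_probe(const CuInitProbe &p)
{
    if (!p.driver_loaded) return "driver library not loaded";
    return "cuInit returned " + std::to_string(p.code) + " (" +
           (p.name.empty() ? "unknown" : p.name) + "): " +
           (p.message.empty() ? "no message" : p.message);
}
```

Calling format_probe(probe_cuinit()) from main() and printing the result would show the numeric code, which can then be looked up in the CUresult enum in cuda.h instead of being inferred from the message string.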
I am seeing this error on everything I try to run (TensorFlow, dcgmi diag, etc.).
There are no errors or warnings in the kernel logs as far as I can tell.
The driver itself is loading fine. nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:27:00.0 Off |                    0 |
| N/A   24C    P0    53W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:2A:00.0 Off |                    0 |
| N/A   23C    P0    51W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:51:00.0 Off |                    0 |
| N/A   23C    P0    52W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   25C    P0    50W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:9E:00.0 Off |                    0 |
| N/A   26C    P0    53W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:A4:00.0 Off |                    0 |
| N/A   24C    P0    51W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C7:00.0 Off |                    0 |
| N/A   24C    P0    49W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   26C    P0    52W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Any help debugging this would be appreciated.