CUDA_ERROR_NOT_INITIALIZED returned from cuInit()

My OS is Rocky Linux 8.5 with driver version 495.29.05. I also tried 510.47.03.

Here is my sample program:

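#include <cuda.h>
#include <iostream>

int main()
{
    // Initialize the driver API; the flags argument must be 0.
    const CUresult res = cuInit(0);
    const char *s = nullptr;
    // cuGetErrorString returns its own CUresult (0 is CUDA_SUCCESS) and stores the message in s.
    std::cout << "cuGetErrorString returns " << cuGetErrorString(res, &s) << "\n";
    // s stays null if the lookup failed, so guard before printing it.
    std::cout << "cuInit result: " << (s ? s : "(unknown)") << "\n";
}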

Here is what happens when I run it on most of my systems; i.e., what I expect:

cuGetErrorString returns 0
cuInit result: no error

But here is what happens on a brand-new 8x A100 system I just installed:

cuGetErrorString returns 0
cuInit result: system not yet initialized

Note that “system not yet initialized” is CUDA_ERROR_NOT_INITIALIZED, which the documentation does not list as a possible return value of cuInit().

I am seeing this error on everything I try to run (TensorFlow, dcgmi diag, etc.).

There are no errors or warnings in the kernel logs as far as I can tell.

The driver itself is loading fine. nvidia-smi shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:27:00.0 Off |                    0 |
| N/A   24C    P0    53W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:2A:00.0 Off |                    0 |
| N/A   23C    P0    51W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:51:00.0 Off |                    0 |
| N/A   23C    P0    52W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   25C    P0    50W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:9E:00.0 Off |                    0 |
| N/A   26C    P0    53W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:A4:00.0 Off |                    0 |
| N/A   24C    P0    51W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C7:00.0 Off |                    0 |
| N/A   24C    P0    49W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   26C    P0    52W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Any help debugging this would be appreciated.

You seem to have an SXM (NVSwitch) system. Please start the Fabric Manager (and also nvidia-persistenced).
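If you want to check this from code before calling cuInit(), something like the following could work. It is only a sketch: it assumes a systemd-based system and that the services are installed under the unit names nvidia-fabricmanager and nvidia-persistenced, which may differ with your packaging. The helper names are made up for illustration.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Returns true when `systemctl is-active --quiet <unit>` exits with status 0.
// The unit name is pasted into a shell command, so pass trusted literals only.
bool service_is_active(const std::string &unit)
{
    const std::string cmd =
        "systemctl is-active --quiet " + unit + " >/dev/null 2>&1";
    return std::system(cmd.c_str()) == 0;
}

// Prints one status line per unit and returns how many are not active.
int report_inactive_services()
{
    // Assumed unit names; adjust them if your installation differs.
    const char *units[] = {"nvidia-fabricmanager", "nvidia-persistenced"};
    int inactive = 0;
    for (const char *unit : units) {
        const bool ok = service_is_active(unit);
        std::cout << unit << ": " << (ok ? "active" : "not active") << "\n";
        if (!ok) {
            ++inactive;
        }
    }
    return inactive;
}
```

Calling report_inactive_services() at the top of main() and getting a non-zero count would suggest starting the missing service first (for example with systemctl start nvidia-fabricmanager, again assuming that unit name) and then retrying cuInit(0).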

@generix Thank you! That was the problem.

This is even documented in the Fabric Manager User’s Guide. I would still say the documentation for cuInit is out of date.
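For anyone landing here with the same message, it may help to print the numeric value of the result (or call cuGetErrorName) in addition to the message string, since the string alone does not name the enumerator. In the cuda.h versions I have looked at, "system not yet initialized" is attached to CUDA_ERROR_SYSTEM_NOT_READY (802), while CUDA_ERROR_NOT_INITIALIZED (3) reads "initialization error"; please verify against your own header. The sketch below turns the numeric code into a hint. The constants and the helper name are assumptions for illustration, not part of the CUDA API.

```cpp
#include <string>

// Numeric values as I understand them from cuda.h; treat them as assumptions
// and compare against the header shipped with your toolkit.
constexpr int kSuccess        = 0;    // CUDA_SUCCESS
constexpr int kNotInitialized = 3;    // CUDA_ERROR_NOT_INITIALIZED
constexpr int kSystemNotReady = 802;  // CUDA_ERROR_SYSTEM_NOT_READY

// Hypothetical helper: maps the integer value of a cuInit() result to a hint.
std::string hint_for_cuinit(int code)
{
    switch (code) {
    case kSuccess:
        return "ok";
    case kNotInitialized:
        return "CUDA_ERROR_NOT_INITIALIZED: driver API used before cuInit()";
    case kSystemNotReady:
        return "CUDA_ERROR_SYSTEM_NOT_READY: check that the fabric manager "
               "and other required driver daemons are running";
    default:
        return "unrecognized code " + std::to_string(code);
    }
}
```

In the sample program this could be used as std::cout << hint_for_cuinit(static_cast<int>(res)) << "\n"; right after the cuInit(0) call.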

Anyway I am unblocked. Thanks again.