Error 802: system not yet initialized CUDA 11.3

I’m trying to set up GPUs to work with CUDA on AWS. This is the output of nvidia-smi:

(base) ubuntu@ip-172-31-49-222:~$ nvidia-smi
Mon Nov 21 05:05:22 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:10:1C.0 Off |                    0 |
| N/A   23C    P0    40W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:10:1D.0 Off |                    0 |
| N/A   23C    P0    38W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:20:1C.0 Off |                    0 |
| N/A   23C    P0    39W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:20:1D.0 Off |                    0 |
| N/A   23C    P0    40W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:90:1C.0 Off |                    0 |
| N/A   23C    P0    40W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:1D.0 Off |                    0 |
| N/A   22C    P0    39W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:A0:1C.0 Off |                    0 |
| N/A   23C    P0    40W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:A0:1D.0 Off |                    0 |
| N/A   23C    P0    41W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

My nvcc version is the following:

(base) ubuntu@ip-172-31-49-222:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

I tried to run the simpleAssert program that ships with the NVIDIA sample code and got the following error:

(base) ubuntu@ip-172-31-49-222:~/NVIDIA_CUDA-11.3_Samples/bin/x86_64/linux/release$ ./simpleAssert
simpleAssert starting...
OS_System_Type.release = 5.15.0-1022-aws
OS Info: <#26~20.04.1-Ubuntu SMP Sat Oct 15 03:22:07 UTC 2022>

CUDA error at ../../common/inc/helper_cuda.h:779 code=802(cudaErrorSystemNotReady) "cudaGetDeviceCount(&device_count)" 

I ran nvidia-bug-report.sh, and this is the resulting file:
nvidia-bug-report.log (654.8 KB)

Error 802 generally indicates that your system requires the NVLink/NVSwitch fabric manager, and that it is not installed or not running.

This will generally be the case for 8-way A100 SXM systems. They require fabric manager to be installed.

See the fabric manager docs.
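As a quick way to see whether the fabric manager service is present and running, the check below is a minimal sketch. It assumes a systemd-based Linux system and the service name nvidia-fabricmanager (the name that appears later in this thread); on other setups the name may differ.

```shell
# Minimal sketch: report the state of the fabric manager service.
# Assumes systemd and a unit called "nvidia-fabricmanager" (assumption taken
# from the service name shown in this thread).
if command -v systemctl >/dev/null 2>&1; then
  # is-active prints a state such as "active", "inactive" or "failed";
  # "|| true" keeps the script going when the unit is not active.
  state=$(systemctl is-active nvidia-fabricmanager 2>/dev/null || true)
else
  state=""
fi
# Fall back to "unknown" when systemctl is missing or printed nothing.
state="${state:-unknown}"
echo "nvidia-fabricmanager state: $state"
```

Anything other than "active" on an NVSwitch-based system would be consistent with CUDA returning error 802.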

Thanks for your response! I have actually already tried to install the fabric manager using sudo apt-get install cuda-drivers-fabricmanager-470, but I get the following error when I try to start it:

(base) ubuntu@ip-172-31-49-222:~$ sudo systemctl start nvidia-fabricmanager
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xe" for details.

Here is the output when I check the status:

(base) ubuntu@ip-172-31-49-222:~$ systemctl status nvidia-fabricmanager.service
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2022-11-21 05:28:28 UTC; 1min 5s ago
    Process: 4481 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)

Nov 21 05:28:28 ip-172-31-49-222 systemd[1]: Starting NVIDIA fabric manager service...
Nov 21 05:28:28 ip-172-31-49-222 nv-fabricmanager[4483]: fabric manager NVIDIA GPU driver interface version 470.141.10 don't match with driver version 470.141.03. Please update with matching NVIDIA driver package.
Nov 21 05:28:28 ip-172-31-49-222 nv-fabricmanager[4483]: fabric manager NVIDIA GPU driver interface version 470.141.10 don't match with driver version 470.141.03. Please update with matching NVIDIA driver package.
Nov 21 05:28:28 ip-172-31-49-222 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Nov 21 05:28:28 ip-172-31-49-222 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Nov 21 05:28:28 ip-172-31-49-222 systemd[1]: Failed to start NVIDIA fabric manager service.

It shows a version mismatch: the fabric manager is 470.141.10 but the driver is 470.141.03. I’m not sure how to install fabric manager 470.141.03, since the command I used to install it doesn’t include a minor version.
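One possible way to approach the mismatch above is to pin the fabric manager package to the driver's exact version with apt. The sketch below only parses the log line from this thread and prints the commands one might run; it does not install anything. The package name nvidia-fabricmanager-470 is an assumption (it is the package the cuda-drivers-fabricmanager-470 metapackage is expected to pull in), and the exact version string must be taken from the apt-cache madison output rather than guessed.

```shell
# Log line copied from the systemctl status output in this thread.
log_line="fabric manager NVIDIA GPU driver interface version 470.141.10 don't match with driver version 470.141.03. Please update with matching NVIDIA driver package."

# Extract the two version strings (three dot-separated numeric fields each).
versions=$(printf '%s\n' "$log_line" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')
fm_version=$(printf '%s\n' "$versions" | sed -n '1p')      # fabric manager
driver_version=$(printf '%s\n' "$versions" | sed -n '2p')  # installed driver
branch="${driver_version%%.*}"                             # driver branch

echo "fabric manager: $fm_version, driver: $driver_version"

# Print (do not run) the commands. The package name is an assumption;
# replace <version> with an entry listed by apt-cache madison.
echo "apt-cache madison nvidia-fabricmanager-${branch}"
echo "sudo apt-get install nvidia-fabricmanager-${branch}=<version>"
```

The alternative is to go the other way and update the driver so that it matches the installed fabric manager, which is what the log message itself suggests.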

The easiest option here is probably to use a machine image that is already set up. NVIDIA provides some.

How do I solve this problem on a system with a single GPU card (GTX 1070)?

The fabric manager is not needed on such a system and should not be installed on such a system. If it has been installed, remove it. There are no instructions for removal; you remove it like you would any other installed package via the package manager on your OS.

If that does not resolve the issue, then the install history on your machine has corrupted the environment. In that case, the only suggestion I have is to reinstall the OS and then install CUDA using the available installers, following the available instructions.
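The removal step described above could look roughly like this on a Debian or Ubuntu system. That OS is an assumption, since the single-GPU question does not say which one is in use; other distributions would use their own package manager. The sketch only lists matching packages and prints a removal command instead of running it.

```shell
# Minimal sketch, assuming a Debian/Ubuntu system with dpkg available.
if command -v dpkg-query >/dev/null 2>&1; then
  # List every package known to dpkg whose name contains "fabricmanager".
  # This can include packages that were removed but still have config files.
  pkgs=$(dpkg-query -W -f='${Package}\n' 2>/dev/null | grep -i 'fabricmanager' || true)
else
  pkgs=""
fi

if [ -n "$pkgs" ]; then
  # Join the package names onto one line for the suggested command.
  pkg_list=$(printf '%s' "$pkgs" | tr '\n' ' ')
  echo "Fabric manager packages found: $pkg_list"
  echo "To remove them: sudo apt-get remove --purge $pkg_list"
else
  echo "No fabric manager package found."
fi
```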