Error 802: system not yet initialized CUDA 11.3

I’m trying to set up GPUs to work with CUDA on AWS. This is the output of nvidia-smi:

(base) ubuntu@ip-172-31-49-222:~$ nvidia-smi
Mon Nov 21 05:05:22 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:10:1C.0 Off |                    0 |
| N/A   23C    P0    40W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:10:1D.0 Off |                    0 |
| N/A   23C    P0    38W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:20:1C.0 Off |                    0 |
| N/A   23C    P0    39W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:20:1D.0 Off |                    0 |
| N/A   23C    P0    40W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:90:1C.0 Off |                    0 |
| N/A   23C    P0    40W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:1D.0 Off |                    0 |
| N/A   22C    P0    39W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:A0:1C.0 Off |                    0 |
| N/A   23C    P0    40W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:A0:1D.0 Off |                    0 |
| N/A   23C    P0    41W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

My nvcc version is as follows:

(base) ubuntu@ip-172-31-49-222:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

I tried to run the simpleAssert program that ships with the NVIDIA CUDA samples and got the following error:

(base) ubuntu@ip-172-31-49-222:~/NVIDIA_CUDA-11.3_Samples/bin/x86_64/linux/release$ ./simpleAssert
simpleAssert starting...
OS_System_Type.release = 5.15.0-1022-aws
OS Info: <#26~20.04.1-Ubuntu SMP Sat Oct 15 03:22:07 UTC 2022>

CUDA error at ../../common/inc/helper_cuda.h:779 code=802(cudaErrorSystemNotReady) "cudaGetDeviceCount(&device_count)" 

I ran nvidia-bug-report.sh; the resulting log is attached:
nvidia-bug-report.log (654.8 KB)

Error 802 (cudaErrorSystemNotReady) generally indicates that your system requires the NVLink/NVSwitch Fabric Manager and that it is not installed or not running.

This is generally the case for 8-way A100 SXM systems, which require Fabric Manager to be installed and running.
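A quick way to check this on a systemd host is to ask for the service state. The following is only a minimal sketch: the unit name nvidia-fabricmanager is an assumption based on NVIDIA's Linux packages, and fm_service_state and fm_advice are hypothetical helper names, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Sketch: report whether the Fabric Manager service is running.
# Assumption: systemd host, unit named nvidia-fabricmanager.

# Print the unit state as one word (active, inactive, failed, ...).
# Prints nothing if systemctl is unavailable.
fm_service_state() {
    systemctl is-active nvidia-fabricmanager 2>/dev/null || true
}

# Map a state word to a suggested next step.
fm_advice() {
    case "$1" in
        active)
            echo "running: look elsewhere for the cause of error 802" ;;
        failed)
            echo "failed: inspect 'journalctl -u nvidia-fabricmanager'" ;;
        *)
            echo "not running: install and start nvidia-fabricmanager" ;;
    esac
}

# Usage:
#   fm_advice "$(fm_service_state)"
```

Splitting the query from the advice keeps the second function free of system calls, so it can be exercised on a machine without GPUs.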

fabric manager docs

Thanks for your response! I have already tried to install Fabric Manager using sudo apt-get install cuda-drivers-fabricmanager-470, but I get the following error when I try to start it:

(base) ubuntu@ip-172-31-49-222:~$ sudo systemctl start nvidia-fabricmanager
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xe" for details.

Here is the output when I check the status:

(base) ubuntu@ip-172-31-49-222:~$ systemctl status nvidia-fabricmanager.service
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2022-11-21 05:28:28 UTC; 1min 5s ago
    Process: 4481 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)

Nov 21 05:28:28 ip-172-31-49-222 systemd[1]: Starting NVIDIA fabric manager service...
Nov 21 05:28:28 ip-172-31-49-222 nv-fabricmanager[4483]: fabric manager NVIDIA GPU driver interface version 470.141.10 don't match with driver version 470.141.03. Please update with matching NVIDIA driver package.
Nov 21 05:28:28 ip-172-31-49-222 nv-fabricmanager[4483]: fabric manager NVIDIA GPU driver interface version 470.141.10 don't match with driver version 470.141.03. Please update with matching NVIDIA driver package.
Nov 21 05:28:28 ip-172-31-49-222 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Nov 21 05:28:28 ip-172-31-49-222 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Nov 21 05:28:28 ip-172-31-49-222 systemd[1]: Failed to start NVIDIA fabric manager service.
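The two versions involved can be pulled out of the service log rather than read by eye. This is a rough sketch that assumes the message wording shown in the log above; mismatch_versions is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: pull the version numbers out of Fabric Manager's mismatch message.

# Read log text on stdin; print each distinct x.y.z version that appears
# on a line reporting a driver version mismatch, one per line.
mismatch_versions() {
    grep -F "match with driver version" \
        | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' \
        | sort -u
}

# Usage:
#   journalctl -u nvidia-fabricmanager -n 20 --no-pager | mismatch_versions
```

The first grep limits the search to mismatch lines, so unrelated version strings elsewhere in the journal are ignored.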

It reports a version mismatch: Fabric Manager is 470.141.10, but the driver is 470.141.03. I’m not sure how to install 470.141.03, since the command I used to install it doesn’t specify a minor version.
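One possible approach, sketched here under assumptions, is to list the versions the apt repository offers and pin the one that matches the loaded driver. The package name nvidia-fabricmanager-470 is an assumption (confirm with apt-cache search fabricmanager), the repository may no longer carry the needed version, and driver_version and matching_pkg_version are hypothetical helper names. Since the installed Fabric Manager is newer than the driver, this is a downgrade, and apt-get may ask for confirmation. Upgrading the driver to 470.141.10 instead would be the other way to make the two agree.

```shell
#!/bin/sh
# Sketch: find and install the Fabric Manager package that matches the driver.
# Assumption: the daemon is shipped in nvidia-fabricmanager-470 from an apt
# repository that still carries the needed version.

# Print the version of the loaded driver, e.g. 470.141.03.
driver_version() {
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}

# Read `apt-cache madison <package>` output on stdin and print the first
# package version equal to $1 or equal to $1 plus a "-revision" suffix.
matching_pkg_version() {
    awk -F'|' -v want="$1" '
        {
            v = $2
            gsub(/[ \t]/, "", v)
            # Appending "" forces a string comparison, never a numeric one.
            if ((v "") == (want "") || index(v, want "-") == 1) {
                print v
                exit
            }
        }'
}

# Usage:
#   pkg=nvidia-fabricmanager-470
#   ver=$(apt-cache madison "$pkg" | matching_pkg_version "$(driver_version)")
#   [ -n "$ver" ] && sudo apt-get install "$pkg=$ver"
#   sudo systemctl restart nvidia-fabricmanager
```

Reading the candidate versions from apt-cache madison avoids guessing the package revision suffix; if nothing matches, ver stays empty and the install step is skipped.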

The easy button here is probably to use an image that is set up already. NVIDIA provides some.