Error 802 at device access on an A100 node with CUDA 11.5

Hello,

We are attempting to configure an 8X A100 node running Red Hat Linux 8. The driver and CUDA versions reported by nvidia-smi are shown below.

| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |

All 8 GPUs are visible through nvidia-smi, but we have been unable to launch any work on the devices. To rule out our user code, we tried the “nvaccelinfo” utility from the NVIDIA HPC SDK and “bandwidthTest” from the CUDA samples GitHub repository. Both return the same error code, 802:

OUTPUT FROM NVACCELINFO:
[gdavid@sj-numecagpu-01 ~]$ /home/gdavid/Compilers/Linux_x86_64/21.2/compilers/bin/nvaccelinfo -v
CUDA Driver Version: 11050
NVRM version: NVIDIA UNIX x86_64 Kernel Module 495.29.05 Thu Sep 30 16:00:29 UTC 2021
could not initialize CUDA runtime, error code=802
No accelerators found.
Check the permissions on your CUDA device

OUTPUT FROM CUDA SAMPLES (BANDWIDTHTEST):
[gdavid@sj-numecagpu-01 bandwidthTest]$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
cudaGetDeviceProperties returned 802
-> system not yet initialized
CUDA error at bandwidthTest.cu:256 code=802(cudaErrorSystemNotReady) "cudaSetDevice(currentDevice)"
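
The same code can probably be reproduced without either tool by calling the driver's cuInit directly. Below is a minimal Python sketch of such a probe, assuming the driver library is installed under its usual name libcuda.so.1; the helper names probe_cuda_driver and describe are made up for illustration and are not part of any NVIDIA tool.

```python
import ctypes

CUDA_SUCCESS = 0
# Value of CUDA_ERROR_SYSTEM_NOT_READY in cuda.h; the runtime API reports
# the same number as cudaErrorSystemNotReady.
CUDA_ERROR_SYSTEM_NOT_READY = 802


def probe_cuda_driver(library="libcuda.so.1"):
    """Call cuInit(0) and return its integer result.

    Returns None if the driver library cannot be loaded at all.
    The default library name is an assumption about the install.
    """
    try:
        libcuda = ctypes.CDLL(library)
    except OSError:
        return None
    # CUresult cuInit(unsigned int Flags)
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


def describe(result):
    """Turn a cuInit result into a short human-readable message."""
    if result is None:
        return "libcuda could not be loaded"
    if result == CUDA_SUCCESS:
        return "cuInit succeeded"
    if result == CUDA_ERROR_SYSTEM_NOT_READY:
        return "cuInit failed with 802 (CUDA_ERROR_SYSTEM_NOT_READY)"
    return f"cuInit failed with {result}"


if __name__ == "__main__":
    print(describe(probe_cuda_driver()))
```

If this also reports 802, the failure is happening at driver initialization, before any application or sample code gets involved.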

We are not quite sure what to make of this error. Any advice you could provide would be very helpful.

Thanks in advance,

-David

We figured this out.

For future reference: we needed to start the Fabric Manager service. This was unexpected, as we do not have NVSwitch on this node. It is working perfectly now.
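
For anyone checking the same thing, here is a rough Python sketch of querying the service state. It assumes a systemd-based system and the unit name nvidia-fabricmanager (the name used later in this thread); the functions fabric_manager_state and next_step are hypothetical helpers, and the script only suggests a start command rather than running it, since starting the service needs root.

```python
import subprocess

# Assumed systemd unit name; adjust if your install uses a different one.
UNIT = "nvidia-fabricmanager"


def fabric_manager_state(run=subprocess.run):
    """Return systemd's state string for the unit, e.g. 'active' or 'inactive'.

    Returns 'unknown' if systemctl is not installed or prints nothing.
    The `run` parameter exists only so the call can be replaced in tests.
    """
    try:
        proc = run(
            ["systemctl", "is-active", UNIT],
            capture_output=True,
            text=True,
            check=False,  # is-active exits non-zero when the unit is not active
        )
    except FileNotFoundError:
        return "unknown"
    return proc.stdout.strip() or "unknown"


def next_step(state):
    """Suggest what to do for a given service state."""
    if state == "active":
        return "Fabric Manager is running; retry the CUDA application."
    return f"Fabric Manager is '{state}'; try: sudo systemctl start {UNIT}"


if __name__ == "__main__":
    print(next_step(fabric_manager_state()))
```

Once the service reports active, rerunning nvaccelinfo or bandwidthTest should show whether error 802 is gone.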

Hi, I am having the same issue on an 8X A100 node running Ubuntu 20.04. When I attempted to start the nvidia-fabricmanager service, it failed with the following error:

nv-fabricmanager[138920]: failed to acquire required privileges to access NVSwitch devices. make sure fabric manager has access permissions to required device node files

Any clues on how to solve this?

Thank you.