Hello,
We are attempting to configure an 8x A100 node running Red Hat Linux 8. The driver and CUDA versions reported by nvidia-smi are shown below.
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
All 8 GPUs are visible through nvidia-smi, but we have been unable to launch any work on the devices. To rule out our user code, we tried the “nvaccelinfo” utility from the NVIDIA HPC SDK and “bandwidthTest” from the CUDA samples GitHub repository. Both fail with the same error code, 802:
OUTPUT FROM NVACCELINFO:
[gdavid@sj-numecagpu-01 ~]$ /home/gdavid/Compilers/Linux_x86_64/21.2/compilers/bin/nvaccelinfo -v
CUDA Driver Version: 11050
NVRM version: NVIDIA UNIX x86_64 Kernel Module 495.29.05 Thu Sep 30 16:00:29 UTC 2021
could not initialize CUDA runtime, error code=802
No accelerators found.
Check the permissions on your CUDA device
OUTPUT FROM CUDA SAMPLES (BANDWIDTHTEST):
[gdavid@sj-numecagpu-01 bandwidthTest]$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
cudaGetDeviceProperties returned 802
-> system not yet initialized
CUDA error at bandwidthTest.cu:256 code=802(cudaErrorSystemNotReady) "cudaSetDevice(currentDevice)"
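In case it is useful for reproducing this outside of either tool, a minimal check of driver initialization could look like the sketch below. It is an illustration only and is not part of the output above. It assumes the driver library is installed as libcuda.so.1, the helper name probe_cuda_driver is made up for this post, and it only calls cuInit and cuGetErrorName from the CUDA driver API through Python's ctypes.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) and return (code, name).

    Returns (None, reason) if the driver library cannot be loaded.
    lib_name is an assumption about where the driver library lives.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        return None, f"could not load {lib_name}: {exc}"

    # CUresult cuInit(unsigned int Flags)
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    code = lib.cuInit(0)

    # CUresult cuGetErrorName(CUresult error, const char **pStr)
    lib.cuGetErrorName.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
    lib.cuGetErrorName.restype = ctypes.c_int
    name = ctypes.c_char_p()
    if lib.cuGetErrorName(code, ctypes.byref(name)) == 0 and name.value:
        return code, name.value.decode()
    return code, "unknown"


if __name__ == "__main__":
    code, name = probe_cuda_driver()
    if code is None:
        print(name)
    else:
        print(f"cuInit returned {code} ({name})")
```

Run as a plain Python script, this prints the raw cuInit return code and the driver's own name for it, or the reason the library could not be loaded.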
We are not quite sure what to make of this error. Any advice you could provide would be very helpful.
Thanks in advance,
-David