I’m trying to set up GPUs to work with CUDA on AWS. This is the output of nvidia-smi:
(base) ubuntu@ip-172-31-49-222:~$ nvidia-smi
Mon Nov 21 05:05:22 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:10:1C.0 Off | 0 |
| N/A 23C P0 40W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:10:1D.0 Off | 0 |
| N/A 23C P0 38W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:20:1C.0 Off | 0 |
| N/A 23C P0 39W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:20:1D.0 Off | 0 |
| N/A 23C P0 40W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:90:1C.0 Off | 0 |
| N/A 23C P0 40W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:90:1D.0 Off | 0 |
| N/A 22C P0 39W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:A0:1C.0 Off | 0 |
| N/A 23C P0 40W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:A0:1D.0 Off | 0 |
| N/A 23C P0 41W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
My nvcc version is the following:
(base) ubuntu@ip-172-31-49-222:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0
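As a sanity check on the two outputs above, here is a minimal Python sketch that compares the toolkit release reported by nvcc (11.3) with the "CUDA Version" shown by nvidia-smi (11.4). It assumes that the nvidia-smi field is the highest CUDA runtime version the installed driver supports, so a toolkit at or below it should be acceptable. The helper names parse_version and toolkit_supported are made up for this illustration and are not part of any NVIDIA tool.

```python
import re

# Header line from nvidia-smi and release line from nvcc, copied from the output above.
SMI_HEADER = "| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |"
NVCC_RELEASE = "Cuda compilation tools, release 11.3, V11.3.58"


def parse_version(text, pattern):
    """Return (major, minor) from the first match of pattern.

    The pattern must contain two capture groups holding integers.
    """
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"no version found in: {text!r}")
    return int(match.group(1)), int(match.group(2))


def toolkit_supported(smi_header, nvcc_release):
    """True if the nvcc toolkit version does not exceed the driver's CUDA version."""
    # Assumption: "CUDA Version" in nvidia-smi is the driver's maximum supported version.
    driver_max = parse_version(smi_header, r"CUDA Version:\s*(\d+)\.(\d+)")
    toolkit = parse_version(nvcc_release, r"release\s+(\d+)\.(\d+)")
    # Tuples compare element by element, so (11, 3) <= (11, 4) holds.
    return toolkit <= driver_max


if __name__ == "__main__":
    print(toolkit_supported(SMI_HEADER, NVCC_RELEASE))
```

With the values pasted above, the check passes, so under that assumption a toolkit/driver version mismatch does not look like the cause of the problem.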
I tried to run the simpleAssert program that came with the NVIDIA sample code and got the following error:
(base) ubuntu@ip-172-31-49-222:~/NVIDIA_CUDA-11.3_Samples/bin/x86_64/linux/release$ ./simpleAssert
simpleAssert starting...
OS_System_Type.release = 5.15.0-1022-aws
OS Info: <#26~20.04.1-Ubuntu SMP Sat Oct 15 03:22:07 UTC 2022>
CUDA error at ../../common/inc/helper_cuda.h:779 code=802(cudaErrorSystemNotReady) "cudaGetDeviceCount(&device_count)"
I ran nvidia-bug-report.sh, and this is the resulting file:
nvidia-bug-report.log (654.8 KB)