I am having an issue where GPUs are not being detected by CUDA if a GPU application is currently running.
I have two GPU nodes running NVIDIA driver version 418.87. One node (4x GP100s) has no issues running simultaneous GPU applications. The other node (8x Tesla P100s) runs single jobs fine, but any additional GPU job fails to detect the devices. I have encountered this with VASP for GPU and with the nbody simulation from the CUDA code samples; both programs were compiled with CUDA Toolkit 10.0. I am wondering whether this is a driver issue, although the error only appeared in the last two weeks and the driver has been in use for about three months.
Below is the combined STDOUT/STDERR from the VASP job. At the top is an echo of $CUDA_VISIBLE_DEVICES to show which GPUs were assigned.
$CUDA_VISIBLE_DEVICES=4,5,6,7
[0] MPI startup(): Multi-threaded optimized library
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 13849 node0115 32
[0] MPI startup(): 1 13850 node0115 33
[0] MPI startup(): 2 13851 node0115 34
[0] MPI startup(): 3 13852 node0115 35
CUDA Error in cuda_main.cu, line 196: no CUDA-capable device is detected
No CUDA-supporting devices found!
CUDA Error in cuda_main.cu, line 196: no CUDA-capable device is detected
No CUDA-supporting devices found!
CUDA Error in cuda_main.cu, line 196: no CUDA-capable device is detected
No CUDA-supporting devices found!
CUDA Error in cuda_main.cu, line 196: no CUDA-capable device is detected
No CUDA-supporting devices found!
forrtl: severe (174): SIGSEGV, segmentation fault occurred
After CUDA fails to detect any devices, the program segfaults.
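If it is useful, a minimal check along these lines (a sketch only, not code from VASP or the samples) shows directly how many devices the CUDA runtime sees under the current CUDA_VISIBLE_DEVICES mask; the error string in the VASP output matches what cudaGetDeviceCount returns when it finds no devices:

// Sketch: query the CUDA runtime for the visible device count.
// Compile with nvcc from the same CUDA 10.0 toolkit, e.g. nvcc count.cu -o count
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // "no CUDA-capable device is detected" corresponds to cudaErrorNoDevice
        std::printf("cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("runtime sees %d device(s)\n", count);
    return 0;
}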
Using the nbody simulation from the CUDA code samples as a test, the same behaviour is observed.
$CUDA_VISIBLE_DEVICES=4,5,6,7
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
number of CUDA devices = 4
Error: only 0 Devices available, 4 requested. Exiting.
In both cases, another application (either a VASP calculation or an nbody simulation) was already running on the first four GPUs. Below is a sample of the nvidia-smi output showing the state of the node at the time the failing jobs were launched.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 00000000:04:00.0 Off | 0 |
| N/A 49C P0 120W / 300W | 3447MiB / 16280MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 00000000:05:00.0 Off | 0 |
| N/A 41C P0 119W / 300W | 2758MiB / 16280MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2... Off | 00000000:09:00.0 Off | 0 |
| N/A 41C P0 129W / 300W | 3447MiB / 16280MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2... Off | 00000000:0A:00.0 Off | 0 |
| N/A 47C P0 132W / 300W | 2578MiB / 16280MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P100-SXM2... Off | 00000000:8C:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla P100-SXM2... Off | 00000000:8D:00.0 Off | 0 |
| N/A 31C P0 40W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla P100-SXM2... Off | 00000000:90:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla P100-SXM2... Off | 00000000:91:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15583 C vasp_gpu 687MiB |
| 0 15584 C vasp_gpu 687MiB |
| 0 15585 C vasp_gpu 687MiB |
| 0 15586 C vasp_gpu 687MiB |
| 0 15587 C vasp_gpu 687MiB |
| 1 15588 C vasp_gpu 687MiB |
| 1 15589 C vasp_gpu 687MiB |
| 1 15590 C vasp_gpu 687MiB |
| 1 15591 C vasp_gpu 627MiB |
| 2 15592 C vasp_gpu 687MiB |
| 2 15593 C vasp_gpu 687MiB |
| 2 15594 C vasp_gpu 687MiB |
| 2 15595 C vasp_gpu 687MiB |
| 2 15596 C vasp_gpu 687MiB |
| 3 15597 C vasp_gpu 627MiB |
| 3 15598 C vasp_gpu 627MiB |
| 3 15599 C vasp_gpu 627MiB |
| 3 15600 C vasp_gpu 627MiB |
+-----------------------------------------------------------------------------+
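The Compute M. column reports Default for all eight GPUs, which should allow multiple contexts per device. As a further check (again only a sketch compiled against the same toolkit, not part of either application), a per-device query through the runtime would confirm whether it agrees with nvidia-smi about the compute mode and identity of the GPUs listed in CUDA_VISIBLE_DEVICES:

// Sketch: list the devices the runtime can see, with their PCI bus IDs
// and compute modes, so the renumbered devices can be matched back to
// the Bus-Id and Compute M. columns in the nvidia-smi output above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::printf("no devices visible to the runtime\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;
        // computeMode 0 = Default, 3 = Exclusive Process
        std::printf("device %d: %s, pciBusID=0x%02x, computeMode=%d\n",
                    i, prop.name, prop.pciBusID, prop.computeMode);
    }
    return 0;
}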