CUDA failing to detect devices when another GPU application is already running

I am having an issue where CUDA fails to detect the GPUs if another GPU application is already running on the node.

I have two GPU nodes running NVIDIA driver version 418.87. One node (4x GP100) has no issues running simultaneous GPU applications. The other node (8x Tesla P100) runs single jobs absolutely fine, but any additional GPU jobs cannot detect the devices. I have encountered this with the GPU port of VASP and with the nbody simulation from the CUDA code samples; both programs were compiled with CUDA Toolkit v10.0. I wondered whether this was a driver problem, but the error only appeared in the last two weeks and the driver has been in use for about three months.

Below is the STDOUT+STDERR from the VASP job. At the top is an echo of $CUDA_VISIBLE_DEVICES to show which GPUs have been assigned.

$CUDA_VISIBLE_DEVICES=4,5,6,7
[0] MPI startup(): Multi-threaded optimized library
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       13849    node0115   32
[0] MPI startup(): 1       13850    node0115   33
[0] MPI startup(): 2       13851    node0115   34
[0] MPI startup(): 3       13852    node0115   35

CUDA Error in cuda_main.cu, line 196: no CUDA-capable device is detected
 No CUDA-supporting devices found!

CUDA Error in cuda_main.cu, line 196: no CUDA-capable device is detected
 No CUDA-supporting devices found!

CUDA Error in cuda_main.cu, line 196: no CUDA-capable device is detected
 No CUDA-supporting devices found!

CUDA Error in cuda_main.cu, line 196: no CUDA-capable device is detected
 No CUDA-supporting devices found!
forrtl: severe (174): SIGSEGV, segmentation fault occurred

After CUDA fails to detect any devices, the program segfaults.
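
To separate this from VASP and nbody, I also used a minimal probe that just calls cudaGetDeviceCount and prints the result. This is only a sketch of the same detection pattern (the actual check in cuda_main.cu is part of the VASP GPU port and is not shown here), but it reproduces the "no CUDA-capable device is detected" error on its own:

// device_probe.cu -- minimal sketch of a CUDA device-detection check.
// Not the VASP code; just the same cudaGetDeviceCount pattern.
// Build with: nvcc device_probe.cu -o device_probe
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // This is the path hit above: "no CUDA-capable device is detected"
        std::fprintf(stderr, "CUDA Error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("number of CUDA devices = %d\n", count);
    return 0;
}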

Using the nbody simulation from the CUDA code samples as a test, I observe the same behaviour.

$CUDA_VISIBLE_DEVICES=4,5,6,7
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

number of CUDA devices  = 4
Error: only 0 Devices available, 4 requested.  Exiting.

In both of these cases, another application (either a VASP calculation or an nbody simulation) was already running on the first four GPUs. Below is the nvidia-smi output showing the state of the node at the time the new jobs were launched.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   49C    P0   120W / 300W |   3447MiB / 16280MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 00000000:05:00.0 Off |                    0 |
| N/A   41C    P0   119W / 300W |   2758MiB / 16280MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 00000000:09:00.0 Off |                    0 |
| N/A   41C    P0   129W / 300W |   3447MiB / 16280MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   47C    P0   132W / 300W |   2578MiB / 16280MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  Off  | 00000000:8C:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  Off  | 00000000:8D:00.0 Off |                    0 |
| N/A   31C    P0    40W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  Off  | 00000000:90:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  Off  | 00000000:91:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     15583      C   vasp_gpu                                     687MiB |
|    0     15584      C   vasp_gpu                                     687MiB |
|    0     15585      C   vasp_gpu                                     687MiB |
|    0     15586      C   vasp_gpu                                     687MiB |
|    0     15587      C   vasp_gpu                                     687MiB |
|    1     15588      C   vasp_gpu                                     687MiB |
|    1     15589      C   vasp_gpu                                     687MiB |
|    1     15590      C   vasp_gpu                                     687MiB |
|    1     15591      C   vasp_gpu                                     627MiB |
|    2     15592      C   vasp_gpu                                     687MiB |
|    2     15593      C   vasp_gpu                                     687MiB |
|    2     15594      C   vasp_gpu                                     687MiB |
|    2     15595      C   vasp_gpu                                     687MiB |
|    2     15596      C   vasp_gpu                                     687MiB |
|    3     15597      C   vasp_gpu                                     627MiB |
|    3     15598      C   vasp_gpu                                     627MiB |
|    3     15599      C   vasp_gpu                                     627MiB |
|    3     15600      C   vasp_gpu                                     627MiB |
+-----------------------------------------------------------------------------+

I feel kind of silly having written all this and then finding a solution after testing a few more things. I suspected the error was caused by the driver unloading itself from the idle GPUs, with the first CUDA call then incurring a long latency while the driver reloaded. That seems to have been correct: enabling persistence mode with "nvidia-smi -pm 1" appears to have fixed the issue on the 8-GPU node.
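
For anyone hitting the same thing, a quick way to see the effect is to time the first CUDA call on an otherwise idle GPU with persistence mode off versus on. This is just a rough test sketch of my own, not taken from VASP or the CUDA samples:

// init_latency.cu -- rough sketch to time driver/context initialisation
// on the first CUDA call. With persistence mode disabled, the driver may
// have unloaded from an idle GPU, so the first call can take much longer.
// Build with: nvcc init_latency.cu -o init_latency
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);   // forces driver load and context creation
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("first CUDA call: %s after %.1f ms\n",
                cudaGetErrorString(err), ms);
    return err == cudaSuccess ? 0 : 1;
}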