I have a clean installation of CUDA 10.1 on a headless Ubuntu 18.04 LTS server. This is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.95.01 Driver Version: 440.95.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40m Off | 00000000:82:00.0 Off | 0 |
| N/A 28C P0 61W / 235W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40m Off | 00000000:C2:00.0 Off | 0 |
| N/A 30C P0 63W / 235W | 0MiB / 11441MiB | 41% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I compile the following simple program:
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cudaError_t err = cudaSuccess;
    float *buf = NULL;

    err = cudaMalloc((void **)&buf, 1000);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device memory (error code %s)!\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    err = cudaFree(buf);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to free device memory (error code %s)!\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
Then I run it manually several times and it fails intermittently:
***@***:~/cuda$ ./test_cuda_malloc
***@***:~/cuda$ ./test_cuda_malloc
Failed to allocate device memory (error code all CUDA-capable devices are busy or unavailable)!
***@***:~/cuda$ ./test_cuda_malloc
Failed to allocate device memory (error code all CUDA-capable devices are busy or unavailable)!
***@***:~/cuda$ ./test_cuda_malloc
Failed to allocate device memory (error code all CUDA-capable devices are busy or unavailable)!
***@***:~/cuda$ ./test_cuda_malloc
***@***:~/cuda$ ./test_cuda_malloc
Failed to allocate device memory (error code all CUDA-capable devices are busy or unavailable)!
***@***:~/cuda$
I checked the output of nvidia-smi repeatedly, as well as /var/log/kern and /var/log/syslog, and found nothing that helps me track down the problem. Disabling one of the cards didn't help. The host is idle during these experiments and nothing else uses the GPUs.
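To narrow down where the failure occurs, I also wrote a small probe (sketch, untested against this exact setup) that enumerates the devices and forces context creation on each GPU explicitly, to see whether the error is tied to one specific card or to context creation rather than to the allocation itself:

```cuda
// Hypothetical diagnostic: enumerate GPUs and try to create a context on
// each one, separating "device busy" errors from allocation failures.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }
    printf("%d device(s) visible\n", count);

    for (int d = 0; d < count; ++d) {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) == cudaSuccess)
            printf("device %d: %s\n", d, prop.name);

        err = cudaSetDevice(d);
        if (err != cudaSuccess) {
            fprintf(stderr, "  cudaSetDevice: %s\n", cudaGetErrorString(err));
            continue;
        }
        /* cudaFree(0) forces context creation without allocating any
           device memory, so a failure here points at the context, not
           at cudaMalloc. */
        err = cudaFree(0);
        if (err != cudaSuccess)
            fprintf(stderr, "  context creation failed: %s\n", cudaGetErrorString(err));
        else
            printf("  context OK\n");
        cudaDeviceReset();
    }
    return EXIT_SUCCESS;
}
```

My working assumption is that if only one of the two K40m cards ever reports the error, it is a per-device problem; if both do so intermittently, it is more likely a driver or setup issue.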
What are the steps to troubleshoot this intermittent failure?