I am having trouble running programs compiled in a Singularity container built from an NGC CUDA image:
$ srun singularity pull docker://nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
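For reference, the toolkit baked into the pulled image can be checked without touching a GPU; I would expect nvcc here to report release 11.1, matching the image tag:
$ srun singularity exec cuda_11.1-cudnn8-devel-ubuntu18.04.sif nvcc --version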
All of my compiling and running goes through a Slurm queue on a DGX-2 system.
I have downloaded the CUDA Toolkit code samples, which I keep in ‘~/NVIDIA_CUDA-11.1_Samples’.
I pick a code sample more or less at random, ‘~/NVIDIA_CUDA-11.1_Samples/0_Simple/matrixMul/’, and compile it in my container:
$ srun --gres=gpu:1 singularity exec --nv cuda_11.1-cudnn8-devel-ubuntu18.04.sif make
The build succeeds and I now have the executable matrixMul. Now I try to execute matrixMul in my container:
$ srun --gres=gpu:1 singularity exec --nv ~/image-build/cuda_11.1/cuda_11.1-cudnn8-devel-ubuntu18.04.sif ./matrixMul
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
[Matrix Multiply Using CUDA] - Starting...
CUDA error at ../../common/inc/helper_cuda.h:779 code=3(cudaErrorInitializationError) "cudaGetDeviceCount(&device_count)"
srun: error: nv-ai-01.srv.aau.dk: task 0: Exited with exit code 1
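The failing check in helper_cuda.h boils down to a single call to cudaGetDeviceCount, so the problem should be reproducible without any of the sample code. Below is a minimal sketch (count_devices.cu is just a name I made up); I assume it hits the same cudaErrorInitializationError when launched the same way:
// count_devices.cu - my own minimal check, mirroring the cudaGetDeviceCount call
// that helper_cuda.h wraps at the line reported in the error above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device_count = 0;
    cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess) {
        // Print the same kind of diagnostic that helper_cuda.h produces.
        std::printf("cudaGetDeviceCount failed: %s (code %d)\n",
                    cudaGetErrorString(err), static_cast<int>(err));
        return 1;
    }
    std::printf("cudaGetDeviceCount sees %d device(s)\n", device_count);
    return 0;
}
It would be compiled and run exactly like matrixMul:
$ srun --gres=gpu:1 singularity exec --nv ~/image-build/cuda_11.1/cuda_11.1-cudnn8-devel-ubuntu18.04.sif nvcc count_devices.cu -o count_devices
$ srun --gres=gpu:1 singularity exec --nv ~/image-build/cuda_11.1/cuda_11.1-cudnn8-devel-ubuntu18.04.sif ./count_devices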
matrixMul fails, and the error message could indicate that it cannot see any available GPUs. However, if I probe a bit further with nvidia-smi, I do seem to have a GPU available in the container:
$ srun --gres=gpu:1 singularity exec --nv ~/image-build/cuda_11.1/cuda_11.1-cudnn8-devel-ubuntu18.04.sif nvidia-smi
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
Mon Oct 19 23:26:07 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.116.00 Driver Version: 418.116.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM3... On | 00000000:BE:00.0 Off | 0 |
| N/A 39C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I get suspicious here because nvidia-smi reports ‘CUDA Version: 10.1’, but I am running it inside a container which I expect to provide CUDA 11.1.
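To see exactly which versions the CUDA runtime and the driver report from inside the container, I could build a tiny version probe like the one below (cuda_versions.cu is a name I made up); based on the nvidia-smi output I would expect it to show a driver that only supports CUDA 10.1 next to the container's 11.1 runtime:
// cuda_versions.cu - my own sketch for printing the CUDA versions visible inside the container.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver_version = 0;
    int runtime_version = 0;
    // Highest CUDA version supported by the installed driver (the one --nv binds in from the host).
    cudaDriverGetVersion(&driver_version);
    // Version of the CUDA runtime in use (from the container's toolkit).
    cudaRuntimeGetVersion(&runtime_version);
    // Versions are encoded as 1000*major + 10*minor, e.g. 11010 means 11.1.
    std::printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
                driver_version / 1000, (driver_version % 1000) / 10,
                runtime_version / 1000, (runtime_version % 1000) / 10);
    return 0;
}
This would be compiled and run through the same srun/singularity invocation as matrixMul above.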
I expect the cuda:11.1-cudnn8-devel-ubuntu18.04 container to provide me with the necessary drivers and CUDA toolkit to compile and run CUDA applications. Compiling seems to work fine, but running clearly does not.
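To check which driver libraries the container actually ends up with when --nv is used, I could list what Singularity binds in; if I understand the --nv mechanism correctly, those libraries land under /.singularity.d/libs (that path is my assumption about how --nv works, not something I have verified):
$ srun --gres=gpu:1 singularity exec --nv ~/image-build/cuda_11.1/cuda_11.1-cudnn8-devel-ubuntu18.04.sif ls -l /.singularity.d/libs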
What could be wrong here?