Unable to run CUDA programs in Singularity containers from NGC

I am experiencing some problems running programs compiled in a container built from an NGC CUDA image:

$ srun singularity pull docker://nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04

I compile and run everything through a Slurm queue on a DGX-2 system.
I have downloaded the CUDA Toolkit code samples, which I keep in ‘~/NVIDIA_CUDA-11.1_Samples’.
I pick a code sample at random, ‘~/NVIDIA_CUDA-11.1_Samples/0_Simple/matrixMul/’, and compile it in my container:

$ srun --gres=gpu:1 singularity exec --nv cuda_11.1-cudnn8-devel-ubuntu18.04.sif make

The build succeeds and I now have the matrixMul executable. I then try to run it in my container:

$ srun --gres=gpu:1 singularity exec --nv ~/image-build/cuda_11.1/cuda_11.1-cudnn8-devel-ubuntu18.04.sif ./matrixMul
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
[Matrix Multiply Using CUDA] - Starting...
CUDA error at ../../common/inc/helper_cuda.h:779 code=3(cudaErrorInitializationError) "cudaGetDeviceCount(&device_count)" 
srun: error: nv-ai-01.srv.aau.dk: task 0: Exited with exit code 1

It fails, and the error message suggests that it cannot see any available GPUs. However, if I probe around a bit more with nvidia-smi, I do seem to have a GPU available in the container:

$ srun --gres=gpu:1 singularity exec --nv ~/image-build/cuda_11.1/cuda_11.1-cudnn8-devel-ubuntu18.04.sif nvidia-smi
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
Mon Oct 19 23:26:07 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.116.00   Driver Version: 418.116.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:BE:00.0 Off |                    0 |
| N/A   39C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I get suspicious here because nvidia-smi reports CUDA version 10.1, but I am running it inside a container that I expect to provide CUDA 11.1.
I expected the cuda:11.1-cudnn8-devel-ubuntu18.04 container to provide the drivers and CUDA toolkit needed to compile and run CUDA applications. Compiling works fine, but running does not.
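
One way to see which versions are actually in play is to compare the toolkit inside the image with the driver on the host (a quick check, assuming nvcc is on the container's PATH):

$ srun --gres=gpu:1 singularity exec --nv ~/image-build/cuda_11.1/cuda_11.1-cudnn8-devel-ubuntu18.04.sif nvcc --version
$ srun --gres=gpu:1 nvidia-smi --query-gpu=driver_version --format=csv,noheader

The first command should report the 11.1 toolkit shipped in the image, while the second reports the host's driver version (418.116.00 here), since the driver comes from the compute node rather than the container.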
What could be wrong here?

I have figured out what I did wrong. I was indeed making a wrong assumption: the NVIDIA driver is provided by the host OS, not by the container, and the host's driver (418.116.00) is too old for CUDA 11.1. The "CUDA Version: 10.1" reported by nvidia-smi is the highest CUDA version that driver supports, which is why the 11.1 runtime failed to initialize. I solved this by using the CUDA 10.0 / cuDNN 7 variant of the image instead, and with that I can both compile and run the samples. More importantly, CUDA 10.0 is sufficient for the software I actually want to run, so this solves my problem.
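
For completeness, the switch amounts to repeating the same steps with the older tag (a sketch, assuming the CUDA 10.0 / cuDNN 7 devel image follows the same naming scheme on NGC and that the matching ~/NVIDIA_CUDA-10.0_Samples tree is used):

$ srun singularity pull docker://nvcr.io/nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04
$ srun --gres=gpu:1 singularity exec --nv cuda_10.0-cudnn7-devel-ubuntu18.04.sif make
$ srun --gres=gpu:1 singularity exec --nv cuda_10.0-cudnn7-devel-ubuntu18.04.sif ./matrixMul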