Singularity image of Relion

igor.kozin · September 21, 2018, 3:04pm

I’ve been trying to get a Singularity image of the Relion Docker container (from NVIDIA NGC) working on our HPC cluster with P100 and V100 nodes.
I was able to create the image (it’s very easy) and even to run it on my workstation with 2x 1080Ti. No problems.
But when I run it on the cluster I get

ERROR: all CUDA-capable devices are busy or unavailable in /opt/relion-sm70/src/gpu_utils/cuda_projector.cu at line 115 (error-code 46)

which is not true. Even if I run nvidia-smi through the very same image, it tells me all four GPUs are free. The same error happens on both P100 and V100 nodes. My workstation isn’t really much different from the cluster environment. Both run CentOS 7 with the same driver version:
NVIDIA-SMI 396.26 Driver Version: 396.26

Any hints are most welcome. I can send the full output if necessary. The error happens straight after

Running CPU instructions in double precision.

On host gpu03.prv.davros.compute.estate: free scratch space = 539 Gb.
Copying particles to scratch directory: /scratch/tmp.14368/relion_volatile/
1.30/1.30 min …(,_,">
Estimating initial noise spectra
58/ 58 sec …(,_,">
CurrentResolution= 60.2998 Angstroms, which requires orientationSampling of at least 18.9474 degrees for a particle of diameter 360 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 580608
OrientationalSampling= 15 NrOrientations= 4608
TranslationalSampling= 2 NrTranslations= 21
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 18579456
OrientationalSampling= 7.5 NrOrientations= 36864
TranslationalSampling= 1 NrTranslations= 84
=============================

Thanks in advance!

igor.kozin · September 25, 2018, 12:15pm

The issue appears to be related to the Compute Mode setting of the cards. The workstation cards were set to Default, while the HPC cluster cards were set to Exclusive_Process. Changing the cluster cards to Default resolved the problem.
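
For anyone hitting the same error, here is a minimal sketch of how the compute mode could be compared between a workstation and a cluster node. It assumes nvidia-smi is on the PATH; the helper name flag_non_default is made up for illustration and is not part of any NVIDIA tool. The script only reports GPUs that are not in Default mode. Changing the mode needs root and is usually a decision for the cluster admins, so that command is left as a comment.

```shell
#!/bin/sh
# flag_non_default is a hypothetical helper (not an NVIDIA tool).
# It reads "index, name, compute_mode" lines (nvidia-smi CSV, no header)
# on stdin and prints every GPU whose compute mode is not "Default".
flag_non_default() {
    awk -F', *' 'NF >= 3 {
        mode = $3
        sub(/[ \t\r]+$/, "", mode)   # drop trailing whitespace, if any
        if (mode != "Default")
            printf "GPU %s (%s): %s\n", $1, $2, mode
    }'
}

# Only query the driver when nvidia-smi is actually present on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,compute_mode --format=csv,noheader \
        | flag_non_default
fi

# To switch GPU 0 back to Default (requires root):
#   nvidia-smi -i 0 -c DEFAULT
```

No output means every GPU on the node is already in Default mode. The same query can be run through the container to confirm it sees the same settings as the host.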

underoath006 · October 6, 2020, 7:33pm

Hi, I’m new to Singularity. I built the same image with singularity build --name my_relion.simg docker://nvcr.io/hpc/relion:3.1.0

I tried both ./my_relion.simg and singularity run --nv my_relion.simg, and in each case it says the relion command is not found. I would appreciate your input!
