I’ve been trying to get a Singularity image of the Relion Docker container (from NVIDIA NGC) working on our HPC cluster with P100 and V100 nodes.
I was able to create the image (it’s very easy) and even to run it on my workstation with 2x 1080Ti. No problems.
But when I run it on the cluster I get:
ERROR: all CUDA-capable devices are busy or unavailable in /opt/relion-sm70/src/gpu_utils/cuda_projector.cu at line 115 (error-code 46)
which is not true. Even if I run nvidia-smi through the very same image, it tells me all four GPUs are free. The same error happens on both P100 and V100 nodes. My workstation isn’t really much different from the cluster environment. Both run CentOS 7 with the same driver version:
NVIDIA-SMI 396.26 Driver Version: 396.26
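One thing that could be worth checking here (an assumption, nothing above confirms it) is whether the cluster GPUs are set to a compute mode other than Default, since nvidia-smi can report a GPU as free while the mode still restricts which processes may create a context on it. Below is a minimal Python sketch that asks nvidia-smi for the compute mode of each GPU. The helper names parse_compute_modes, restricted_gpus and main are made up for illustration; the only external call is nvidia-smi with its compute_mode query field.

```python
import csv
import io
import shutil
import subprocess

# nvidia-smi query: one CSV row per GPU with index, name and compute mode.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,compute_mode",
    "--format=csv,noheader",
]


def parse_compute_modes(csv_text):
    """Parse 'index, name, compute_mode' rows into (int, str, str) tuples."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        index, name, mode = (field.strip() for field in row)
        rows.append((int(index), name, mode))
    return rows


def restricted_gpus(rows):
    """Return the GPUs whose compute mode is anything other than Default."""
    return [row for row in rows if row[2] != "Default"]


def main():
    # Assumption: nvidia-smi is on PATH in the environment being tested.
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
        return
    result = subprocess.run(QUERY, capture_output=True, text=True)
    rows = parse_compute_modes(result.stdout)
    for index, name, mode in rows:
        print(f"GPU {index} ({name}): compute mode = {mode}")
    for index, name, mode in restricted_gpus(rows):
        print(f"GPU {index} is not in Default mode ({mode})")


if __name__ == "__main__":
    main()
```

Running this on the workstation and on a cluster node, both on the host and through the image, would show whether the compute mode is one of the differences between the two environments.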
Any hints are most welcome. I can send the full output if necessary. The error happens straight after this output:
Running CPU instructions in double precision.
- On host gpu03.prv.davros.compute.estate: free scratch space = 539 Gb.
Copying particles to scratch directory: /scratch/tmp.14368/relion_volatile/
1.30/1.30 min …(,_,">(,_,">
Estimating initial noise spectra
58/ 58 sec …
CurrentResolution= 60.2998 Angstroms, which requires orientationSampling of at least 18.9474 degrees for a particle of diameter 360 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 580608
OrientationalSampling= 15 NrOrientations= 4608
TranslationalSampling= 2 NrTranslations= 21
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 18579456
OrientationalSampling= 7.5 NrOrientations= 36864
TranslationalSampling= 1 NrTranslations= 84
=============================
Thanks in advance!