Problem while running parallel cuda process in AMBER

a_gs · January 29, 2016, 7:44pm

Hi,

I am trying to run an application in the AMBER molecular dynamics package on two CUDA cards in parallel. My OS is Ubuntu 10.04.4 LTS. When I check for CUDA-capable devices using lspci | grep -i nvidia, I get:

lspci | grep -i nvidia

14:00.0 3D controller: nVidia Corporation Device 1091 (rev a1)
15:00.0 3D controller: nVidia Corporation Device 1091 (rev a1)

The output of nvcc -V is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

When I run nvidia-smi, I get:

+------------------------------------------------------+                       
| NVIDIA-SMI 4.304.84   Driver Version: 304.84         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2090              | 0000:14:00.0     Off |                    0 |
| N/A   N/A    P0    78W / 225W |   0%    9MB / 5375MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2090              | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P0    77W / 225W |   0%    9MB / 5375MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

So I guess all the CUDA-capable devices in the machine are being detected.

When I run an AMBER application on a single GPU card (pmemd.cuda), the process runs successfully. The output of nvidia-smi during that run is:

Sat Jan 30 01:23:11 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 4.304.84   Driver Version: 304.84         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2090              | 0000:14:00.0     Off |                    0 |
| N/A   N/A    P0   184W / 225W |  26% 1396MB / 5375MB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2090              | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P0    78W / 225W |   0%   10MB / 5375MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0      4566  pmemd.cuda                                          1383MB  |
+-----------------------------------------------------------------------------+

But when I try to run the process in parallel on both GPU cards (pmemd.cuda.MPI), I get this error message:

cudaGetDeviceCount failed no CUDA-capable device is detected
cudaGetDeviceCount failed no CUDA-capable device is detected
rank 1 in job 5 mambo_35283 caused collective abort of all ranks
exit status of rank 1: return code 255

I posted the problem to the AMBER mailing list, but since no reply came, I guess the source of the problem lies somewhere in the CUDA installation.

What could be wrong here?

Thanks

Robert_Crovella · January 30, 2016, 5:52am

What is the complete MPI command line you are using to launch the AMBER executable?

a_gs · January 30, 2016, 11:02am

The command line I am using to run the CUDA MPI process is:

mpirun -np 2 pmemd.cuda.MPI -O -i input_file -p mut.prmtop -c restart_file -o test.out -r test.rst -x test.crd

Before running this command, I set the environment variable CUDA_VISIBLE_DEVICES to 0,1:

export CUDA_VISIBLE_DEVICES=0,1
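
One thing that can be checked at this point is whether the exported value is actually what each launched process sees, since the environment inside an MPI rank can differ from the launching shell depending on the launcher. The following is only a minimal sketch, assuming a POSIX shell; report_cuda_env is a hypothetical helper name and not part of AMBER, CUDA, or any MPI distribution.

```shell
# Hypothetical helper: print what the current process sees for CUDA_VISIBLE_DEVICES.
report_cuda_env() {
    echo "host=$(uname -n) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
}

export CUDA_VISIBLE_DEVICES=0,1
report_cuda_env   # what the launching shell sees

# The same check from inside each rank (assumes mpirun is on PATH):
# mpirun -np 2 sh -c 'echo "host=$(uname -n) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"'
```

If the commented mpirun line reports "unset" or a different host name for either rank, the ranks are not running in the environment that was set up in the shell.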

Robert_Crovella · January 30, 2016, 3:50pm

You might want to try creating a machine/host file that explicitly calls out that both MPI ranks are to be launched on the same node:

mpirun -hostfile ~/.amber.hosts.2 ...

where your ~/.amber.hosts.2 is something like:

localhost
localhost
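
A host file like the one described above could be generated along these lines. This is a sketch under assumptions: write_hostfile is a hypothetical helper, and the demonstration writes to a temporary file rather than to ~/.amber.hosts.2 so that nothing in the home directory is overwritten.

```shell
# write_hostfile PATH N: hypothetical helper that writes N "localhost" lines
# to PATH, one line per MPI rank to be placed on this node.
write_hostfile() {
    path=$1
    n=$2
    : > "$path"                 # create or truncate the file
    i=0
    while [ "$i" -lt "$n" ]; do
        echo localhost >> "$path"
        i=$((i + 1))
    done
}

# Demonstrate on a temporary file; for real use pass ~/.amber.hosts.2 instead.
demo_hostfile=$(mktemp)
write_hostfile "$demo_hostfile" 2
cat "$demo_hostfile"
```

The path given to the launcher then has to match the file name exactly, character for character.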

a_gs · January 31, 2016, 1:40pm

Thanks, I tried this, but I am still getting the same error, now repeated twice, like this:

cudaGetDeviceCount failed no CUDA-capable device is detected
cudaGetDeviceCount failed no CUDA-capable device is detected
rank 1 in job 10  mambo_35283   caused collective abort of all ranks
  exit status of rank 1: return code 255 
rank 0 in job 10  mambo_35283   caused collective abort of all ranks
  exit status of rank 0: return code 255

I used this command:

mpirun -machinefile ~/.amber.host.2 -np 2 pmemd.cuda.MPI -O -i input_file -p prm.prmtop -c restart.rst

I must mention that some time back I was able to run the CUDA MPI process successfully. This problem popped up suddenly and I am unable to figure out what is wrong. The CUDA and kernel versions seem to be consistent, all drivers are in place, and nothing seems to be wrong with the nvidia-smi and deviceQuery output.

Robert_Crovella · January 31, 2016, 3:25pm

So the deviceQuery output looks OK?

What happens if you run deviceQuery as an MPI job with one process?
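
The comparison suggested here, the same binary launched directly and then under the MPI launcher, might be scripted roughly as follows. This is a sketch only: the location of the deviceQuery binary is a placeholder (it is not given in the thread), mpirun is assumed to be on PATH, and run_check is a hypothetical helper.

```shell
# Assumption: DEVICEQUERY points at a built deviceQuery binary from the CUDA
# samples. The default below is a placeholder, not a path from the thread.
DEVICEQUERY="${DEVICEQUERY:-./deviceQuery}"

# run_check LABEL CMD [ARGS...]: run a command quietly and report its exit status.
run_check() {
    label=$1
    shift
    if "$@" > /dev/null 2>&1; then
        status=0
        echo "$label: OK"
    else
        status=$?
        echo "$label: FAILED (exit status $status)"
    fi
    return "$status"
}

if command -v mpirun > /dev/null 2>&1 && [ -x "$DEVICEQUERY" ]; then
    run_check "direct launch" "$DEVICEQUERY" || true
    run_check "mpirun -np 1" mpirun -np 1 "$DEVICEQUERY" || true
else
    echo "skipped: mpirun or $DEVICEQUERY not found"
fi
```

If the direct launch succeeds and the single-rank MPI launch fails, the difference lies in how the launcher starts the process rather than in the application being run.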
