Hi,
I am trying to run an AMBER molecular dynamics application on 2 CUDA cards in parallel. My OS is Ubuntu 10.04.4 LTS. When I checked for CUDA-capable devices using lspci | grep -i nvidia, I got:
lspci | grep -i nvidia
14:00.0 3D controller: nVidia Corporation Device 1091 (rev a1)
15:00.0 3D controller: nVidia Corporation Device 1091 (rev a1)
The output of nvcc -V is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
When I ran nvidia-smi, I got:
+------------------------------------------------------+
| NVIDIA-SMI 4.304.84 Driver Version: 304.84 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M2090 | 0000:14:00.0 Off | 0 |
| N/A N/A P0 78W / 225W | 0% 9MB / 5375MB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M2090 | 0000:15:00.0 Off | 0 |
| N/A N/A P0 77W / 225W | 0% 9MB / 5375MB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
So it looks like all the CUDA-capable devices in the machine are being detected.
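To double-check that the devices are not just detected but also accessible from an ordinary user process, I could also verify the device nodes exist and are world read/writable (on this driver generation they are created lazily, e.g. as a side effect of running nvidia-smi as root, so a missing or root-only /dev/nvidia* is a plausible failure mode -- this check is just a sketch of that idea):

```shell
# List the NVIDIA device nodes; expect /dev/nvidia0, /dev/nvidia1 and
# /dev/nvidiactl with crw-rw-rw- permissions.  If nothing is listed,
# the nodes were never created for this session.
ls -l /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* device nodes found"
```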
When I run an AMBER job on a single GPU card (pmemd.cuda), the process runs successfully. The output of nvidia-smi during the run is:
Sat Jan 30 01:23:11 2016
+------------------------------------------------------+
| NVIDIA-SMI 4.304.84 Driver Version: 304.84 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M2090 | 0000:14:00.0 Off | 0 |
| N/A N/A P0 184W / 225W | 26% 1396MB / 5375MB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M2090 | 0000:15:00.0 Off | 0 |
| N/A N/A P0 78W / 225W | 0% 10MB / 5375MB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 4566 pmemd.cuda 1383MB |
+-----------------------------------------------------------------------------+
But when I try to run the job in parallel on both GPU cards (pmemd.cuda.MPI), I get this error message:
cudaGetDeviceCount failed no CUDA-capable device is detected
cudaGetDeviceCount failed no CUDA-capable device is detected
rank 1 in job 5 mambo_35283 caused collective abort of all ranks
exit status of rank 1: return code 255
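Since the error comes from cudaGetDeviceCount inside the MPI ranks, I could probe what environment each rank actually receives from the launcher (the ranks may not inherit CUDA_VISIBLE_DEVICES or see the device nodes the way my interactive shell does). A minimal sketch, assuming "mpirun -np 2" stands in for whatever launcher actually starts pmemd.cuda.MPI; if no launcher is on PATH it falls back to a single local probe:

```shell
# Per-rank probe: report hostname, CUDA_VISIBLE_DEVICES as seen by the
# rank, and whether the /dev/nvidia* nodes are visible to it.
probe='echo "host=$(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"; ls /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* nodes visible"'
if command -v mpirun >/dev/null 2>&1; then
    mpirun -np 2 sh -c "$probe"
else
    sh -c "$probe"
fi
```

If the ranks report "no /dev/nvidia* nodes visible" or an unexpected CUDA_VISIBLE_DEVICES while my login shell sees both cards, that would point at the launcher environment rather than the CUDA installation itself.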
I posted the problem to the AMBER mailing list, but since no reply came, I suspect the source of the problem lies somewhere in the CUDA installation.
What could be wrong here?
Thanks