Issue when using multiple GPUs

Hi,

I am seeing the error below. The problem seems to lie outside my code, since I am compiling with “-ta=tesla:cc35” (which is reflected in the message itself), and the same code works fine on other machines with this setup.

Cheers,

Karl

mpirun -np 4 ../../bin/nvidia.tesla_cc35_cuda55 > test
The accelerator does not match the profile for which this program was compiled
Current file:     /home-2/kwilkinson/ONETEP_3.5.9.11/devel/src/kinetic_mod.F90
Current function: kinetic_gpu_app2_func_batch
Current line:     669
Current region was compiled for:
NVIDIA Tesla GPU sm30 sm35
Available accelerators:
device[1]: NVIDIA Tesla GPU 1, compute capability 3.5
device[2]: NVIDIA Tesla GPU 2, compute capability 3.5
device[3]: NVIDIA Tesla GPU 3, compute capability 3.5
device[4]: NVIDIA Tesla GPU 4, compute capability 3.5
device[5]: NVIDIA Tesla GPU 5, compute capability 3.5
device[6]: NVIDIA Tesla GPU 6, compute capability 3.5
device[7]: Native X86 (CURRENT DEVICE)

Hi Karl,

How are you assigning MPI processes to GPUs? It looks like the code is trying to run on device 7, which is the CPU:

device[7]: Native X86 (CURRENT DEVICE)
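
As a quick check, it might help to have each rank print which device the OpenACC runtime has actually selected before the offending region runs. A rough sketch (assuming your compiler's runtime provides acc_get_device_type and acc_get_device_num; the program name and output wording are just illustrative):

program check_device
  use mpi
  use openacc
  implicit none
  integer :: rank, ierr, devnum
  integer(acc_device_kind) :: devtype

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Report the device type and NVIDIA device number this rank ended up on.
  devtype = acc_get_device_type()
  devnum  = acc_get_device_num(acc_device_nvidia)
  print '(a,i0,a,i0,a,i0)', 'rank ', rank, ': device type ', devtype, &
        ', NVIDIA device ', devnum

  call MPI_Finalize(ierr)
end program check_device

If a rank reports the host device type, then its device assignment (whether via acc_set_device_num or CUDA_VISIBLE_DEVICES) is not taking effect for that process.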

  • Mat

Hi Mat,

That was my first thought, and I tried setting export CUDA_VISIBLE_DEVICES=0,1,2,3 accordingly, with no joy. I am using acc_set_device_num within the code to map MPI ranks to GPUs.
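
For context, the mapping is roughly of this form (heavily simplified; the routine and variable names here are illustrative, not the actual ONETEP code):

subroutine assign_gpu_to_rank()
  ! Called once per rank after MPI_Init, before any accelerator regions.
  use mpi
  use openacc
  implicit none
  integer :: rank, ierr, ngpus, mydev

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  ngpus = acc_get_num_devices(acc_device_nvidia)

  if (ngpus > 0) then
    ! Round-robin the ranks over whatever GPUs the runtime can see
    ! (the device numbering convention follows what the runtime reports).
    mydev = mod(rank, ngpus)
    call acc_set_device_num(mydev, acc_device_nvidia)
  end if
end subroutine assign_gpu_to_rank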

I also tried using 3 MPI ranks with either 0,1,2 or 1,2,3 visible and saw the same thing in both cases, albeit with one less GPU in the error message list.

BTW, this is on the PSG cluster.

Cheers,

Karl

Hi Karl,

I saw your notes to Adam, and it appears you determined that this was an issue with how you were calling acc_set_device_num.

  • Mat